
shuai ma

Abstract As science communities and commercial organisations increasingly exploit XML as a means of exchanging and disseminating information, the selective exposure of information in XML has become an important issue. In this paper, we present a novel XML security framework developed to enforce a generic, flexible access-control mechanism for XML data management, which supports efficient and secure query access, without revealing sensitive information to unauthorized users.
Abstract Data quality is one of the most important problems in data management. A database system typically aims to support the creation, maintenance, and use of large amounts of data, focusing on the quantity of data. However, real-life data are often dirty: inconsistent, duplicated, inaccurate, incomplete, or stale. Dirty data in a database routinely generate misleading or biased analytical results and decisions, and lead to loss of revenues, credibility and customers.
Methods and apparatus are provided for detecting data inconsistencies. Methods are disclosed for determining whether a set of conditional functional dependencies is consistent; determining a minimal cover of a set of conditional functional dependencies; and detecting a violation of one or more conditional functional dependencies in a set of conditional functional dependencies.
Abstract. We demonstrate iMONDRIAN, a component of the MONDRIAN annotation management system. Distinguishing features of MONDRIAN are (i) the ability to annotate sets of values and (ii) the annotation-aware query algebra. On top of that, iMONDRIAN offers an intuitive visual interface to annotate and query scientific databases.
Abstract This article investigates the question of whether a partially closed database has complete information to answer a query. In practice an enterprise often maintains master data Dm, a closed-world database. We say that a database D is partially closed if it satisfies a set V of containment constraints of the form q(D) ⊆ p(Dm), where q is a query in a language LC and p is a projection query. The part of D not constrained by (Dm, V) is open, from which some tuples may be missing.
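As a rough illustration of a single containment constraint of the form q(D) ⊆ p(Dm), the sketch below checks it on in-memory data. The function name `satisfies_containment`, the sample rows, and the choice of q and p are hypothetical and serve only to make the definition concrete; the abstract does not prescribe any implementation.

```python
def satisfies_containment(db_rows, master_rows, q, p):
    """Return True if q(D) is a subset of p(Dm).

    q and p are plain Python callables standing in for the query q
    and the projection query p (an assumption of this sketch).
    """
    return set(q(db_rows)) <= set(p(master_rows))


# Hypothetical master data Dm: (customer_id, name, country).
master = [(1, "Ann", "UK"), (2, "Bob", "BE")]

# Hypothetical database D: (order_id, customer_id, country).
orders = [(10, 1, "UK"), (11, 2, "BE")]

# q selects the UK customer ids from D; p projects Dm on customer_id.
q = lambda rows: [cid for (_, cid, country) in rows if country == "UK"]
p = lambda rows: [cid for (cid, _, _) in rows]

ok = satisfies_containment(orders, master, q, p)
bad = satisfies_containment(orders + [(12, 3, "UK")], master, q, p)
```

Here `ok` holds because every UK customer id in `orders` occurs in the master data, while `bad` does not because customer 3 is unknown to the master data.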
We show that there is a query expressible in first-order logic over the reals that returns, on any given semi-algebraic set A, for every point, a radius around which A is conical in every small enough box. We obtain this result by combining results from differential topology and real algebraic geometry, with recent algorithmic results by Rannou.
Abstract We present Semandaq, a prototype system for improving the quality of relational data. Based on the recently proposed conditional functional dependencies (CFDs), it detects and repairs errors and inconsistencies that emerge as violations of these constraints.
Abstract We study the satisfiability problem associated with XPath in the presence of DTDs. This is the problem of determining, given a query p in an XPath fragment and a DTD D, whether or not there exists an XML document T such that T conforms to D and the answer of p on T is nonempty. We consider a variety of XPath fragments widely used in practice, and investigate the impact of different XPath operators on the satisfiability analysis.
Abstract We propose a class of constraints, referred to as conditional functional dependencies (CFDs), and study their applications in data cleaning. In contrast to traditional functional dependencies (FDs) that were developed mainly for schema design, CFDs aim at capturing the consistency of data by incorporating bindings of semantically related values. For CFDs we provide an inference system analogous to Armstrong's axioms for FDs, as well as consistency analysis.
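To make the idea of a CFD concrete, the following is a minimal sketch of violation detection for one CFD with a single pattern tuple. It assumes the commonly stated semantics: the embedded FD is enforced only on tuples that match the pattern on the left-hand side, and a constant on the right-hand side must be matched exactly. The names `cfd_violations`, `matches`, and `WILDCARD`, and the sample data, are hypothetical and not taken from the paper.

```python
WILDCARD = "_"  # pattern entry that matches any value


def matches(row, pattern):
    """True if the row agrees with every constant in the pattern."""
    return all(v == WILDCARD or row[a] == v for a, v in pattern.items())


def cfd_violations(rows, lhs, rhs, pattern):
    """Find violations of the CFD (lhs -> rhs, pattern).

    Returns a list of tuples of row indices: a 1-tuple for a row that
    contradicts a right-hand-side constant, and a longer tuple for a
    group of rows that agree on lhs but disagree on rhs.
    """
    lhs_pattern = {a: pattern[a] for a in lhs}
    rhs_const = pattern[rhs]
    violations = []
    groups = {}
    for idx, row in enumerate(rows):
        if not matches(row, lhs_pattern):
            continue  # the CFD says nothing about this row
        if rhs_const != WILDCARD and row[rhs] != rhs_const:
            violations.append((idx,))
        key = tuple(row[a] for a in lhs)
        groups.setdefault(key, []).append(idx)
    for idxs in groups.values():
        if len({rows[i][rhs] for i in idxs}) > 1:
            violations.append(tuple(idxs))
    return violations


# Hypothetical data: within country code 44, zip determines street.
rows = [
    {"country": "44", "zip": "EH4", "street": "Mayfield"},
    {"country": "44", "zip": "EH4", "street": "Crichton"},
    {"country": "01", "zip": "07974", "street": "Mtn Ave"},
    {"country": "01", "zip": "07974", "street": "Main St"},
]
pattern = {"country": "44", "zip": WILDCARD, "street": WILDCARD}
found = cfd_violations(rows, ["country", "zip"], "street", pattern)
```

In this sketch `found` contains only the pair of rows 0 and 1: rows 2 and 3 also disagree on street, but they fall outside the pattern `country = 44`, which is what distinguishes a CFD from a plain FD.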
When transforming data one often wants certain information in the data source to be preserved, i.e., we identify parts of the source data and require these parts to be transformed without loss of information. We characterize the preservation of selected information in terms of the notions of invertibility and query preservation, in a setting where transformations are specified as a view V (a set of queries), and source information is selected by a query Q.
Abstract This paper investigates three problems identified in [1] for annotation propagation, namely, the view side-effect, source side-effect, and annotation placement problems. Given annotations entered for a tuple or an attribute in a view, these problems ask what tuples or attributes in the source have to be annotated to produce the view annotations. As observed in [1], these problems are fundamental not only for data provenance but also for the management of view updates.
Abstract We address a fundamental question concerning spatio-temporal database systems: “What are exactly spatio-temporal queries?” We define spatio-temporal queries to be computable mappings that are also generic, meaning that the result of a query may only depend to a limited extent on the actual internal representation of the spatio-temporal data.
Abstract Databases in real life are often neither entirely closed-world nor entirely open-world. Indeed, databases in an enterprise are typically partially closed, in which a part of the data is constrained by master data that contains complete information about the enterprise in certain aspects [21]. It has been shown that despite missing tuples, such a database may turn out to have complete information for answering a query [9].
Abstract We consider spatial databases in the plane that can be defined by polynomial constraint formulas. Motivated by applications in geographic information systems, we investigate linear approximations of spatial databases and study in which language they can be expressed effectively.
We consider n-dimensional semi-algebraic spatial databases. We compute in first-order logic extended with a transitive closure operator, a linear spatial database which characterizes the semi-algebraic spatial database up to a homeomorphism. In this way, we generalize our earlier results to semi-algebraic spatial databases in arbitrary dimensions, our earlier results being true for only two dimensions.
This paper addresses the question whether one can determine the connectivity of a semi-algebraic set in three dimensions by looking only at two-dimensional “samples” of the set, where these samples are defined by first-order queries. The question is answered negatively for two classes of first-order queries: cartesian-product-free, and positive one-pass.
The formalism of constraint databases, in which possibly infinite data sets are described by Boolean combinations of polynomial inequality and equality constraints, has its main application area in spatial databases. The standard query language for polynomial constraint databases is first-order logic over the reals. Because of the limited expressive power of this logic with respect to queries that are important in spatial database applications, various extensions have been introduced.
First of all, I would like to thank Jan Van den Bussche for his support during the last four years. His never-ending enthusiasm and advice made this dissertation possible. His intensive use of red ballpoints has, I hope, taught me how to write scientific texts. Further, I am grateful to Bart Kuijpers. His geometrical insights were of great help. I am also grateful to Peter Revesz. The Renumbering Algorithm in Chapter 6 is the result of our joint work. I am also grateful to Bart Goethals.
Historical Background The field of constraint databases was initiated in 1990 in a paper by Kanellakis, Kuper and Revesz [9]. The goal was to obtain a database-style, optimizable version of constraint logic programming. It grew out of the research on DATALOG and constraint logic programming.
We study queries to spatial databases, where spatial data are modeled as semi-algebraic sets, using the relational calculus with polynomial inequalities as a basic query language. We work with the extension of the relational calculus with terminating transitive closures. The main result is that this language can express the linearization of semi-algebraic databases. We also show that the sublanguage with linear inequalities only can express all computable queries on semilinear databases.
Abstract—This paper introduces a new approach for conflict resolution: given a set of tuples pertaining to the same entity, it is to identify a single tuple in which each attribute has the latest and consistent value in the set. This problem is important in data integration, data cleaning and query answering. It is, however, challenging since in practice, reliable timestamps are often absent, among other things.
Recent studies have revealed the need for representations of frequent itemsets that allow supports to be estimated. Several methods have been proposed that achieve this goal by generating only a subset of all frequent itemsets. In this paper, we propose another approach that, given a minimum support threshold, stores only a small portion of the original database from which the supports of frequent itemsets can be estimated.
On the Complexity of Package Recommendation Problems. Ting Deng, School of Computer Science and Engineering, Beihang University, Beijing, China (dengting@act.buaa.edu.cn); Wenfei Fan, Lab. for Foundations of Computer Science, School of Informatics, University of Edinburgh, UK (wenfei@inf.ed.ac.uk); Floris Geerts, Dept.
Abstract A variety of dependency formalisms have been studied for improving data quality. To treat these dependencies in a uniform framework, we propose a simple language, Quality Improving Dependencies (QIDs). We show that previous dependencies considered for data quality can be naturally expressed as QIDs, and that different enforcement mechanisms of QIDs yield various data repairing strategies.
We consider a number of decision problems that appear in the dynamical systems and database literature, concerning the termination of iterates of real functions. These decision problems take a function f: R^n → R^n as input and ask, for example, whether this function is mortal, nilpotent, terminating, or reaches a fixed point on a given point in R^n. We associate topologies to functions f: R^n → R^n and study some basic properties of these topologies.
Abstract We give two efficient on-line algorithms to simplify weighted graphs by eliminating degree-two vertices. Our algorithms are on-line: they react to updates on the data, keeping the simplification up-to-date. We provide both analytical and empirical evaluations of the efficiency of our algorithms. We prove an O(log n) upper bound on the amortized time complexity of our maintenance algorithms, with n the number of insertions. One of our algorithms can handle in logarithmic time the deletions of vertices and edges as well.
We study extensions of first-order logic over the reals with different types of transitive-closure operators as query languages for constraint databases that can be described by Boolean combinations of polynomial inequalities over the reals. We are in particular interested in deciding the termination of the evaluation of queries expressible in these transitive-closure logics. It turns out that termination is undecidable in general.
Abstract We show that the evaluation of SPARQL algebra queries on various notions of annotated RDF graphs can be seen as particular cases of the evaluation of these queries on RDF graphs annotated with elements of so-called spm-semirings. Spm-semirings extend semirings, used for positive relational algebra queries on annotated relational data, with a new operator to capture the semantics of the non-monotone SPARQL operator OPTIONAL.
Abstract A number of languages have been developed for specifying XML publishing, that is, transformations of relational data into XML trees. These languages generally describe the behaviors of a middleware controller that builds an output tree iteratively, issuing queries to a relational source and expanding the tree with the query results at each step. To study the complexity and expressive power of XML publishing languages, this article proposes a notion of publishing transducers, which generate XML trees from relational data.
The constraint database model can be seen as a generalization of the classical relational database model that was introduced by Codd in the 1970s to deal with the management of alpha-numerical data, typically in business applications (Codd, 1970). A relational database can be viewed as a finite collection of tables or relations that each contain a finite number of tuples. Fig. 13.1 shows an instance of a relational database that contains the two relations Beer and Pub.
This paper introduces a model to study the phenomenon of long-range dependence. This model consists of an infinite superposition of independent Markovian ON/OFF sources. A condition for assuring long-range dependence is given and the Hurst parameter together with the correlation decay is derived for a specific example. We also give a physical interpretation of the existing long-range dependence by means of the Ising model.
Abstract: It is known that to a planar spatial database, represented by a semi-algebraic set in the plane, one can associate a structure, here called the “topological canonization”, such that two databases are topologically equivalent if and only if their topological canonizations are isomorphic. The advantage of a topological canonization is that it contains precisely the information one needs if one is only interested in topological properties of the spatial data. In this paper we represent semi-algebraic sets using plane graph structures.
Abstract To experimentally validate learning and approximation algorithms for XML Schema Definitions (XSDs), we need algorithms to generate uniformly at random a corpus of XSDs as well as a similarity measure to compare how closely the generated XSD resembles the target schema. In this paper, we provide the formal foundation for such a testbed. We adopt similarity measures based on counting the number of common and different trees in the two languages, and we develop the necessary machinery for computing them.
Abstract This paper revisits the analysis of annotation propagation from source databases to views defined in terms of conjunctive (SPJ) queries. Given a source database D, an SPJ query Q, the view Q(D) and a tuple ΔV in the view, the view (resp. source) side-effect problem is to find a minimal set ΔD of tuples such that the deletion of ΔD from D results in the deletion of ΔV from Q(D) while minimizing the side effects on the view (resp. the source).
Abstract Data in real-life databases become obsolete rapidly. One often finds that multiple values of the same entity reside in a database. While all of these values were once correct, most of them may have become stale and inaccurate. Worse still, the values often do not carry reliable timestamps. With this comes the need for studying data currency, to identify the current value of an entity in a database and to answer queries with the current values, in the absence of timestamps.
Abstract In this paper we study the correlation structure of the output process of an ATM multiplexer. We consider two special cases: (i) the output process of the D-BMAP/D/1/N queue, a generic model for an ATM multiplexer and (ii) a process which results from a renewal process which shares the output link of a multiplexer with other connections. Both output processes belong to the versatile class of discrete-time Markovian arrival processes (D-MAPs).
Abstract This paper investigates the discovery of conditional functional dependencies (CFDs). CFDs are a recent extension of functional dependencies (FDs) by supporting patterns of semantically related constants, and can be used as rules for cleaning relational data. However, finding quality CFDs is an expensive process that involves intensive manual effort. To effectively identify data cleaning rules, we develop techniques for discovering CFDs from relations.
Abstract In the context of mining for frequent patterns using the standard level-wise algorithm, the following question arises: given the current level and the current set of frequent patterns, what is the maximal number of candidate patterns that can be generated on the next level? We answer this question by providing a tight upper bound, derived from a combinatorial result by J. Kruskal (1963) and G. Katona (1968). Our result is useful for reducing the number of database scans
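The abstract does not state the bound itself. As a sketch, and assuming the usual Kruskal-Katona formulation, one can write the number of frequent k-patterns in its cascade (k-canonical) form C(m_k, k) + C(m_{k-1}, k-1) + ... and bound the number of (k+1)-candidates by C(m_k, k+1) + C(m_{k-1}, k) + .... The function names `cascade` and `max_candidates` are hypothetical.

```python
from math import comb


def cascade(count, k):
    """Greedy cascade representation of count as a sum of C(m_i, i),
    for i = k, k-1, ..., with strictly decreasing m_i."""
    terms = []
    i = k
    while count > 0 and i >= 1:
        m = i  # C(i, i) = 1 <= count, so m = i always fits
        while comb(m + 1, i) <= count:
            m += 1
        terms.append((m, i))
        count -= comb(m, i)
        i -= 1
    return terms


def max_candidates(num_frequent, k):
    """Upper bound on the number of (k+1)-candidates whose k-subsets
    all lie among num_frequent frequent k-patterns."""
    return sum(comb(m, i + 1) for m, i in cascade(num_frequent, k))


# With 5 frequent 2-itemsets, at most 2 candidate 3-itemsets can exist.
bound = max_candidates(5, 2)
```

For example, 6 frequent pairs can be all pairs over 4 items, which yields 4 candidate triples, and the sketch returns 4 in that case. Note that `math.comb(n, r)` evaluates to 0 when r exceeds n, which the sum relies on.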
We consider two-dimensional spatial databases defined in terms of polynomial inequalities and focus on the potential of programming languages for such databases to express queries related to topological connectivity. It is known that the topological connectivity test is not first-order expressible.
Moving objects are currently represented in databases by means of an explicit representation of their trajectory. However, from a physical point of view, or more specifically according to Newton's second law of motion, a moving object is fully described by its equation of motion. We introduce a new data model for moving objects in which a trajectory is represented by a differential equation. A similar approach is taken in computer animation where this is known as physically based modeling.
Background Haplotypes extracted from human DNA can be used for gene mapping and other analysis of genetic patterns within and across populations. A fundamental problem is, however, that current practical laboratory methods do not give haplotype information. Estimation of phased haplotypes of unrelated individuals given their unphased genotypes is known as the haplotype reconstruction or phasing problem.
Abstract. Real-life data is often dirty and costs billions of pounds to businesses worldwide each year. This paper presents a promising approach to improving data quality. It effectively detects and fixes inconsistencies in real-life data based on conditional dependencies, an extension of database dependencies by enforcing bindings of semantically related data values.
Abstract A schema-mapping is a high-level specification of a data-exchange setting where a set of source-to-target dependencies is used to realize basic operations from source to target relations (such as copy, selection, join or union) while the target schema is subject to a set of target constraints (such as inclusion dependencies or key constraints).
Abstract The paper investigates fundamental decision problems and composition synthesis for Web services commonly found in practice. We propose a notion of synthesized Web services (ASTs) to specify the behaviors of the services. Upon receiving a sequence of input messages, an AST issues multiple queries to a database and generates actions, in parallel; it produces external messages and database updates by synthesizing the actions generated in parallel.
The relational model has recently been extended to so-called K-relations in which tuples are assigned a unique value in a semiring K. A query language, denoted by RAK+, similar to the classical positive relational algebra, allows for the querying of K-relations. In this paper, we define more expressive query languages for K-relations that extend RAK+ with the difference and constant annotations operations on annotated tuples. The latter are natural extensions of the duplicate elimination operator of the relational algebra on bags.
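As a small illustration of querying K-relations, the sketch below annotates tuples with elements of a semiring and implements three positive operators in the usual way: union and projection combine annotations with the semiring addition, and product combines them with the semiring multiplication. The representation (a dict from tuples to annotations) and the names `Semiring`, `BAG`, `BOOL`, `union`, `project`, and `product` are assumptions of this sketch. The difference and constant-annotation operators studied in the paper are not covered.

```python
import operator
from collections import namedtuple

# A semiring given by its two identities and two binary operations.
Semiring = namedtuple("Semiring", "zero one plus times")

BAG = Semiring(0, 1, operator.add, operator.mul)            # multiplicities
BOOL = Semiring(False, True, operator.or_, operator.and_)   # set semantics


def union(K, r, s):
    """Annotations of tuples present in both inputs are added."""
    out = dict(r)
    for t, a in s.items():
        out[t] = K.plus(out.get(t, K.zero), a)
    return out


def project(K, r, positions):
    """Tuples that collapse onto the same projection are added."""
    out = {}
    for t, a in r.items():
        key = tuple(t[i] for i in positions)
        out[key] = K.plus(out.get(key, K.zero), a)
    return out


def product(K, r, s):
    """Annotations of combined tuples are multiplied."""
    return {t + u: K.times(a, b) for t, a in r.items() for u, b in s.items()}


# Hypothetical bag relation: ("a", "b") occurs twice, ("a", "c") three times.
r = {("a", "b"): 2, ("a", "c"): 3}
projected = project(BAG, r, [0])
paired = product(BAG, r, {("x",): 4})
```

With the bag semiring, `projected` maps `("a",)` to 5, the sum of the two multiplicities. Swapping in `BOOL` gives ordinary set semantics with the same operator definitions.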
Annotated relational databases can be queried either by simply making the annotations explicitly available alongside the ordinary data, or by adapting the standard query operators so that they have an implicit effect also on the annotations. We compare the expressive power of these two approaches. As a formal model for the implicit approach we propose the color algebra, an adaptation of the relational algebra to deal with the annotations.
Abstract Integrity constraints, a.k.a. data dependencies, are being widely used for improving the quality of schema. Recently constraints have enjoyed a revival for improving the quality of data. The tutorial aims to provide an overview of recent advances in constraint-based data cleaning.
