Determining the currency of data

shuai ma


2011

Wenfei Fan (University of Edinburgh & Harbin Institute of Technology), wenfei@inf.ed.ac.uk
Floris Geerts (School of Informatics, University of Edinburgh), fgeerts@inf.ed.ac.uk
Jef Wijsen (Institut d’Informatique, Université de Mons), jef.wijsen@umons.ac.be

Abstract

Data in real-life databases become obsolete rapidly. One often finds that multiple values of the same entity reside in a database. While all of these values were once correct, most of them may have become stale and inaccurate. Worse still, the values often do not carry reliable timestamps. With this comes the need for studying data currency, to identify the current value of an entity in a database and to answer queries with the current values, in the absence of timestamps.

This paper investigates the currency of data. (1) We propose a model that specifies partial currency orders in terms of simple constraints. The model also allows us to express what values are copied from other data sources, bearing currency orders in those sources, in terms of copy functions defined on correlated attributes. (2) We study fundamental problems for data currency, to determine whether a specification is consistent, whether a value is more current than another, and whether a query answer is certain no matter how partial currency orders are completed. (3) Moreover, we identify several problems associated with copy functions, to decide whether a copy function imports sufficient current data to answer a query, whether such a function copies redundant data, whether a copy function can be extended to import necessary current data for a query while respecting the constraints, and whether it suffices to copy data of a bounded size. (4) We establish upper and lower bounds of these problems, all matching, for combined complexity and data complexity, and for a variety of query languages. We also identify special cases that warrant lower complexity.
Categories and Subject Descriptors: H.2.3 [Information Systems]: Database Management – Languages; F.4.1 [Mathematical Logic and Formal Languages]: Mathematical Logic – Computational Logic

General Terms: Languages, Theory, Design.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. PODS’11, June 13–15, 2011, Athens, Greece. Copyright 2011 ACM 978-1-4503-0660-7/11/06 ...$10.00.

1. Introduction

The quality of data in a real-life database quickly degenerates over time. Indeed, it is estimated that “2% of records in a customer file become obsolete in one month” [15]. That is, in a database of 500 000 customer records, 10 000 records may go stale per month, 120 000 records per year, and within two years about 50% of all the records may be obsolete. In light of this, we often find that multiple values of the same entity reside in a database, which were once correct, i.e., they were true values of the entity at some time. However, most of them have become obsolete and inaccurate. As an example from daily life, when one moves to a new address, her bank may retain her old address, and worse still, her credit card bills may still be sent to her old address for quite some time (see, e.g., [22] for more examples). Stale data is one of the central problems to data quality. It is known that dirty data costs US businesses 600 billion USD each year [15], and stale data accounts for a large part of the losses.

This highlights the need for studying the currency of data, which aims to identify the current values of entities in a database, and to answer queries with the current values.

The question of data currency would be trivial if all data values carried valid timestamps. In practice, however, one often finds that timestamps are unavailable or imprecise [34]. Add to this the complication that data values are often copied or imported from other sources [2, 12, 13], which may not support a uniform scheme of timestamps. These make it challenging to identify the current values.

Not all is lost. One can often deduce currency orders from the semantics of the data. Moreover, data copied from other sources inherit currency orders from those sources. Taken together, these often provide sufficient current values of the data to answer certain queries, as illustrated below.

(a) Relation Emp

       FN      LN      address      salary  status
  s1:  Mary    Smith   2 Small St   50k     single
  s2:  Mary    Dupont  10 Elm Ave   50k     married
  s3:  Mary    Dupont  6 Main St    80k     married
  s4:  Bob     Luth    8 Cowan St   80k     married
  s5:  Robert  Luth    8 Drum St    55k     married

(b) Relation Dept

       dname  mgrFN  mgrLN   mgrAddr      budget
  t1:  R&D    Mary   Smith   2 Small St   6500k
  t2:  R&D    Mary   Smith   2 Small St   7000k
  t3:  R&D    Mary   Dupont  6 Main St    6000k
  t4:  R&D    Ed     Luth    8 Cowan St   6000k

Figure 1: A company database

Example 1.1: Consider two relations of a company shown in Fig. 1. Each Emp tuple is an employee record with name, address, salary and marital status. A Dept tuple specifies the name, manager and budget of a department. Records in these relations may be stale, and do not carry timestamps. By entity identification techniques (see, e.g., [16]), we know that tuples s1, s2 and s3 refer to the same employee Mary, but s4 and s5 represent different people distinct from Mary. Consider the following queries posed on these relations.

(1) Query Q1 is to find Mary’s current salary. No timestamps are available for us to tell which of 50k or 80k is more current. However, we may know that the salary of each employee in the company does not decrease, as commonly found in real world.
This yields currency orders s1 ≺_salary s3 and s2 ≺_salary s3, i.e., s3[salary] is more current than both s1[salary] and s2[salary]. Hence the answer to Q1 is 80k.

(2) Query Q2 is to find Mary’s current last name. We can no longer answer Q2 as above. Nonetheless, we may know the following: (a) marital status can only change from single to married and from married to divorced; but not from married to single; and (b) Emp tuples with the most current marital status also contain the most current last name. Therefore, s1 ≺_LN s2 and s1 ≺_LN s3, and the answer to Q2 is Dupont.

(3) Query Q3 is to find Mary’s current address. We may know that Emp tuples with the most current status or salary contain the most current address. Putting this and (1) above together, we know that the answer to Q3 is “6 Main St”.

(4) Finally, query Q4 is to find the current budget of department R&D. Again no timestamps are available for us to evaluate the query. However, we may know the following: (a) Dept tuples t1 and t2 have copied their mgrAddr values from s1[address] in Emp; similarly, t3 has copied from s3, and t4 from s4; and (b) in Dept, tuples with the most current address also have the most current budget. Taken together, these tell us that t1 ≺_budget t3 and t2 ≺_budget t3. Observe that we do not know which budget in t3 or t4 is more current. Nevertheless, in either case the most current budget is 6000k, and hence it is the answer to Q4. ✷

These suggest that we give a full treatment of data currency, and answer the following questions. How should we specify currency orders on data values in the absence of timestamps but in the presence of copy relationships? When currency orders are only partly available, can we decide whether an attribute value is more up-to-date than another? How can we answer a query with only current data in a database? To answer a query, do we need to import current data from another source, and what to copy? The ability to answer these questions may provide guidance for practitioners to decide, e.g., whether the answer to a query is corrupted by stale data, or what copy functions are needed.

A model for data currency. To answer these questions, we approach data currency based on the following.

(1) For each attribute A of a relation D, we assume an (implicit) currency order ≺_A on its tuples such that for tuples t1 and t2 in D that represent the same real-world entity, t1 ≺_A t2 indicates that t2 is more up-to-date than t1 in the A attribute value. Here ≺_A is not a total order since in practice, currency information is only partially available. Note that for distinct attributes A and B, we may have t1 ≺_A t2 and t2 ≺_B t1, i.e., there may be no single tuple that is most up-to-date in all attribute values.

(2) We express additional currency relationships as denial constraints [3, 7], which are simple universally quantified FO sentences that have been used to improve the consistency of data. We show that the same class of constraints also suffices to express currency semantics commonly found in practice. For instance, all the currency relations we have seen in Example 1.1 can be expressed as denial constraints.

(3) We define a copy relationship from relation D_j to D_k in terms of a partial mapping, referred to as a copy function. It specifies what attribute values in D_j have been copied from D_k along with their currency orders in D_k. It also assures that correlated attributes are copied together. As observed in [2, 12, 13], copy functions are common in real world, and can be automatically discovered.

Putting these together, we consider D = (D_1, ..., D_n), a collection of relations such that (a) each D_j has currency orders partially defined on its tuples for each attribute, indicating available currency information; (b) each D_j satisfies a set Σ_j of denial constraints, which expresses currency orders derived from the semantics of the data; and (c) for each pair D_j, D_k of relations, there are possibly copy functions defined on them, which import values from one to another.

We study consistent completions D_j^c of D_j, which extend ≺_A in D_j to a total order on all tuples pertaining to the same entity, such that D_j^c satisfies Σ_j and constraints imposed by the copy functions. One can construct from D_j^c the current tuple for each entity w.r.t. ≺_A, which contains the entity’s most current A value for each attribute A. This yields the current instance of D_j^c consisting of only the current tuples of the entities in D_j, from which currency orders are removed. We evaluate a query Q on current instances of relations in D, without worrying about currency orders. We study certain current answers to Q in D, i.e., tuples that are the answers to Q in all consistent completions of D.

Reasoning about data currency. We study fundamental problems for data currency.
(a) The consistency problem is to determine, given denial constraints Σ_j imposed on each D_j and copy functions between these relations, whether there exist consistent completions of every D_j, i.e., whether the specification makes sense.
(b) The certain ordering problem is to decide whether a currency order is contained in all consistent completions.
(c) The deterministic current instance problem is to determine whether the current instance of each relation remains unchanged for all consistent completions. The ability to answer these questions allows us to determine whether an attribute value is certainly more current than another, and to identify the current value of an entity.
(d) The certain current query answering problem is to decide whether a tuple t is a certain current answer to a query Q, i.e., it is certainly computed using current data.

Currency preserving copy functions. It is natural to ask what values should be copied from one data source to another in order to answer a query. To characterize this intuition we introduce a notion of currency preservation.
Consider data sources D = (D_1, ..., D_p) and D′ = (D′_1, ..., D′_q), each consisting of a collection of relations with denial constraints imposed on them. Consider copy functions ρ from relations in D′ to those in D. For a query Q posed on D, we say that ρ is currency preserving if no matter how we extend ρ by copying from D′ more values of those entities in D, the certain current answers to Q in D remain unchanged. In other words, ρ has already imported data values needed for computing certain current answers to Q.

We identify several problems associated with currency-preserving copy functions.
(a) The currency preservation problem is to determine, given Q, ρ, D, D′ and their denial constraints, whether ρ is currency preserving for Q. Intuitively, we want to know whether we need to extend ρ in order to answer Q.
(b) The minimal copying problem is to decide whether ρ is minimal among all currency-preserving copy functions for Q, i.e., ρ copies the least amount of data. This helps us inspect whether ρ copies unnecessary data.
(c) The existence problem is to determine whether ρ can be extended to be currency preserving for Q.
(d) Moreover, the bounded copying problem is to decide whether there exists such an extension that imports additional data of a bounded size. Intuitively, we want to find currency-preserving copy functions that import as few data values as possible.

Complexity results. We provide combined complexity and data complexity of all the problems stated above. For the combined complexity of the problems that involve queries, we investigate the impact of various query languages, including conjunctive queries (CQ), unions of conjunctive queries (UCQ), positive existential FO (∃FO+) and FO. We establish upper and lower bounds of these problems, all matching, ranging over O(1), NP, coNP, Δ^p_2, Π^p_2, Σ^p_2, Δ^p_3, Π^p_3, Σ^p_3, Σ^p_4 and PSPACE. We find that most of the problems are intractable. In light of this, we also identify special practical cases with lower complexity, some in PTIME. We also study the impact of denial constraints. For example, in the absence of denial constraints, the certain current query answering problem is in PTIME for SP queries (CQ queries without “join”), but it becomes intractable when denial constraints are present, even when the constraints are fixed.

This work is a first step towards a systematic study of data currency in the absence of reliable timestamps but in the presence of copy relationships. The results help practitioners specify data currency, analyze query answers and design copy functions. We also provide a complete picture of complexity bounds for important problems associated with data currency and copy functions, which are proved by using a variety of reductions and by providing (PTIME) algorithms.

Related work. There has been a host of work on temporal databases (see, e.g., [8, 30] for surveys). Temporal databases provide support for valid time, transaction time, or both. They assume the availability of timestamps, and refer to “now” by means of current-time variables [9, 14]. Dynamic and temporal integrity constraints allow one to restrict the set of legal database evolutions. Our currency model differs from temporal data models in several respects. We do not assume explicit timestamps. Nevertheless, if such timestamps are present, they can be related to currency by means of denial constraints. Unlike temporal databases that timestamp entire tuples, our model allows that different values within the same tuple have distinct currencies. That is, the same tuple can contain an up-to-date value for one attribute, and an outdated value for another attribute.

Since currency orders are different from temporal orders used in temporal databases, our currency (denial) constraints differ from traditional temporal constraints. Currency constraints can sometimes be derived from temporal constraints. For example, when salaries are constrained to be non-decreasing, we can express that the highest salary is the most current one. Also, our copy functions can require certain attributes to be copied together when these attributes cannot change independently, as for example expressed by the dynamic functional dependencies in [33].

Closer to this work are [31, 24, 25, 20] on querying indefinite data. In [31], the evaluation of CQ queries is studied on data that is linearly ordered but only provides a partial order. The problem studied there is similar to (yet different from) certain current query answering. An extension of conditional tables [19, 21] is proposed in [24] to incorporate indefinite temporal information, and in that setting, the complexity bounds for FO query evaluation are provided in [25]. Recently the non-emptiness problem for datalog on linear orders is investigated in [20]. However, none of these considers copying data from external sources, or the analyses of certain ordering and currency-preserving copy functions. In addition, we answer queries using current instances of relations, which are normal relations without (currency) ordering. This semantics is quite different from its counterparts in previous work. We also consider denial constraints and copy functions, which are not expressible in CQ or datalog studied in [31, 20]. In contrast to our work, [24, 25] assume explicit timestamps, while we use denial constraints to specify data currency. To encode denial constraints in extended conditional tables of [24, 25], an exponential blowup is inevitable. For these reasons, the results of [31, 24, 25, 20] cannot carry over to our setting, and vice versa.

There has also been a large body of work on the temporal constraint satisfaction problem (TCSP), which is to find a valuation of temporal variables that satisfies a set of temporal constraints (see, e.g., [4, 29]). It differs from our consistency problem in that it considers neither completions of currency orders that satisfy denial constraints, nor copy relationships. Hence the results for TCSP are not directly applicable to our consistency problem, and vice versa.

Copy relationships between data sources have recently been studied in [2, 12, 13]. The previous work has focused on automatic discovery of copying dependencies and functions. Copy relationships are also related to data provenance, which studies propagation of annotations in data transformations and updates (see [5, 6] for recent surveys on data provenance). However, to the best of our knowledge, no previous work has studied currency-preserving copy functions and their associated problems.

Denial constraints have proved useful in detecting data inconsistencies and data repairing (see, e.g., [3, 7]). We adopt the same class of constraints to specify the currency of data, so that data currency and consistency could be treated in a uniform logical framework. Denial constraints can also be automatically discovered, along the same lines as data dependency profiling (see, e.g., [17]).

The study of data currency is also related to research on incomplete information (see [32] for a survey), when missing data concerns data currency. In contrast to that line of work, we investigate how to decide whether a value is more current than another, and study the properties of copy functions. We use denial constraints to specify data currency, which are, as remarked earlier, more succinct than, e.g., C-tables and V-tables for representing incomplete information [19, 21]. In addition, we evaluate queries using current instances, a departure from the study of incomplete information.

Certain query answers have been studied in data integration and exchange. In data integration, for a query Q posed on a global database D_G, it is to find the certain answers to Q over all data sources that are consistent with D_G w.r.t. view definitions (see e.g., [27]). In data exchange, it is to find the certain answers to a query over all target databases generated from data sources via schema mapping (see [23]). In contrast, we consider certain answers to a query over all completions of currency orders, which satisfy denial constraints and constraints from copy functions. Certain current query answering is also different from consistent query answering (see, e.g., [3, 7]), which is to find certain answers to a query over all repairs of a database and does not distinguish between stale and current data in the repairs. Finally, whereas it may be possible to model our setting as a data exchange scenario with built-in constraints [11], our complexity results do not follow gratuitously and a careful analysis of the chase is required in this setting.

Organization. Section 2 presents the data currency model. Section 3 states its related problems. Section 4 establishes the complexity bounds of those problems. Section 5 introduces the notion of currency preservation and its fundamental problems, followed by their complexity analysis in Section 6. Section 7 summarizes the main results of the paper.

2. Data Currency

We introduce a model for specifying data currency. A specification consists of (a) partial currency orders, (b) denial constraints, and (c) copy functions. We first present these notions, and then study consistent completions of currency orders. Finally, we show how queries are answered on current instances that are derived from these completions.

Data with partial currency orders. A relation schema is specified as R = (EID, A_1, ..., A_n), where EID denotes entity id that identifies tuples pertaining to the same entity, as introduced by Codd [10]. EID values can be obtained using entity identification techniques (a.k.a. record linkage, matching and data deduplication; see, e.g., [16]). A finite instance D of R is referred to as a normal instance of R.

A temporal instance D_t of R is given as (D, ≺_A1, ..., ≺_An), where each ≺_Ai is a strict partial order on D such that for tuples t1 and t2 in D, t1 ≺_Ai t2 implies t1[EID] = t2[EID]. We call ≺_Ai the currency order for attribute A_i. Recall that a strict partial order is irreflexive and transitive, and therefore asymmetric. Intuitively, if t1 ≺_Ai t2, then t1 and t2 refer to the same entity, and t2 contains a more current A_i-value for that entity than t1, i.e., t2 is more current than t1 in attribute A_i. A currency order ≺_Ai is empty when no currency information is known for attribute A_i.

A completion of D_t is a temporal instance D_t^c = (D, ≺^c_A1, ..., ≺^c_An) of R, such that for each i ∈ [1, n], (1) ≺_Ai ⊆ ≺^c_Ai, and (2) for all t1, t2 ∈ D, t1 and t2 are comparable under ≺^c_Ai iff t1[EID] = t2[EID]. The latter condition implies that ≺^c_Ai induces a total order on tuples that refer to the same entity, while tuples representing distinct entities are not comparable under ≺^c_Ai. We call ≺^c_Ai a completed currency order.

Denial constraints. We use denial constraints [3, 7] to specify additional currency information derived from the semantics of data, which enriches ≺_Ai. A denial constraint ϕ for R is a universally quantified FO sentence of the form:

$$\forall t_1, \ldots, t_k : R \;\Big( \bigwedge_{j \in [1,k]} \big(t_1[\mathrm{EID}] = t_j[\mathrm{EID}] \wedge \psi\big) \rightarrow t_u \prec_{A_i} t_v \Big),$$

where u, v ∈ [1, k], each t_j is a tuple variable denoting a tuple of R, and ψ is a conjunction of predicates of the form (1) t_j ≺_Al t_h, i.e., t_h is more current than t_j in attribute A_l; (2) t_j[A_l] = t_h[A_l] (resp. t_j[A_l] ≠ t_h[A_l]), i.e., t_j[A_l] and t_h[A_l] are identical (resp. distinct) values; (3) t_j[A_l] = c (resp. t_j[A_l] ≠ c), where c is a constant; and (4) possibly other built-in predicates defined on particular domains.

The constraint is interpreted over completions D_t^c of temporal instances of R. We say that D_t^c satisfies ϕ, denoted by D_t^c |= ϕ, if for all tuples t_1, ..., t_k in D that have the same EID value, if these tuples satisfy the predicates in ψ following the standard semantics of FO, then t_u ≺_Ai t_v. The use of EID in ϕ enforces that ϕ is imposed on tuples that refer to the same entity. We say that D_t^c satisfies a set Σ of denial constraints, denoted by D_t^c |= Σ, if D_t^c |= ϕ for all ϕ ∈ Σ.

Example 2.1: Recall relations Emp and Dept given in Fig. 1. Denial constraints on these relations include:

ϕ1: ∀s, t : Emp ((s[EID] = t[EID] ∧ s[salary] > t[salary]) → t ≺_salary s)
ϕ2: ∀s, t : Emp ((s[EID] = t[EID] ∧ s[status] = “married” ∧ t[status] = “single”) → t ≺_LN s)
ϕ3: ∀s, t : Emp ((s[EID] = t[EID] ∧ t ≺_salary s) → t ≺_address s)
ϕ4: ∀s, t : Dept ((s[EID] = t[EID] ∧ t ≺_mgrAddr s) → t ≺_budget s)

Here ϕ1 states that when Emp tuples s and t refer to the same employee, if s[salary] > t[salary], then s is more current than t in attribute salary. Note that ‘<’ denotes the built-in predicate “less-than” in the numeric domain of salary, whereas ≺_salary is the currency order for salary. Constraint ϕ2 asserts that if s[status] is married and t[status] is single, then s is more current than t in LN. Constraint ϕ3 states that if s is more current than t in salary, then s is also more current than t in address; similarly for ϕ4. ✷

Copy functions. Consider two temporal instances D_(t,1) = (D_1, ≺_A1, ..., ≺_Ap) and D_(t,2) = (D_2, ≺_B1, ..., ≺_Bq) of (possibly distinct) relation schemas R_1 and R_2, respectively. A copy function ρ of signature R_1[A⃗] ⇐ R_2[B⃗] is a partial mapping from D_1 to D_2, where A⃗ = (A_1, ..., A_l) and B⃗ = (B_1, ..., B_l), denoting attributes in R_1 and R_2, respectively. Here ρ is required to satisfy the copying condition: for each tuple t in D_1, if ρ(t) = s, then t[A_i] = s[B_i] for all i ∈ [1, l].

Intuitively, for tuples t ∈ D_1 and s ∈ D_2, ρ(t) = s indicates that the values of the A⃗ attributes of t have been imported from the B⃗ attributes of tuple s in D_2. Here A⃗ specifies a list of correlated attributes that should be copied together.

The copy function ρ is called ≺-compatible (w.r.t. the currency orders found in D_(t,1) and D_(t,2)) if for all t1, t2 ∈ D_1, for each i ∈ [1, l], if ρ(t1) = s1, ρ(t2) = s2, t1[EID] = t2[EID] and s1[EID] = s2[EID], then s1 ≺_Bi s2 implies t1 ≺_Ai t2.

Intuitively, ≺-compatibility requires that copy functions preserve currency orders. In other words, when attribute values are imported from D_2 to D_1, the currency orders on corresponding tuples defined in D_(t,2) are inherited by D_(t,1).

Example 2.2: Consider relations Emp and Dept shown in Fig. 1. A copy function ρ of signature Dept[mgrAddr] ⇐ Emp[address], depicted in Fig. 1 by arrows, is given as follows: ρ(t1) = s1, ρ(t2) = s1, ρ(t3) = s3 and ρ(t4) = s4. That is, the mgrAddr values of t1 and t2 have both been imported from s1[address], while t3[mgrAddr] and t4[mgrAddr] are copied from s3[address] and s4[address], respectively. The function satisfies the copying condition, since t1[mgrAddr] = t2[mgrAddr] = s1[address], t3[mgrAddr] = s3[address], and t4[mgrAddr] = s4[address].

Suppose that ≺_A is empty for each attribute A in Emp or Dept. Then copy function ρ is ≺-compatible w.r.t. these temporal instances of Emp and Dept. In contrast, assume that partial currency orders s1 ≺_address s3 on Emp and t3 ≺_mgrAddr t1 are given. Then ρ is not ≺-compatible. Indeed, since s1, s3 pertain to the same person Mary, and t1, t3 to the same department R&D, the relation s1 ≺_address s3 should carry over into t1 ≺_mgrAddr t3, as ρ(t1) = s1 and ρ(t3) = s3. Clearly, t3 ≺_mgrAddr t1 and t1 ≺_mgrAddr t3 are contradictory. ✷

Consistent completions of temporal orders.
A specification S of data currency consists of (1) a collection of temporal instances D(t,i) of schema Ri for i ∈ [1, s], (2) a set Σi of denial constraints imposed on each D(t,i) , and (3) a (possibly empty) copy function ρ(i,j) that imports data from D(t,i) to D(t,j) for i, j ∈ [1, s]. It specifies data values and entities (by normal instances embedded in D(t,i) ), partial currency orders known for each relation (by D(t,i) ), additional currency information derived from the semantics of the data (Σi ), and data that has been copied from one source to another (ρ(i,j) ). These D(t,i) ’s may denote different data sources, i.e., they may not necessarily be in the same database. A consistent completion Dc of S consists of temporal inc stances D(t,i) of Ri such that for all i, j ∈ [1, s], D Dt Dtc S Dc LST(Dc ) ρ̄ ρ̄e Se De a normal instance of a relation schema R a temporal instance of R with partial currency orders a completion of partial currency orders in Dt a specification of data currency a consistent completion of a specification S the current instance of Dc a collection of copy functions in S an extension of copy functions ρ̄ an extension of specification S by ρ̄e an extension of temporal instances by ρ̄e Table 1: A summary of notations As another example, suppose that there is a copy function ρ2 that imports budget attribute values of t1 and t3 from the budget attributes of s′′1 and s′′3 in another source D2 , respectively, where s′′1 = t1 and s′′3 = t3 , but in D2 , s′′3 ≺budget s′′1 . Then there is no consistent completion in this setting either. Indeed, all completed currency orders of ≺budget in Dept have to satisfy denial constraints ϕ1 , ϕ3 and ϕ4 , which enforce t1 ≺budget t3 , but ρ2 is not ≺-compatible with this currency order. This shows the interaction between denial constraints and currency constraints of copy functions. ✷ Current instances. In a temporal instance Dt = (D, ≺A1 , . . . 
, ≺An ) of R, let E = {t[EID] | t ∈ D}, and for each entity e ∈ E, let Ie = {t ∈ D | t[EID] = e}. That is, E contains all EID values in D, and Ie is the set of tuples pertaining to the entity whose EID is e. In a completion Dtc of Dt , for each attribute A of R, the current A value for entity e ∈ E is the value t[A], where t is the greatest (i.e., most current) tuple in the totally ordered set (Ie , ≺cA ). The current tuple for entity e ∈ E, denoted by LST(e, Dtc ), is the tuple te such that for each attribute A of R, te [A] is the current A value for entity e. We use LST(Dtc ) to denote LST(e, Dtc ) | e ∈ E , referred to as the current instance of Dtc . Observe that LST(Dtc ) is a normal instance of R, carrying no currency orders. For any c c Dc ∈ Mod(S), we define LST(Dc ) = LST(D(t,i) ) | D(t,i) ∈ c D , the set of all current instances. c 1. D(t,i) is a completion of D(t,i) , c 2. D(t,i) |= Σi , and 3. ρ(i,j) is compatible w.r.t. the completed currency orc c ders found in D(t,i) and D(t,j) . We use Mod(S) to denote the set of all consistent completions of S. We say that S is consistent if Mod(S) 6= ∅, i.e., there exists at least one consistent completion of S. Intuitively, if D(t,i) = (Di , ≺A1 , . . . , ≺An ) is part of a specc ification and D(t,i) = (Di , ≺cA1 , . . . , ≺cAn ) is part of a consistent completion of that specification, then each ≺cAj extends ≺Aj to a completed currency order, and the completed orders satisfy the denial constraints Σi and the constraints imposed by copy functions. Observe that the copying condition and ≺-compatibility impose constraints on consistent completions. This is particularly evident when a data source imports data from multiple sources, and when two data sources copy from each other, directly or indirectly. In addition, these constraints interact with denial constraints. Example 2.4: Recall the completion Dc0 of S0 from Exam ple 2.3. 
Then LST(Dc0 ) = LST(Emp), LST(Dept) , where LST(Emp) = {s3 , s4 , s5 }, and LST(Dept) = {t3 }. Note that LST(Emp) and LST(Dept) are normal instances. As another example, suppose that s4 and s5 refer to the same person. Consider an extension of the currency orders given in Dc0 by adding s4 ≺A s5 and s5 ≺B s4 , where A ranges over FN, LN, address and status while B is salary. Then the current tuple of this person is (Robert, Luth, 8 Drum St, 80k, married), in which the first four attributes are taken from s5 while its salary attribute is taken from s4 . ✷ Example 2.3: Consider a specification S0 consisting of Emp and Dept of Fig. 1, the denial constraints ϕ1 –ϕ4 given in Example 2.1, and the copy ρ defined in Example 2.2. Assume that no currency orders are known for Emp and Dept initially. A consistent completion Dc0 of S0 defines (1) s1 ≺A s2 ≺A s3 when A ranges over FN, LN, address, salary and status for Emp tuples, and (2) t1 ≺B t2 ≺B t4 ≺B t3 when B ranges over mgrFN, mgrLN, mgrAddr and budget for Dept tuples (here we assume that dname is the EID attribute of Dept). One can verify that Dc0 satisfies the denial constraints and the constraints imposed by ρ, and hence, Dc0 ∈ Mod(S0 ). Note that no currency order is defined between any of s1 , s2 , s3 and any of s4 , s5 , since they represent different entities. Evaluating queries with current values. Consider a query Q posed on normal instances of (R1 , . . . , Rl ), which does not refer to currency orders, where Ri is in specification S for i ∈ [1, l]. We say that a tuple t is a certain current answer to Q w.r.t. S if t is in \ Q LST(Dc ) . Suppose that Dept also copies from a source D1 consisting of a single tuple s′1 , which is the same as s1 except that s′1 [address] = “5 Elm Ave”. It uses a copy function ρ1 that imports s′1 [address] to t1 [mrgAddr]. Then there exists no consistent completion in this setting since t1 may not import distinct values s′1 [address] and s1 [address] for t1 [mrgAddr]. 
In other words, the constraints imposed by the copying conditions of ρ and ρ1 cannot be satisfied at the same time.

That is, t is in Q(LST($D^c$)) for every $D^c$ ∈ Mod(S): t is warranted to be an answer computed from the current values no matter how the partial currency orders in S are completed, as long as the denial constraints and constraints imposed by the copy functions of S are satisfied.

Example 2.5: Recall queries Q1, Q2, Q3 and Q4 from Example 1.1, and specification S0 from Example 2.3. One can verify that answers to the queries given in Example 1.1 are certain current answers w.r.t. S0, i.e., the answers remain unchanged in LST($D^c$) for all $D^c$ ∈ Mod(S0). ✷

$D^c$ ∈ Mod(S0), if $D^c_{Emp}$ is the completion of the Emp instance in $D^c$, then LST($D^c_{Emp}$) = {s3, s4, s5}. ✷

Query answering. Given a query Q, we want to know whether a tuple t is in Q(LST($D^c$)) for all $D^c$ ∈ Mod(S). We summarize notations in Table 2, including those given in this section and notations to be introduced in Section 5.

CCQA(LQ): The certain current query answering problem. INPUT: A specification S, a tuple t and a query Q ∈ LQ. QUESTION: Is t a certain current answer to Q w.r.t. S?

3. Decision Problems for Data Currency

We study four problems associated with data currency.

The consistency of specifications. The first problem is to decide whether a given specification S makes sense, i.e., whether there exists any consistent completion of S. As shown in Example 2.3, there exist specifications S such that Mod(S) is empty, because of the interaction between denial constraints and copy functions, among other things.

We study CCQA(LQ) when LQ ranges over the following query languages (see, e.g., [1] for the details):
• CQ, the class of conjunctive queries built up from relation atoms and equality (=), by closing under conjunction ∧ and existential quantification ∃;
• UCQ, unions of conjunctive queries of the form Q1 ∪ ··· ∪ Qk, where for each i ∈ [1, k], Qi is in CQ;

CPS: The consistency problem for specifications.
INPUT: A specification S of data currency. QUESTION: Is Mod(S) nonempty?

• ∃FO+, first-order logic (FO) queries built from atomic formulas, by closing under ∧, disjunction ∨ and ∃; and
• FO queries built from atomic formulas using ∧, ∨, negation ¬, ∃ and universal quantification ∀.

While different query languages have no impact on the data complexity of CCQA(LQ), as will be seen soon, they do make a difference when the combined complexity is concerned.

Certain currency orders. The next question studies whether a given currency order is contained in all consistent completions of a specification. Given two temporal instances $D_{(t,1)}$ = (D, $\prec_{A_1}$, ..., $\prec_{A_n}$) and $D_{(t,2)}$ = (D, $\prec'_{A_1}$, ..., $\prec'_{A_n}$) of the same schema R, we say that $D_{(t,1)}$ is contained in $D_{(t,2)}$, denoted by $D_{(t,1)}$ ⊆ $D_{(t,2)}$, if $\prec_{A_j}$ ⊆ $\prec'_{A_j}$ for all j ∈ [1, n]. Consider a specification S in which there is a temporal instance $D_t$ = (D, $\prec_{A_1}$, ..., $\prec_{A_n}$) of schema R. A currency order for $D_t$ is a temporal instance $O_t$ = (D, $\prec'_{A_1}$, ..., $\prec'_{A_n}$) of R. Observe that $O_t$ does not necessarily contain $D_t$.

COP: The certain ordering problem. INPUT: A specification S in which $D_t$ is a temporal instance, and a currency order $O_t$ for $D_t$. QUESTION: Is $O_t$ ⊆ $D_t^c$ for all $D^c$ ∈ Mod(S)? Here $D_t^c$ is the completion of $D_t$ in $D^c$.

4. Reasoning about the Currency of Data

In this section we focus on CPS, COP, DCIP and CCQA. We establish the data complexity and combined complexity of these problems. For the data complexity, we fix denial constraints and queries (for CCQA), and study the complexity in terms of varying size of data sources and copy functions. For the combined complexity we also allow denial constraints and queries to vary (see, e.g., [1] for a detailed discussion of data and combined complexity).

Example 3.1: Consider specification S0 of Example 2.3. We want to know whether s1 ≺salary s3 is assured by every completion $D^c$ ∈ Mod(S0).
To this end we construct a currency order Ot = (Emp, ≺FN , ≺LN , ≺address , ≺salary , ≺status ), in which s1 ≺salary s3 is in ≺salary , but the partial orders for all other attributes are empty. One can verify that Ot is indeed a certain currency order, as assured by denial constraint ϕ1 . Similarly, one can define a currency order Ot′ to check whether t3 ≺mgrFN t4 is entailed by all Dc ∈ Mod(S0 ). One can readily verify that it is not the case. Indeed, there exists Dc1 ∈ Mod(S0 ), such that t4 ≺mgrFN t3 is given in Dc1 . ✷ The consistency of specifications. We start with CPS, which is to decide, given a specification S consisting of partial currency orders, denial constraints and copy functions, whether there exists any consistent completion in Mod(S). The result below tells us the following. (1) The problem is nontrivial: it is Σp2 -complete. It remains intractable when denial constraints are fixed (data complexity). (2) Denial constraints are a major factor that makes the problem hard. Indeed, the complexity bounds are not affected even when no copy functions are defined in S. Certain current instances. Given a specification S of data currency, one naturally wants to know whether every consistent completion of S yields the same current instances. We say that a specification S of data currency is deterministic for current instances if for all consistent completions Dc1 , Dc2 ∈ Mod(S), LST(Dc1 ) = LST(Dc2 ). This definition naturally carries over to a particular relation schema R: specification S is said to be deterministic for current R instances if for all consistent completions Dc1 , Dc2 ∈ Mod(S), the instance of R in LST(Dc1 ) is equal to the instance of R in LST(Dc2 ). Theorem 4.1: For CPS, (1) the combined complexity is Σp2 -complete, and (2) the data complexity is NP-complete. The upper bounds and lower bounds remain unchanged even in the absence of copy functions. ✷ Proof sketch: (1) Lower bounds. 
For the combined complexity, we show that CPS is $\Sigma_2^p$-hard by reduction from the ∃∗∀∗3sat problem, which is $\Sigma_2^p$-complete (cf. [28]). Given a sentence φ = ∃X∀Y ψ(X, Y), we construct a specification S consisting of a single temporal instance $D_t$ of a binary relation schema and a set Γ of denial constraints, such that φ is true iff Mod(S) ≠ ∅. We use $D_t$ to encode truth assignments $\mu_X$ for X, and Γ to assure that $\mu_X$ satisfies ∀Y ψ(X, Y) if there exists a consistent completion of $D_t$. Here ∀Y ψ(X, Y) is encoded by leveraging the property $\forall Y \big(\bigwedge_{i\in[1,r]} C_i(X,Y)\big) = \bigwedge_{i\in[1,r]} \forall Y\, C_i(X,Y)$, for $\psi(X,Y) = \bigwedge_{i\in[1,r]} C_i(X,Y)$.

DCIP: The deterministic current instance problem. INPUT: A specification S. QUESTION: Is S deterministic for current instances?

Example 3.2: The specification S0 of Example 2.3 is deterministic for current Emp instances. Indeed, for all

For the data complexity, we show that CPS is NP-hard by reduction from the Betweenness problem, which is NP-complete (cf. [18]). Given two sets E and F = {(ei, ej, ek) | ei, ej, ek ∈ E}, the Betweenness problem is to decide whether there is a bijection π : E → {1, ..., |E|} such that for each (ei, ej, ek) ∈ F, either π(ei) < π(ej) < π(ek) or π(ek) < π(ej) < π(ei). Given E and F, we define a specification S with a temporal instance $D_t$ of a 4-ary schema, and a set of fixed denial constraints. We show that there exists a solution to the Betweenness problem iff Mod(S) is nonempty.

(2) Upper bounds. We provide an algorithm that, given a specification S, guesses a completion $D^c$ of total orders in S, and then checks whether $D^c$ ∈ Mod(S). The checking involves (a) denial constraints and (b) the copying condition and ≺-compatibility of copy functions in S. Step (b) is in PTIME. Step (a) is in PTIME if the denial constraints are fixed, and it uses an NP-oracle otherwise. Hence CPS is in NP for data complexity and in $\Sigma_2^p$ for combined complexity.
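The guess-and-check upper bound can be made concrete with a small brute-force sketch. It is only an illustration under simplifying assumptions: there are no copy functions, each denial constraint is modelled as a Python predicate over a completion, and the nondeterministic guess is replaced by exhaustive enumeration (so it takes exponential time). All names (`total_orders`, `completions`, `is_consistent`, `higher_salary_is_newer`) and the two-tuple instance are hypothetical and do not come from the paper.

```python
from itertools import permutations, product

def total_orders(group, partial):
    """Yield each total order on `group` (tuple ids of one entity) that
    extends `partial`, a set of pairs (a, b) meaning a is less current than b."""
    for perm in permutations(group):
        rank = {tid: pos for pos, tid in enumerate(perm)}
        if all(rank[a] < rank[b] for a, b in partial
               if a in rank and b in rank):
            yield rank

def completions(tuples, attrs, partial_orders):
    """Yield completions as {attr: {tuple_id: rank}}. Ranks are only
    meaningful between tuples that share an EID."""
    by_entity = {}
    for tid, t in tuples.items():
        by_entity.setdefault(t["EID"], []).append(tid)
    slots = [(a, g) for a in attrs for g in by_entity.values()]
    choices = [list(total_orders(g, partial_orders.get(a, set())))
               for a, g in slots]
    for combo in product(*choices):
        completion = {a: {} for a in attrs}
        for (a, _), rank in zip(slots, combo):
            completion[a].update(rank)
        yield completion

def is_consistent(tuples, attrs, partial_orders, constraints):
    """Try every completion and check all denial constraints on it."""
    return any(all(check(tuples, c) for check in constraints)
               for c in completions(tuples, attrs, partial_orders))

def higher_salary_is_newer(tuples, completion):
    """Hypothetical constraint: within an entity, a larger salary is more current."""
    rank = completion["salary"]
    return all(rank[t] < rank[s]
               for s in tuples for t in tuples
               if tuples[s]["EID"] == tuples[t]["EID"]
               and tuples[s]["salary"] > tuples[t]["salary"])

# Hypothetical instance: two tuples for the same entity.
emp = {1: {"EID": "e1", "salary": 50}, 2: {"EID": "e1", "salary": 80}}
print(is_consistent(emp, ["salary"], {}, [higher_salary_is_newer]))
print(is_consistent(emp, ["salary"], {"salary": {(2, 1)}},
                    [higher_salary_is_newer]))
```

In the second call the given partial order states that tuple 2 is less current than tuple 1 in salary, which contradicts the constraint, so no consistent completion exists.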
In the proofs for the lower bounds, no copy functions are defined, and the relation schemas are fixed. ✷ (a) For CQ, we verify it by reduction from the ∀∗ ∃∗ 3sat problem, which is Πp2 -complete (cf. [28]). Given a sentence φ = ∀X∃Y ψ(X, Y ), we define a CQ query Q, a fixed tuple t, and a specification S consisting of five temporal instances of fixed schemas. We use these temporal instances to encode (i) disjunction and negation, which are not expressible in CQ, (ii) truth assignments µX for X, with an instance DX , and (iii) relations for inspecting whether t is an answer to Q. Query Q encodes ∃Y ψ(X, Y ) w.r.t. µX , such that φ is true iff t is an answer to Q for each consistent completion of DX , i.e., when µX ranges over all truth assignments for X. (b) For FO, we show that CCQA is PSPACE-hard by reduction from Q3SAT, which is PSPACE-complete (cf. [28]). Given an instance φ of Q3SAT, we define an FO query Q, a fixed tuple t and a specification S with a single temporal instance. Query Q encodes φ, and the relation encodes Boolean values for which there is a single completion Dc0 in Mod(S). We show that φ is true iff t is in Q(Dc0 ). For the data complexity, we show that CCQA is coNP-hard even for CQ, by reduction from the complement of 3sat. Given a propositional formula ψ, we define a fixed CQ query Q, a fixed tuple t and a specification S consisting of two temporal instances Dψ and D¬ψ of fixed relation schemas. We use Dψ to encode (i) truth assignments µX for variables X in ψ, and (ii) literals in ψ. We encode the negations of clauses in ψ using D¬ψ , for which there is a unique consistent completion. For each consistent completion of Dψ , i.e., each µX for X, query Q returns t iff ψ is not satisfied by µX . In the lower bound proofs, neither denial constraints nor copy functions are defined, and all the schemas are fixed. Upper bounds. 
We develop an algorithm that, given a query Q, a tuple t and a specification S, returns “no” if there exists $D^c$ ∈ Mod(S) such that t ∉ Q(LST($D^c$)). The algorithm first guesses $D^c$, and then checks whether (a) $D^c$ ∈ Mod(S) and (b) t ∉ Q(LST($D^c$)). Step (b) is in coNP when Q is in ∃FO+, and is in PSPACE if Q is in FO. When Q is fixed, step (b) is in PTIME no matter whether Q is in CQ or FO. Putting these together with Theorem 4.1 (for step (a)), we conclude that the data complexity of CCQA is in coNP, and its combined complexity is in $\Pi_2^p$ for ∃FO+ and in PSPACE for FO. ✷

The certainty of currency orders. We next study COP and DCIP. The certain currency ordering problem COP is to determine, given a specification S and a currency order $O_t$, whether each t ≺A s in $O_t$ is entailed by the partial currency orders, denial constraints and copy functions in S. The deterministic current instance problem DCIP is to decide, given S, whether the current instance of each temporal instance of S is unchanged for all consistent completions of S. These problems are, unfortunately, also beyond reach in practice.

Corollary 4.2: For both COP and DCIP, (1) the combined complexity is $\Pi_2^p$-complete, and (2) the data complexity is coNP-complete. The complexity bounds remain unchanged when no copy functions are present. ✷

Proof sketch: (1) Lower bounds. For both COP and DCIP, the lower bounds are verified by reduction from the complement of CPS, for data complexity and combined complexity. (2) Upper bounds. A non-deterministic algorithm is developed for each of COP and DCIP, which is in coNP (data complexity) and $\Pi_2^p$ (combined complexity). ✷

Special cases. The results above tell us that it is nontrivial to reason about data currency. In light of this, we look into special cases of these problems with lower complexity. As shown by Theorem 4.1 and Corollary 4.2, denial constraints make the analyses of CPS, COP and DCIP intricate.
Indeed, these problems are intractable even when denial constraints are fixed. Hence we consider specifications with no denial constraints, but containing partial currency orders and copy functions. The result below shows that the absence of denial constraints indeed simplifies the analyses.

Theorem 4.4: In the absence of denial constraints, CPS, COP and DCIP are in PTIME. ✷

Proof sketch: For CPS, we develop an algorithm that, given a specification S with no denial constraints defined, checks whether Mod(S) ≠ ∅. Let ρ̄ denote the collection of copy functions in S. One can verify that in the absence of denial constraints, S is consistent iff there exists no violation of the copying condition or ≺-compatibility of ρ̄ in any temporal instance of S. Hence it suffices to detect violations in the instances of S, rather than in their completions.

Query answering. The certain current query answering problem CCQA(LQ) is to determine, given a tuple t, a specification S and a query Q ∈ LQ, whether t ∈ Q(LST($D^c$)) for all $D^c$ ∈ Mod(S). The result below provides the data complexity of the problem, as well as its combined complexity when LQ ranges over CQ, UCQ, ∃FO+ and FO. It tells us the following. (1) Disjunctions in UCQ and ∃FO+ do not incur extra complexity to CCQA. Indeed, CCQA has the same complexity for CQ as for UCQ and ∃FO+. (2) In contrast, the presence of negation in FO complicates the analysis. (3) Copy functions have no impact on the complexity bounds.

Theorem 4.3: The combined complexity of CCQA(LQ) is
• $\Pi_2^p$-complete when LQ is CQ, UCQ or ∃FO+, and
• PSPACE-complete when LQ is FO.
The data complexity is coNP-complete when LQ ∈ {CQ, UCQ, ∃FO+, FO}. These complexity bounds are unchanged in the absence of copy functions. ✷

Proof sketch: Lower bounds. For the combined complexity, we show the following: (a) CCQA is already $\Pi_2^p$-hard for CQ, and (b) it is PSPACE-hard for FO.
is coNP-hard (data complexity) and $\Pi_2^p$-hard (combined complexity) by reduction from the complement of CPS, for which the complexity is established in Theorem 4.1. Given a specification S, we construct a specification S′, a tuple t and an identity query Q, such that Mod(S) is empty iff t is in Q(LST($D^c$)) for each $D^c$ ∈ Mod(S′). The upper bounds of CCQA in this setting follow from Theorem 4.3. ✷

As shown in Example 2.2, it is not straightforward to check ≺-compatibility, especially when tuples are imported indirectly from other sources. Nonetheless, we show that this can be done in O(|S|²) time, where |S| is the size of S. For COP, we show that given a specification S without denial constraints, to decide whether a currency order is contained in all completions of S, it suffices to check temporal instances of S. This can be decided by a variation of the algorithm for CPS, also in PTIME; similarly for DCIP. ✷

5. Currency Preservation in Data Copying

As we have seen earlier, copy functions tell us what data values in a relation have been imported from other data sources. Naturally we want to leverage the imported values to improve query answers. This gives rise to the following questions: do the copy functions import sufficient current data values for answering a query Q? If not, how do we extend the copy functions such that Q can be answered with more up-to-date data values? To answer these questions we introduce a notion of currency-preserving copy functions. We consider a specification S of data currency consisting of two collections of temporal instances (data sources) D = (D1, ..., Dp) and D′ = (D′1, ..., D′q), with (1) a set $\Sigma_i$ (resp. $\Sigma'_j$) of denial constraints on $D_i$ for each i ∈ [1, p] (resp. $D'_j$ for j ∈ [1, q]), and (2) a collection ρ of copy functions $\rho_{(j,i)}$ that import tuples from $D'_j$ to $D_i$, for i ∈ [1, p] and j ∈ [1, q], i.e., all the functions of ρ import data from D′ to D.
In contrast, the absence of denial constraints does not make our lives easier when it comes to CCQA. Indeed, in the proof of Theorem 4.3, the lower bounds of CCQA are verified using neither denial constraints nor copy functions.

Corollary 4.5: In the absence of denial constraints, CCQA(LQ) remains coNP-hard (data complexity) and
• $\Pi_2^p$-hard (combined complexity) even for CQ, and
• PSPACE-hard (combined complexity) for FO. ✷

Theorem 4.3 tells us that the complexity of CCQA for CQ is rather robust: adding disjunctions does not increase the complexity. We next investigate the impact of removing Cartesian product from CQ on the complexity of CCQA. We consider SP queries, which are CQ queries of the form $Q(\vec{x}) = \exists e\, \vec{y}\, \big(R(e, \vec{x}, \vec{y}) \wedge \psi\big)$, where ψ is a conjunction of equality atoms and $\vec{x}$ and $\vec{y}$ are disjoint sequences of variables in which no variable appears twice. SP queries support projection and selection only. For instance, Q1–Q4 of Example 1.1 are SP queries. SP queries in which ψ is a tautology are referred to as identity queries. We show that for SP queries, denial constraints make a difference. Without denial constraints, CCQA is in PTIME for SP queries. In contrast, when denial constraints are imposed, CCQA is no easier for identity queries than for ∃FO+.

Extensions. To formalize currency preservation, we first present the following notions. Assume that $D_i$ = (D, $\prec_{A_1}$, ..., $\prec_{A_n}$) and $D'_j$ = (D′, $\prec_{B_1}$, ..., $\prec_{B_m}$) are temporal instances of relation schemas $R_i$ = (EID, A1, ..., An) and $R'_j$ = (EID, B1, ..., Bm), respectively. Assume that n ≤ m. An extension of $D_i$ is a temporal instance $D_i^e$ = ($D^e$, $\prec^e_{A_1}$, ..., $\prec^e_{A_n}$) of $R_i$ such that (1) D ⊆ $D^e$, (2) $\prec_{A_h}$ ⊆ $\prec^e_{A_h}$ for all h ∈ [1, n], and (3) $\pi_{EID}(D^e)$ = $\pi_{EID}(D)$. Intuitively, $D_i^e$ extends $D_i$ by adding new tuples for those entities that are already in $D_i$. It does not introduce new entities.
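The three conditions for an extension of a temporal instance translate directly into set comparisons. The following is a minimal sketch, assuming tuples are Python tuples whose first component is the EID and each currency order is a set of (less current, more current) pairs; the function name `is_extension` and the sample tuples are hypothetical.

```python
def is_extension(D, orders, D_e, orders_e):
    """Check conditions (1)-(3) for (D_e, orders_e) extending (D, orders).

    D, D_e: sets of tuples whose first component is the EID.
    orders, orders_e: dict mapping attribute name -> set of (older, newer) pairs.
    """
    contains_tuples = D <= D_e                                   # (1) D is a subset of D^e
    contains_orders = all(pairs <= orders_e.get(attr, set())     # (2) each order is extended
                          for attr, pairs in orders.items())
    same_entities = {t[0] for t in D_e} == {t[0] for t in D}     # (3) same EID values
    return contains_tuples and contains_orders and same_entities

# Hypothetical tuples of the form (EID, FN, LN).
old = ("e1", "Mary", "Dupont")
new_same_entity = ("e1", "Mary", "Smith")
new_entity = ("e2", "Bob", "Luth")

print(is_extension({old}, {"LN": set()}, {old, new_same_entity}, {"LN": set()}))
print(is_extension({old}, {"LN": set()}, {old, new_entity}, {"LN": set()}))
```

The second call fails condition (3) because it introduces a new entity.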
Corollary 4.6: For SP queries, CCQA(SP) is
• in PTIME in the absence of denial constraints, and
• coNP-complete (data complexity) and $\Pi_2^p$-complete (combined complexity) in the presence of denial constraints, even for identity queries. ✷

Proof sketch: (1) In the absence of denial constraints, one can verify that for any specification S and each SP query Q, $\bigcap_{D^c \in \mathrm{Mod}(S)} Q(\mathrm{LST}(D^c))$ can be obtained by (1) evaluating Q on a representation repr(S) of {LST($D^c$) | $D^c$ ∈ Mod(S)}, where repr(S) compactly represents all possible latest values by special symbols; and (2) removing all tuples in Q(repr(S)) that contain such special symbols. Capitalizing on this property, we develop an algorithm that takes as input an SP query Q, a tuple t and a specification S without denial constraints. It checks whether t is a certain current answer to Q w.r.t. S as follows: (a) check whether S is consistent, and return “no” if not; (b) compute the representation repr(S); (c) check whether t is in the “stripped” version of Q(repr(S)); and (d) return “yes” if so, and “no” otherwise. Step (c) is in PTIME for SP queries. By Theorem 4.4 and by leveraging the certain order computed by the algorithm for CPS, steps (a) and (b) are also in PTIME. Therefore, CCQA is in PTIME in this setting, for both combined complexity (when queries are not fixed) and data complexity. (2) In the presence of denial constraints, we show that CCQA

Consider two copy functions: $\rho_{(j,i)}$ imports tuples from $D'_j$ to $D_i$, and $\rho^e_{(j,i)}$ from $D'_j$ to $D_i^e$, both of signature $R_i[\vec{A}] \Leftarrow R'_j[\vec{B}]$, where $\vec{A}$ = (A1, ..., An) and $\vec{B}$ is a sequence of n attributes in $R'_j$. We say that $\rho^e_{(j,i)}$ extends $\rho_{(j,i)}$ if

1. $D_i^e$ is an extension of $D_i$;
2. for each tuple t in $D_i$, if $\rho_{(j,i)}(t)$ is defined, then so is $\rho^e_{(j,i)}(t)$ and moreover, $\rho^e_{(j,i)}(t) = \rho_{(j,i)}(t)$;
3. for each tuple t in $D_i^e \setminus D_i$, there exists a tuple s in $D'_j$ such that $\rho^e_{(j,i)}(t) = s$.

We refer to $D_i^e$ as the extension of $D_i$ by $\rho^e_{(j,i)}$.
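The conditions under which one copy function extends another can be checked mechanically. The sketch below assumes each copy function is represented as a Python dict from a target tuple to the source tuple it is copied from, and it checks condition 1 only on the tuple sets (currency orders and the copying condition on attribute values are left out). The name `extends` and the sample tuples are hypothetical.

```python
def extends(rho, rho_e, D, D_e, D_src):
    """Return True if rho_e extends rho in the sense of conditions 1-3.

    rho, rho_e: dicts mapping a target tuple to the source tuple it copies.
    D, D_e: sets of target tuples (EID first); D_src: set of source tuples.
    """
    # Condition 1, restricted to tuple sets: same entities, no tuple dropped.
    instance_ok = D <= D_e and {t[0] for t in D_e} == {t[0] for t in D}
    # Condition 2: every existing copy is kept unchanged.
    old_kept = all(rho_e.get(t) == s for t, s in rho.items())
    # Condition 3: every new tuple is copied from some source tuple.
    new_copied = all(rho_e.get(t) in D_src for t in D_e - D)
    return instance_ok and old_kept and new_copied

# Hypothetical target tuples (t1, t2) and source tuples (s1, s2).
t1, t2 = ("e1", "Mary", "Dupont"), ("e1", "Mary", "Smith")
s1, s2 = ("m1", "Mary", "Dupont"), ("m1", "Mary", "Smith")

print(extends({t1: s1}, {t1: s1, t2: s2}, {t1}, {t1, t2}, {s1, s2}))
print(extends({t1: s1}, {t1: s1}, {t1}, {t1, t2}, {s1, s2}))
```

The second call fails condition 3: the new tuple `t2` is present in the extended instance but is not copied from any source tuple.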
Observe that $D_i^e$ is not allowed to expand arbitrarily. (a) Each new tuple t in $D_i^e$ is copied from a tuple s in $D'_j$. (b) No new entity is introduced. Note that only copy functions that cover all attributes but EID of $R_i$ can be extended. This assures that all the attributes of a new tuple are in place. An extension $\rho^e$ of ρ is a collection of copy functions $\rho^e_{(j,i)}$ such that $\rho^e$ ≠ ρ and moreover, for all i ∈ [1, p] and j ∈ [1, q], either $\rho^e_{(j,i)}$ is an extension of $\rho_{(j,i)}$, or $\rho^e_{(j,i)} = \rho_{(j,i)}$. We denote the set of all extensions of ρ as Ext(ρ). For each $\rho^e$ in Ext(ρ), we denote as $S^e$ the extension of S by $\rho^e$, which consists of the same D′ and denial constraints as in S, but has copy functions $\rho^e$ and $D^e$ = ($D_1^e$, ..., $D_p^e$), where $D_i^e$ is an extension of $D_i$ by $\rho^e_{(j,i)}$ for all j ∈ [1, q].

Currency preservation. We are now ready to define currency preservation. Consider a collection ρ of copy functions in a specification S. We say that ρ is currency preserving for a query Q w.r.t. S if (a) Mod(S) ≠ ∅, and moreover, (b) for all $\rho^e$ ∈ Ext(ρ) such that Mod($S^e$) ≠ ∅, we have that $\bigcap_{D^c \in \mathrm{Mod}(S)} Q(\mathrm{LST}(D^c)) = \bigcap_{D^c_e \in \mathrm{Mod}(S^e)} Q(\mathrm{LST}(D^c_e))$. Intuitively, ρ is currency preserving if (1) ρ is meaningful; and (2) for each extension $\rho^e$ of ρ that makes sense, the certain current answers to Q are not improved by $\rho^e$, i.e., no matter what additional tuples are imported for those entities in D, the certain current answers to Q remain unchanged.

        FN     LN       address      salary   status     phone
s′1:    Mary   Dupont   6 Main St    60k      married    6671975
s′2:    Mary   Dupont   6 Main St    80k      married    6671975
s′3:    Mary   Smith    2 Small St   80k      divorced   2962845

Figure 2: Relation Mgr

Example 5.1: As shown in Fig. 2, relation Mgr collects manager records. Consider a specification S1 consisting of the following: (a) temporal instances Mgr of Fig. 2 and Emp of Fig. 1, in which partial currency orders are ∅ for all attributes; (b) denial constraints ϕ1–ϕ3 of Example 2.1 and ϕ5: ∀s, t : Mgr ((s[EID] = t[EID] ∧ s[status] = “divorced” ∧ t[status] = “married”) → t ≺LN s), i.e., if s[status] is divorced and t[status] is married, then s is more current than t in LN; and (c) a copy function ρ with signature Emp[$\vec{A}$] ⇐ Mgr[$\vec{A}$], where $\vec{A}$ is (FN, LN, address, salary, status), such that ρ(s3) = s′2, i.e., s3 of Emp is copied from s′2 of Mgr. Obviously S1 is consistent. Recall query Q2 of Example 1.1, which is to find Mary’s current last name. For Q2, ρ is not currency preserving. Indeed, there is an extension ρ1 of ρ by copying s′3 to Emp. In all consistent completions of the extension Emp1 of Emp by ρ1, the answer to Q2 is Smith. However, the answer to Q2 in all consistent completions of Emp is Dupont (see Examples 1.1 and 2.5). In contrast, ρ1 is currency preserving for Q2. Indeed, copying more tuples from Mgr (i.e., tuple s′1) to Emp does not change the answer to Q2 in Emp1. ✷

The next problem is to decide whether ρ in S can be extended to be currency preserving for a query Q.

ECP(LQ): The existence problem. INPUT: A query Q in LQ, and a consistent specification S with non-currency-preserving ρ. QUESTION: Does there exist $\rho^e$ in Ext(ρ) that is currency preserving for Q?

Bounded extension. We also want to know whether it suffices to extend ρ by copying additional data of a bounded size, and make it currency preserving.

BCP(LQ): The bounded copying problem. INPUT: S, ρ and Q as in ECP, and a positive number k. QUESTION: Does there exist $\rho^e$ ∈ Ext(ρ) such that $\rho^e$ is currency preserving for Q and |$\rho^e$| ≤ k + |ρ|?

6. Deciding Currency Preservation

We next study the decision problems in connection with currency-preserving copy functions, namely, CPP(LQ), MCP(LQ), ECP(LQ) and BCP(LQ) when LQ is CQ, UCQ, ∃FO+ or FO.
We provide their combined complexity and data complexity, and identify special cases with lower complexity.

Checking currency preservation. We first investigate CPP(LQ), the problem of deciding whether a collection of copy functions in a given specification is currency preserving for a query Q. We show that CPP is nontrivial. Indeed, its combined complexity is already $\Pi_3^p$-hard when Q is in CQ, and it is PSPACE-complete when Q is in FO. One might be tempted to think that fixing denial constraints would make our lives easier. Indeed, in practice denial constraints are often predefined and fixed, and only data, copy functions and query vary. Moreover, as shown in Theorem 4.1 for the consistency problem, fixing denial constraints indeed helps there. Unfortunately, it does not simplify the analysis of the combined complexity when it comes to CPP. Even when both query and denial constraints are fixed, the problem is $\Pi_2^p$-complete (data complexity).

Theorem 6.1: For CPP(LQ), the combined complexity is
• $\Pi_3^p$-complete when LQ is CQ, UCQ or ∃FO+, and
• PSPACE-complete when LQ is FO.
• Its data complexity is $\Pi_2^p$-complete when LQ ∈ {CQ, UCQ, ∃FO+, FO}.
The combined complexity bounds remain unchanged when denial constraints and copy functions are fixed. ✷

Deciding currency preservation. There are several decision problems associated with currency-preserving copy functions, which we shall investigate in the rest of the paper. The first problem is to decide whether given copy functions have imported necessary current data for answering a query.

CPP(LQ): The currency preservation problem. INPUT: A query Q in LQ, and a specification S of data currency with copy functions ρ. QUESTION: Is ρ currency preserving for Q?

Minimal copying. Moreover, we want to know whether a currency preserving ρ does not copy unnecessary or redundant data. To this end, we use |ρ| to denote the sum of the sizes of data values copied by ρ.
We say that ρ is minimal for Q if there exists no collection ρ′ of copy functions such that (1) ρ ∈ Ext(ρ′), (2) |ρ′| < |ρ|, and (3) ρ′ is currency preserving for Q. That is, there exists no currency-preserving ρ′ that imports less data from D′ to D than ρ.

MCP(LQ): The minimal copying problem. INPUT: S and Q as in CPP, with a collection ρ of currency-preserving copy functions in S. QUESTION: Is ρ minimal for Q?

Extending copy functions. Consider a consistent specification S in which ρ is not currency preserving for a query Q.

Proof sketch: (1) Lower bounds. For the combined complexity, it suffices to show that CPP is already $\Pi_3^p$-hard for CQ, while for FO, it is PSPACE-hard. For the data complexity, we show that CPP is $\Pi_2^p$-hard with CQ queries. The $\Pi_3^p$ lower bound is verified by reduction from the ∀∗∃∗∀∗3sat problem, which is $\Pi_3^p$-complete (cf. [28]). Given a sentence φ = ∀X∃Y∀Z ψ(X, Y, Z), we construct a query Q in CQ, and a specification S with (i) two data sources D and D′, (ii) a single copy function ρ and (iii) a set of denial constraints. Source D consists of four relations encoding Boolean values, disjunction, conjunction and negation, as well as a relation $D_b$ for testing certain current query answers. Source D′ has a single relation $D'_b$ from which some tuples are copied to $D_b$ by ρ. Query Q encodes φ by leveraging the relations in D for Boolean operations, and the property of ∀Z ψ(X, Y, Z) explored in the proof of Theorem 4.1. We show that φ is true iff ρ is currency preserving for Q. When it comes to FO, we show that CPP is PSPACE-hard by a straightforward reduction from Q3SAT. In these proofs, both denial constraints and copy functions are fixed, independent of ∀∗∃∗∀∗3sat or Q3SAT instances. For the data complexity, we verify the $\Pi_2^p$ lower bound for CQ by reduction from the ∀∗∃∗3sat problem. Given a sentence φ = ∀X∃Y ψ(X, Y), we define a query Q in CQ and a specification S such that φ is true iff ρ is currency preserving for Q w.r.t. S.
Here Q is fixed, i.e., it is independent of φ. (2) Upper bounds. We give an algorithm that takes a query Q and a specification S as input, and checks whether the copy functions ρ̄ in S are currency preserving for Q. It invokes oracles to check whether S is consistent and whether some guessed tuples are certain current answers to Q in $D^c$ ∈ Mod(S) but are not answers to Q in some $D^c_e$ ∈ Mod($S^e$), where $S^e$ is an extension of S by extending copy functions. The oracles are in $\Pi_2^p$ or $\Sigma_2^p$ when Q is in ∃FO+, and in PSPACE when Q is in FO. When Q is fixed, the oracles are in NP or coNP. From these the upper bounds follow. ✷

ification S in which copy functions ρ̄ are not currency preserving for Q, whether we can extend ρ̄ to preserve currency. The good news is that the answer to this question is affirmative: we can always extend ρ̄ and make them currency preserving for Q. Hence the decision problem ECP is in O(1) time, although it may take much longer to explicitly construct a currency-preserving extension of ρ̄.

Proposition 6.3: ECP(LQ) is decidable in O(1) time for both the combined complexity and data complexity, when LQ is CQ, UCQ, ∃FO+ or FO. ✷

Proof sketch: To show the existence of currency-preserving extensions of ρ̄, we give an algorithm to construct one. Let D′ be the data source in S from which tuples can be copied. For each copy function ρ in ρ̄, if ρ can be extended, the algorithm expands ρ by copying tuples one by one from D′, starting from the most current ones, as long as the extended instances satisfy the denial constraints in S and the constraints of the copy functions. It proceeds until no more tuples from D′ can be copied while satisfying the constraints, and yields a currency-preserving extension of ρ̄. ✷

Bounded extensions. In contrast to ECP, when it comes to deciding whether ρ̄ can be made currency-preserving by copying data within a bounded size, the analysis becomes far more intricate.
Indeed, the result below tells us that even for CQ, BCP is $\Sigma_4^p$-hard, and fixing denial constraints and copy functions does not help. When both queries and denial constraints are fixed, BCP is $\Sigma_3^p$-complete.

Minimal copying. We next study MCP(LQ), which is to decide whether currency-preserving copy functions import any data that is unnecessary or redundant for a query Q. This problem is even harder than CPP: for CQ queries, its combined complexity is $\Delta_4^p$-complete ($P^{\Sigma_3^p}$) even if denial constraints are fixed, and its data complexity is $\Delta_3^p$-complete. For FO queries, it is still PSPACE-complete.

Theorem 6.2: For MCP(LQ), the combined complexity is
• $\Delta_4^p$-complete when LQ is CQ, UCQ or ∃FO+, and
• PSPACE-complete when LQ is FO.
• Its data complexity is $\Delta_3^p$-complete when LQ ∈ {CQ, UCQ, ∃FO+, FO}.
The combined complexity bounds remain unchanged when denial constraints and copy functions are fixed. ✷

Theorem 6.4: For BCP(LQ), the combined complexity is
• $\Sigma_4^p$-complete when LQ is CQ, UCQ or ∃FO+, and
• PSPACE-complete when LQ is FO.
• Its data complexity is $\Sigma_3^p$-complete when LQ ∈ {CQ, UCQ, ∃FO+, FO}.
The combined complexity bounds remain unchanged when denial constraints and copy functions are fixed. ✷

Proof sketch: (1) Lower bounds. For FO, we show that BCP is PSPACE-hard by a simple reduction from Q3SAT. For CQ, we show that BCP is $\Sigma_4^p$-hard by reduction from the ∃∗∀∗∃∗∀∗3sat problem, which is $\Sigma_4^p$-complete (cf. [28]). Given a sentence φ = ∃W∀X∃Y∀Z ψ(W, X, Y, Z), we define a CQ query Q, a positive number k and a specification S with two copy functions ρ̄. We show that φ is true iff there exists an extension $\bar{\rho}^e$ of ρ̄ such that |$\bar{\rho}^e$| ≤ |ρ̄| + k and $\bar{\rho}^e$ is currency preserving for Q. We use temporal instances in S to encode disjunction and negation, which Q leverages to express φ. We also define a data source in S whose current instance ranges over all truth assignments for W, and use the data source to inspect possible extensions of ρ̄.
For the data complexity, we verify the $\Sigma_3^p$ lower bound for CQ by reduction from the ∃∗∀∗∃∗3sat problem. This reduction is more involved than the one developed for the combined complexity. Given an ∃∗∀∗∃∗ sentence φ, we define a query Q in CQ that is independent of φ but nonetheless, is able to encode the negations of the clauses in φ, by making use of temporal instances in a specification constructed. In all these proofs, denial constraints and copy functions are independent of input sentences, i.e., they are fixed.

Proof sketch: (1) Lower bounds. For FO, the proofs are similar to their counterparts of CPS. For CQ queries, we show that MCP is $\Delta_4^p$-hard (combined complexity) by reduction from the msa(∃∗∀∗∃∗3sat) problem, which is $\Delta_4^p$-complete [26]. The latter problem is to determine, given a satisfiable sentence φ = ∃X∀Y∃Z ψ(X, Y, Z), whether in the lexicographically maximum truth assignment $\mu_X$ for variables in X such that ∀Y∃Z ψ($\mu_X(x_1)$, ..., $\mu_X(x_n)$, Y, Z) evaluates to true, the last variable $x_n$ ∈ X has value 1, i.e., whether $\mu_X(x_n)$ = 1. For the data complexity, we verify that it is $\Delta_3^p$-hard for CQ by reduction from the msa(∃∗∀∗3sat) problem. These reductions need to encode the lexicographical successor function for variables in X, and are more involved than those given in the proof of Theorem 6.1. (2) Upper bounds. We provide an algorithm that takes as input a query Q and a specification S with currency preserving copy functions ρ̄. For each tuple t copied by ρ̄, it checks whether removing t makes the copy functions not currency preserving, by using the algorithm for CPP in the proof of Theorem 6.1 as an oracle. Hence MCP is in $\Delta_4^p$ ($P^{\Sigma_3^p}$, combined complexity) and in $\Delta_3^p$ (data complexity). ✷

The feasibility of currency preservation. We now consider ECP(LQ). It is to decide, given a query Q and a spec-

(2) Upper bounds. We develop an algorithm for BCP, which first guesses an extension $\bar{\rho}^e$ of ρ̄ that copies more data of a
It is to decide, given a query Q and a spec- 80 bounded size, and then checks whether ρ̄e is currency preserving, by invoking the algorithm for CPP. From Theorem 6.1 the upper bounds for BCP follow immediately. ✷ (2) Lower bounds. We show that when k is fixed, BCP is already ∆p4 -hard (combined complexity) for CQ queries, by reduction from the msa(∃∗ ∀∗ ∃∗ 3sat) problem. In addition, we verify that it is ∆p3 -hard (data complexity) for CQ queries by reduction from the msa(∃∗ ∀∗ 3sat) problem. For FO queries, we show that it remains PSPACE-hard (combined complexity) by reduction from Q3SAT. ✷ Special cases. Theorem 6.1, 6.2 and 6.4 motivate us to explore special cases that simplify the analyses. In contrast to Theorem 4.1 and Corollary 4.2, we have seen that fixing denial constraints does not make our lives easier when it comes to CPP, MCP or BCP. However, when denial constraints are absent, these problems become tractable for SP queries. This is consistent with Corollary 4.6. 7. Conclusions We have proposed a model to specify the currency of data in the absence of reliable timestamps but in the presence of copy relationships. We have also introduced a notion of currency preservation to assess copy functions for query answering. We have identified eight fundamental problems associated with data currency and currency preservation (CPS, COP, DCIP, CCQA(LQ ), CPP(LQ ), MCP(LQ ), ECP(LQ ) and BCP(LQ )). We have provided a complete picture of the lower and upper bounds of these problems, all matching, for their data complexity as well as combined complexity when LQ ranges over a variety of query languages. These results are not only of theoretical interest in their own right, but may also help practitioners distinguish current values from stale data, answer queries with current data, and design proper copy functions to import data from external sources. The main complexity results are summarized in Tables 2 and 3, annotated with their corresponding theorems. 
The study of data currency is still preliminary. An open issue concerns generalizations of copy functions. To simplify the presentation we assume a single copy function from one relation to another. Nonetheless we believe that all the results remain intact when multiple such functions coexist. For currency-preserving copy functions, we assume that the signatures “cover” all attributes (except EID) of the importing relation. It is nontrivial to relax this requirement, however, since otherwise unknown values need to be introduced for attributes whose value is not provided by the extended copy functions.

A second issue is about practical use of the study. As shown in Tables 2 and 3, most of the problems are intractable. To this end we expect to (a) identify practical PTIME cases in various applications, (b) develop efficient heuristic algorithms with certain performance guarantees, and (c) conduct incremental analysis when data or copy functions are updated, which is expected to result in a lower complexity than its batch counterpart when the area affected by the updates is small, as commonly found in practice.

A third issue concerns the interaction between data consistency and data currency. There is an intimate connection between these two central issues of data quality. Indeed, identifying the current value of an entity helps resolve data inconsistencies, and conversely, repairing data helps remove obsolete data. While these processes should logically be unified, we are not aware of any previous work on this topic.

Finally, it is interesting to develop syntactic characterizations of currency-preserving copy functions.

Theorem 6.5: When denial constraints are absent, for SP queries both the combined complexity and the data complexity are in PTIME for CPP, MCP and BCP (when the bound k on the size of additional data copied is fixed). ✷

Proof sketch: (1) CPP. We develop a PTIME algorithm A that takes as input a specification S and an SP query Q defined on instances of a relation schema R. It checks whether the copy functions $\bar\rho$ in S are currency preserving for Q. The main idea behind A is as follows. Let E be the set of all entities in $D_t$, where $D_t$ is the temporal instance of R in S. Let C be the set of certain current answers to Q. By Corollary 4.6, C can be computed in PTIME. For each e ∈ E, A inspects tuples t in $D'$ one by one, to find whether t would “spoil” some answer s ∈ C produced by the current tuple of e if t were imported, i.e., adding t would make LST(e, $D_c$) either empty or a distinct tuple that does not produce s via Q. Here $D'$ is the data source in S from which tuples are copied. When denial constraints are absent and Q is in SP, this can be done in PTIME. The algorithm then checks spoilers for all e ∈ E, to find whether there exists s ∈ C spoiled by importing tuples for all entities that yield s, or whether some spoilers introduce a new certain current answer. This can be done in PTIME since the entities in E are independent of each other. The algorithm returns “yes” iff no such spoilers exist. It is in PTIME as argued above.

(2) MCP. Given A, an algorithm for MCP is immediate: for each “reduced” $\bar\rho_c$ that removes one imported tuple from $\bar\rho$, it checks whether $\bar\rho_c$ is currency preserving for Q by using A. The algorithm is obviously in PTIME.

(3) BCP. When the bound k is fixed, there are polynomially many extensions $\bar\rho_e$ of $\bar\rho$ such that $|\bar\rho_e| \le |\bar\rho| + k$. For each such $\bar\rho_e$ we check whether $\bar\rho_e$ is currency preserving for Q by invoking A. This can be done in PTIME. ✷

For BCP($\mathcal{L}_Q$), when the bound k is predefined and fixed, the analysis also becomes simpler when $\mathcal{L}_Q$ is CQ, UCQ or ∃FO$^+$. For FO, however, BCP remains PSPACE-complete.
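Parts (2) and (3) of the proof sketch of Theorem 6.5 both reduce to repeated calls to a currency-preservation checker. The following is a minimal Python sketch of that pattern, not the authors' implementation. It assumes, purely for illustration, that a copy function can be modeled as a frozen set of imported tuples and that the checker (algorithm A, or a CPP oracle in the general case) is supplied as a callable; the names `is_minimal`, `bounded_extension` and `preserving` are hypothetical.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Hashable, Iterable, Optional

# Assumption: a copy function is modeled as the set of tuples it imports.
Copied = FrozenSet[Hashable]
# Assumption: the currency-preservation check is given as a black box.
Checker = Callable[[Copied], bool]


def is_minimal(copied: Copied, preserving: Checker) -> bool:
    """MCP-style check: True iff removing any single imported tuple
    yields a copy function that is no longer currency preserving.
    Assumes `copied` itself is currency preserving."""
    return all(not preserving(copied - {t}) for t in copied)


def bounded_extension(
    copied: Copied,
    candidates: Iterable[Hashable],
    k: int,
    preserving: Checker,
) -> Optional[Copied]:
    """BCP-style search: return some extension of `copied` that adds at
    most k candidate tuples and passes the checker, or None if none does.
    Size 0 is included, i.e. `copied` itself is tried first (an assumption)."""
    pool = [t for t in candidates if t not in copied]
    for size in range(k + 1):
        # For fixed k this enumerates polynomially many subsets of the pool.
        for extra in combinations(pool, size):
            extended = copied | frozenset(extra)
            if preserving(extended):
                return extended
    return None


if __name__ == "__main__":
    # Toy checker (hypothetical): "preserving" iff tuples a and b are imported.
    toy = lambda s: {"a", "b"} <= s
    print(is_minimal(frozenset({"a", "b"}), toy))
    print(bounded_extension(frozenset({"a"}), ["b", "c"], 1, toy))
```

With n candidate tuples the search makes at most $\sum_{i=0}^{k} \binom{n}{i}$ checker calls, which is polynomial in n when k is fixed; this is the counting step the proof sketch relies on.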
Corollary 6.6: When k is fixed, BCP($\mathcal{L}_Q$) is
• $\Delta_4^p$-complete (combined) for CQ, UCQ and ∃FO$^+$,
• $\Delta_3^p$-complete (data) for CQ, UCQ and ∃FO$^+$, but is
• PSPACE-complete for FO (combined complexity). ✷

Proof sketch: (1) Upper bounds. As remarked earlier, when k is fixed there are polynomially many extensions $\bar\rho_e$ of $\bar\rho$ such that $|\bar\rho_e| \le |\bar\rho| + k$. For each such $\bar\rho_e$ we check whether $\bar\rho_e$ is currency preserving, by invoking the algorithm for checking CPP given in the proof of Theorem 6.1. Therefore, BCP is in $P^{\Sigma_3^p} = \Delta_4^p$ for ∃FO$^+$ (combined complexity). Moreover, when queries and denial constraints are fixed, it is in $P^{\Sigma_2^p} = \Delta_3^p$ (data complexity). For FO, the CPP checking is in PSPACE, and hence so is BCP.

| | CPS | COP | DCIP |
|---|---|---|---|
| Data complexity | NP-complete (Th 4.1) | coNP-complete (Cor 4.2) | coNP-complete (Cor 4.2) |
| Combined complexity | $\Sigma_2^p$-complete (Th 4.1) | $\Pi_2^p$-complete (Cor 4.2) | $\Pi_2^p$-complete (Cor 4.2) |
| Special case: in the absence of denial constraints (combined and data complexity) | PTIME (Th 4.4) | PTIME (Th 4.4) | PTIME (Th 4.4) |

Table 2: Complexity of problems for reasoning about data currency (CPS, COP, DCIP)

| Complexity | CCQA($\mathcal{L}_Q$) | CPP($\mathcal{L}_Q$) | MCP($\mathcal{L}_Q$) | ECP($\mathcal{L}_Q$) | BCP($\mathcal{L}_Q$) |
|---|---|---|---|---|---|
| Data | coNP-complete (Th 4.3) | $\Pi_2^p$-complete (Th 6.1) | $\Delta_3^p$-complete (Th 6.2) | O(1) (Prop 6.3) | $\Sigma_3^p$-complete (Th 6.4) |
| Combined, $\mathcal{L}_Q$ is CQ, UCQ or ∃FO$^+$ | $\Pi_2^p$-complete (Th 4.3) | $\Pi_3^p$-complete (Th 6.1) | $\Delta_4^p$-complete (Th 6.2) | O(1) (Prop 6.3) | $\Sigma_4^p$-complete (Th 6.4) |
| Combined, $\mathcal{L}_Q$ is FO | PSPACE-complete (Th 4.3) | PSPACE-complete (Th 6.1) | PSPACE-complete (Th 6.2) | O(1) (Prop 6.3) | PSPACE-complete (Th 6.4) |
| Special case: SP queries in the absence of denial constraints (combined and data) | PTIME (Cor 4.6) | PTIME (Th 6.5) | PTIME (Th 6.5) | O(1) (Prop 6.3) | PTIME (Th 6.5) |

Table 3: Complexity of problems for query answering and for determining currency preservation

Acknowledgments. Fan and Geerts are supported in part by an IBM scalable data analytics for a smarter planet innovation award, and the RSE-NSFC Joint Project Scheme.

8. References

[1] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995.
[2] L. Berti-Equille, A. D. Sarma, X. Dong, A. Marian, and D. Srivastava. Sailing the information ocean with awareness of currents: Discovery and application of source dependence. In CIDR, 2009.
[3] L. Bertossi. Consistent query answering in databases. SIGMOD Rec., 35(2), 2006.
[4] M. Bodirsky and J. Kára. The complexity of temporal constraint satisfaction problems. JACM, 57(2), 2010.
[5] P. Buneman, J. Cheney, W. Tan, and S. Vansummeren. Curated databases. In PODS, 2008.
[6] J. Cheney, L. Chiticariu, and W. C. Tan. Provenance in databases: Why, how, and where. Foundations and Trends in Databases, 1(4):379–474, 2009.
[7] J. Chomicki. Consistent query answering: Five easy pieces. In ICDT, 2007.
[8] J. Chomicki and D. Toman. Time in database systems. In M. Fisher, D. Gabbay, and L. Vila, editors, Handbook of Temporal Reasoning in Artificial Intelligence. Elsevier, 2005.
[9] J. Clifford, C. E. Dyreson, T. Isakowitz, C. S. Jensen, and R. T. Snodgrass. On the semantics of “now” in databases. TODS, 22(2):171–214, 1997.
[10] E. F. Codd. Extending the database relational model to capture more meaning. TODS, 4(4):397–434, 1979.
[11] A. Deutsch, A. Nash, and J. B. Remmel. The chase revisited. In PODS, 2008.
[12] X. Dong, L. Berti-Equille, Y. Hu, and D. Srivastava. Global detection of complex copying relationships between sources. In VLDB, 2010.
[13] X. Dong, L. Berti-Equille, and D. Srivastava. Truth discovery and copying detection in a dynamic world. In VLDB, 2009.
[14] C. E. Dyreson, C. S. Jensen, and R. T. Snodgrass. Now in temporal databases. In L. Liu and M. T. Özsu, editors, Encyclopedia of Database Systems. Springer, 2009.
[15] W. W. Eckerson. Data quality and the bottom line: Achieving business success through a commitment to high quality data. Data Warehousing Institute, 2002.
[16] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. TKDE, 19(1):1–16, 2007.
[17] W. Fan, F. Geerts, J. Li, and M. Xiong. Discovering conditional functional dependencies. TKDE, 23(4):683–698, 2011.
[18] M. Garey and D. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company, 1979.
[19] G. Grahne. The Problem of Incomplete Information in Relational Databases. Springer, 1991.
[20] M. Grohe and G. Schwandtner. The complexity of datalog on linear orders. Logical Methods in Computer Science, 5(1), 2009.
[21] T. Imieliński and W. Lipski, Jr. Incomplete information in relational databases. JACM, 31(4), 1984.
[22] Knowledge Integrity. Two sides to data decay. DM Review, 2003.
[23] P. G. Kolaitis. Schema mappings, data exchange, and metadata management. In PODS, 2005.
[24] M. Koubarakis. Database models for infinite and indefinite temporal information. Inf. Syst., 19(2):141–173, 1994.
[25] M. Koubarakis. The complexity of query evaluation in indefinite temporal constraint databases. TCS, 171(1–2):25–60, 1997.
[26] M. W. Krentel. Generalizations of OptP to the polynomial hierarchy. TCS, 97(2):183–198, 1992.
[27] M. Lenzerini. Data integration: A theoretical perspective. In PODS, 2002.
[28] C. H. Papadimitriou. Computational Complexity. Addison-Wesley, 1994.
[29] E. Schwalb and L. Vila. Temporal constraints: A survey. Constraints, 3(2-3):129–149, 1998.
[30] R. T. Snodgrass. Developing Time-Oriented Database Applications in SQL. Morgan Kaufmann, 1999.
[31] R. van der Meyden. The complexity of querying indefinite data about linearly ordered domains. JCSS, 54(1), 1997.
[32] R. van der Meyden. Logical approaches to incomplete information: A survey. In J. Chomicki and G. Saake, editors, Logics for Databases and Information Systems. Kluwer, 1998.
[33] V. Vianu. Dynamic functional dependencies and database aging. J. ACM, 34(1):28–59, 1987.
[34] H. Zhang, Y. Diao, and N. Immerman. Recognizing patterns in streams with imprecise timestamps. In VLDB, 2010.
