Efficiently repairing and measuring replica consistency in distributed databases

Distributed and Parallel Databases

Abstract

In a distributed database, maintaining large table replicas with frequent asynchronous insertions is a challenging problem that requires carefully managing a tradeoff between consistency and availability. With that motivation in mind, we propose efficient algorithms to repair and measure replica consistency. Specifically, we adapt, extend and optimize distributed set reconciliation algorithms to efficiently compute the symmetric difference between replicated tables in a distributed relational database. Our novel algorithms enable fast synchronization of replicas being updated with small sets of new records, measuring the obsolescence of replicas having many insertions, and deciding when to update a replica, as each table replica is being continuously updated in an asynchronous manner. We first present an algorithm to repair and measure distributed consistency on a large table continuously updated with new records at several sites when the number of insertions is small. We then present a complementary algorithm that enables fast synchronization of a summarization table based on foreign keys when the number of insertions is large but concentrated on a few foreign key values. From a distributed systems perspective, in the first algorithm the large table with data is reconciled, whereas in the second algorithm its summarization table is reconciled. Both distributed database algorithms have linear communication complexity and cubic time complexity in the size of the symmetric difference between the respective table replicas they work on. That is, they are effective when the network speed is low relative to the CPU speed at each site. An experimental performance evaluation with synthetic and real databases shows that our algorithms are faster than a previous state-of-the-art algorithm, as well as more efficient than transferring complete tables, assuming large replicated tables and sporadic asynchronous insertions.

Notes

  1. In the real-world example, where the actual finite field sizes are many orders of magnitude greater than in our toy-size example above, there are practical advantages to picking a prime q that is close to, i.e., only slightly greater than, the applicable value of v. So, ideally, one would want to pick the smallest prime strictly greater than v, but any prime greater than v would mathematically still work out just fine.
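
The choice of q described in this note can be illustrated with a minimal sketch. The function names `is_prime` and `smallest_prime_above` are hypothetical, and trial division is a simplification chosen only for readability; for field sizes of realistic magnitude one would presumably rely on a probabilistic primality test from a number theory library instead.

```python
def is_prime(n: int) -> bool:
    """Trial-division primality test; only adequate for small illustrative values."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def smallest_prime_above(v: int) -> int:
    """Return the smallest prime q that is strictly greater than v."""
    q = max(v + 1, 2)
    while not is_prime(q):
        q += 1
    return q
```

Any prime greater than v would also work; the sketch simply returns the one closest to v.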

Acknowledgements

The second author was partially supported by NSF grant IIS 0914861. The third author is currently working for Microsoft Corp.

We would like to thank Sergio Rajsbaum for an inspiring conversation that led to an earlier version of this article. We would also like to thank Claudia Morales-Almonte and Rogelio Montero-Campos for programming our algorithms to conduct the experimental evaluation.

Author information

Corresponding author

Correspondence to Javier García-García.

Appendix

Lemma 1

If \(\forall D_{i}\), \(i=0,\ldots,N-1\), \(\operatorname{cur}(D_{i}.T_{p}) = 0\) (i.e., if all replicas of table \(T_{p}\) are replica consistent), then \(\operatorname{gcur}(T_{p}) = 0\) (that is, table \(T_{p}\) is global replica consistent).

Proof

If \(\forall D_{i}\), \(i=0,\ldots,N-1\), \(\operatorname{cur}(D_{i}.T_{p}) = 0\), then by Eqs. (1) and (2) we have that

(13)

Since \(|\bigcup_{\substack{i=0}}^{N-1}D_{i}.T_{p}| = |D_{0}.T_{p} \cup ( \bigcup_{\substack{i=1}}^{N-1}(D_{i}.T_{p}- D_{0}.T_{p}) )|\) and \(|\bigcap_{\substack{i=0}}^{N-1}D_{i}.T_{p}| = |D_{0}.T_{p} - ( \bigcup_{\substack{i=1}}^{N-1}(D_{0}.T_{p}- D_{i}.T_{p}) )|\), by virtue of Eq. (13) we have \(|\bigcup_{\substack {i=0}}^{N-1}D_{i}.T_{p}| = |D_{0}.T_{p} |\) and \(|\bigcap_{\substack{i=0}}^{N-1}D_{i}.T_{p}| = |D_{0}.T_{p} |\), and consequently, by definition of \(\operatorname {gcur}()\) (Eq. (3)), it follows that

$$ \operatorname{gcur}(T_p) = 0 $$
(14)

 □
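
The two set identities used in the proof of Lemma 1 can be checked on small examples. The sketch below is only an illustration under stated assumptions: each replica \(D_{i}.T_{p}\) is modeled as a Python set of primary key values, and the function names are hypothetical. It does not compute \(\operatorname{cur}()\) or \(\operatorname{gcur}()\), whose definitions are given by Eqs. (1) to (3).

```python
def union_via_differences(replicas):
    """Compute D_0 ∪ (∪_{i>=1} (D_i - D_0)), which equals the union of all replicas."""
    d0, rest = replicas[0], replicas[1:]
    extra = set().union(*(d - d0 for d in rest))  # rows that D_0 is missing
    return d0 | extra


def intersection_via_differences(replicas):
    """Compute D_0 - (∪_{i>=1} (D_0 - D_i)), which equals the intersection of all replicas."""
    d0, rest = replicas[0], replicas[1:]
    missing = set().union(*(d0 - d for d in rest))  # rows of D_0 absent elsewhere
    return d0 - missing


# Inconsistent replicas: union and intersection differ in size.
replicas = [{1, 2, 3}, {2, 3, 4}, {3, 5}]
assert union_via_differences(replicas) == set.union(*replicas)
assert intersection_via_differences(replicas) == set.intersection(*replicas)

# Consistent replicas: every difference is empty, so both collapse to D_0.
same = [{1, 2, 3}, {1, 2, 3}, {1, 2, 3}]
assert union_via_differences(same) == intersection_via_differences(same) == same[0]
```

When every pairwise difference with \(D_{0}.T_{p}\) is empty, both expressions reduce to \(D_{0}.T_{p}\), which is the step the proof relies on.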

Lemma 2

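$$ |\mathcal{T}_{r\cup} \Join_{K} \mathcal{T}_{s\cup}| = |\mathcal{T}_{r\cup}| - |(\pi_{K}(\mathcal{T}_{r\cup}) - \pi_{K}(\mathcal{T}_{s\cup})) \Join_{K} \mathcal{T}_{r\cup}| - |\sigma_{\text{isnull}(K)}(\mathcal{T}_{r\cup})| $$
(15)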

Proof

By definition \(| \mathcal{T}_{r\cup} \Join_{K} \mathcal{T}_{s\cup} |\) is the number of tuples in \(\mathcal{T}_{r\cup}\) having their foreign key K values match a primary key K value in \(\mathcal{T}_{s\cup}\). \(\pi_{K} (\mathcal{T}_{r\cup}) - \pi_{K} (\mathcal{T}_{s\cup})\) is a one attribute relation with values in \(\pi_{K} (\mathcal{T}_{r\cup})\) that are not in \(\pi_{K} (\mathcal{T}_{s\cup})\), so \((\pi_{K} (\mathcal{T}_{r\cup }) - \pi_{K} (\mathcal{T}_{s\cup})) \Join_{K} \mathcal{T}_{r\cup}\) are the tuples in \(\mathcal{T}_{r\cup}\) that do not have a foreign key value matching a value in \(\pi_{K} (\mathcal{T}_{s\cup})\). Observe that in these tuples, K is not null.

Consequently, \(|\mathcal{T}_{r\cup}| - | (\pi_{K} (\mathcal{T}_{r\cup}) - \pi_{K} (\mathcal{T}_{s\cup})) \Join_{K} \mathcal{T}_{r\cup} |\) is the number of tuples in \(\mathcal{T}_{r\cup}\) with their foreign key K values matching a primary key K value in \(\mathcal{T}_{s\cup}\) plus the tuples with a null value in foreign key K. The last subtraction, \(- | \sigma_{\text{isnull}(K)} (\mathcal{T}_{r\cup})|\), is to discount these tuples. □

To evaluate the above expression, among other computations we need the difference \(\pi_{K} ( \mathcal{T}_{r\cup}) - \pi_{K} ( \mathcal {T}_{s\cup})\). This expression evaluates to a table containing the invalid foreign key values, that is, a table with the inclusion dependency violations. We note that, if the number of violations is low, this difference is relatively small compared to the size of \(\mathcal{T}_{s\cup}\) or even \(\pi_{K} ( \mathcal{T}_{s\cup})\).
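
The counting argument in the proof of Lemma 2 can be sketched in a few lines. This is a minimal illustration under assumptions not taken from the article: the referencing table is modeled as a list of `(row_id, K)` pairs where `K` may be `None` (null), the referenced table is reduced to the set of its non-null primary key values, and the function names are hypothetical.

```python
def join_count(t_r, s_keys):
    """Directly count the tuples of T_r whose foreign key K matches a key in T_s."""
    return sum(1 for _, k in t_r if k is not None and k in s_keys)


def join_count_via_difference(t_r, s_keys):
    """Count the same tuples as |T_r| minus dangling tuples minus null-key tuples."""
    r_keys = {k for _, k in t_r}   # pi_K(T_r); may contain None
    invalid = r_keys - s_keys      # pi_K(T_r) - pi_K(T_s): invalid foreign key values
    # A join on K never matches null, so None is excluded explicitly here.
    dangling = sum(1 for _, k in t_r if k is not None and k in invalid)
    nulls = sum(1 for _, k in t_r if k is None)  # sigma_isnull(K)(T_r)
    return len(t_r) - dangling - nulls


t_r = [(1, "a"), (2, "b"), (3, None), (4, "z"), (5, "a")]
s_keys = {"a", "b", "c"}
assert join_count(t_r, s_keys) == join_count_via_difference(t_r, s_keys) == 3
```

The second function only needs the set of invalid foreign key values, which is the small table the preceding paragraph refers to.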

About this article

Cite this article

García-García, J., Ordonez, C. & Tosic, P.T. Efficiently repairing and measuring replica consistency in distributed databases. Distrib Parallel Databases 31, 377–411 (2013). https://doi.org/10.1007/s10619-012-7116-0

  • DOI: https://doi.org/10.1007/s10619-012-7116-0
