The pq-gram distance between ordered labeled trees

Published: 15 February 2010

Abstract

When integrating data from autonomous sources, exact matches of data items that represent the same real-world object often fail due to a lack of common keys. Yet in many cases structural information is available and can be used to match such data. Typically the matching must be approximate since the representations in the sources differ.
We propose pq-grams to approximately match hierarchical data from autonomous sources and define the pq-gram distance between ordered labeled trees as an effective and efficient approximation of the fanout weighted tree edit distance. We prove that the pq-gram distance is a lower bound of the fanout weighted tree edit distance and give a normalization of the pq-gram distance for which the triangle inequality holds. Experiments on synthetic and real-world data (residential addresses and XML) confirm the scalability of our approach and show the effectiveness of pq-grams.
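
The abstract does not spell out how a pq-gram profile is built, so the following is only a minimal sketch under stated assumptions. It assumes the commonly described construction: each pq-gram consists of a node with its p-1 ancestors and q consecutive children, the tree is padded with dummy `*` nodes, and the distance is the size of the bag symmetric difference of the two profiles. The tuple-based tree encoding, the function names, and the defaults p=2 and q=3 are hypothetical choices for illustration, not taken from the paper.

```python
from collections import Counter

DUMMY = "*"  # assumed label for the padding nodes


def pq_gram_profile(tree, p=2, q=3):
    """Return the bag (Counter) of pq-gram label tuples of `tree`.

    A tree is encoded as (label, [children]); this encoding is an
    assumption made for the sketch.
    """
    profile = Counter()

    def visit(node, ancestors):
        label, children = node
        # Shift the current label into the ancestor register of size p.
        anc = ancestors[1:] + (label,)
        sib = (DUMMY,) * q
        if not children:
            # A leaf contributes one pq-gram with q dummy children.
            profile[anc + sib] += 1
            return
        for child in children:
            # Slide the sibling window of size q over the children.
            sib = sib[1:] + (child[0],)
            profile[anc + sib] += 1
            visit(child, anc)
        for _ in range(q - 1):
            # Pad with q-1 dummies after the last child.
            sib = sib[1:] + (DUMMY,)
            profile[anc + sib] += 1

    visit(tree, (DUMMY,) * p)
    return profile


def pq_gram_distance(t1, t2, p=2, q=3):
    """Size of the bag symmetric difference of the two profiles."""
    p1 = pq_gram_profile(t1, p, q)
    p2 = pq_gram_profile(t2, p, q)
    shared = sum((p1 & p2).values())  # bag intersection size
    total = sum(p1.values()) + sum(p2.values())  # bag union size
    return total - 2 * shared


t1 = ("a", [("b", []), ("c", [])])
t2 = ("a", [("b", []), ("d", [])])
print(pq_gram_distance(t1, t1))  # identical trees have distance 0
print(pq_gram_distance(t1, t2))
```

Because the profile is computed in a single traversal and the comparison is a bag intersection, the cost grows roughly linearly with tree size, which is in the spirit of the scalability claim in the abstract. The normalization mentioned there is not reproduced here.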


Published In

ACM Transactions on Database Systems  Volume 35, Issue 1
February 2010
310 pages
ISSN:0362-5915
EISSN:1557-4644
DOI:10.1145/1670243
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Accepted: 01 August 2009
Revised: 01 May 2009
Received: 01 November 2008
Published: 15 February 2010
Published in TODS Volume 35, Issue 1

Author Tags

  1. Database algorithms
  2. XML
  3. approximate matching
  4. distance metric
  5. hierarchical data
  6. similarity search
  7. tree edit distance

Qualifiers

  • Research-article
  • Research
  • Refereed

Cited By

  • (2025) Simple and efficient Hash sketching for tree-structured data. Expert Systems with Applications 267 (125973). DOI: 10.1016/j.eswa.2024.125973. Online publication date: Apr-2025
  • (2024) X-TED: Massive Parallelization of Tree Edit Distance. Proceedings of the VLDB Endowment 17:7 (1683-1696). DOI: 10.14778/3654621.3654634. Online publication date: 1-Mar-2024
  • (2023) Employing Source Code Quality Analytics for Enriching Code Snippets Data. Data 8:9 (140). DOI: 10.3390/data8090140. Online publication date: 31-Aug-2023
  • (2023) Feedforward-Aided Course Designs for Similarity Search. Proceedings of the 2nd International Workshop on Data Systems Education: Bridging education practice with education research (14-17). DOI: 10.1145/3596673.3596974. Online publication date: 23-Jun-2023
  • (2023) DualTaxoVec: Web user embedding and taxonomy generation. Knowledge-Based Systems 271 (110565). DOI: 10.1016/j.knosys.2023.110565. Online publication date: Jul-2023
  • (2023) Using a Conceptual Model in Plug-and-Play SQL. Conceptual Modeling (145-161). DOI: 10.1007/978-3-031-47262-6_8. Online publication date: 6-Nov-2023
  • (2023) Towards Extracting Reusable and Maintainable Code Snippets. Software Technologies (187-206). DOI: 10.1007/978-3-031-37231-5_9. Online publication date: 19-Jul-2023
  • (2022) SyncSignature. Proceedings of the VLDB Endowment 16:2 (330-342). DOI: 10.14778/3565816.3565833. Online publication date: 1-Oct-2022
  • (2022) PQ-Diff: A Business Process Difference Detection and Interpretation Method based on the Common Key Structure. 2022 IEEE International Conference on Services Computing (SCC) (185-195). DOI: 10.1109/SCC55611.2022.00037. Online publication date: Jul-2022
  • (2021) Top-k Tree Similarity Join. Proceedings of the 30th ACM International Conference on Information & Knowledge Management (1939-1948). DOI: 10.1145/3459637.3482304. Online publication date: 26-Oct-2021