DOI: 10.1145/773153.773168

Correlating XML data streams using tree-edit distance embeddings

Published: 09 June 2003

Abstract

We propose the first known solution to the problem of correlating, in small space, continuous streams of XML data through approximate (structure and content) matching, as defined by a general tree-edit distance metric. The key element of our solution is a novel algorithm for obliviously embedding tree-edit distance metrics into an L1 vector space while guaranteeing an upper bound of O(log² n log* n) on the distance distortion between any data trees with at most n nodes. We demonstrate how our embedding algorithm can be applied in conjunction with known random sketching techniques to: (1) build a compact synopsis of a massive, streaming XML data tree that can be used as a concise surrogate for the full tree in approximate tree-edit distance computations; and, (2) approximate the result of tree-edit distance similarity joins over continuous XML document streams. To the best of our knowledge, these are the first algorithmic results on low-distortion embeddings for tree-edit distance metrics, and on correlating (e.g., through similarity joins) XML data in the streaming model.
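
To make the "random sketching" step more concrete, the following is a minimal sketch of one standard way to estimate L1 distances in small space: Cauchy (1-stable) random projections with a median estimator. It is an illustration under stated assumptions, not the paper's algorithm. The tree-edit embedding itself is not reproduced here, so the input vectors are stand-ins for embedded trees, and the names `L1Sketch`, `sketch_of`, `k`, and `seed` are hypothetical.

```python
import math
import random
import statistics


class L1Sketch:
    """Linear sketch of a sparse vector for approximate L1 distance.

    Illustrative only: in the paper's setting the coordinates would come
    from the tree-edit distance embedding, which is NOT implemented here.
    """

    def __init__(self, k=400, seed=0):
        self.k = k          # number of random projections (sketch size)
        self.seed = seed    # sketches are comparable only if seeds match
        self.rows = [0.0] * k

    def _cauchy(self, row, coord):
        # Regenerate the same standard Cauchy variate for (row, coord) on
        # demand, so the projection matrix is never stored explicitly.
        u = random.Random(f"{self.seed}:{row}:{coord}").random()
        return math.tan(math.pi * (u - 0.5))

    def update(self, coord, delta=1.0):
        # Streaming update: add `delta` to coordinate `coord` of the vector.
        for j in range(self.k):
            self.rows[j] += delta * self._cauchy(j, coord)

    def distance(self, other):
        # Each row difference is Cauchy-distributed with scale ||x - y||_1,
        # and the median of |standard Cauchy| is 1, so the median of the
        # absolute row differences estimates the L1 distance.
        if (self.k, self.seed) != (other.k, other.seed):
            raise ValueError("sketches must share k and seed")
        return statistics.median(
            abs(a - b) for a, b in zip(self.rows, other.rows)
        )


def sketch_of(vector, k=400, seed=0):
    """Build a sketch from a {coordinate: value} mapping."""
    sketch = L1Sketch(k=k, seed=seed)
    for coord, value in vector.items():
        sketch.update(coord, value)
    return sketch


if __name__ == "__main__":
    # Two hypothetical embedded vectors whose exact L1 distance is 7.
    x = {"a": 3, "b": 1, "c": 2}
    y = {"a": 1, "c": 2, "d": 4}
    print(sketch_of(x, seed=7).distance(sketch_of(y, seed=7)))
```

Because the sketch is linear, it can be maintained incrementally as stream elements arrive, and two sketches built with the same `k` and `seed` can be compared without access to the original vectors. The estimate is randomized; a larger `k` tightens it at the cost of more space.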

Published In

PODS '03: Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
June 2003
291 pages
ISBN:1581136706
DOI:10.1145/773153
  • Conference Chair: Frank Neven
  • General Chair: Catriel Beeri
  • Program Chair: Tova Milo
Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Article

Conference

SIGMOD/PODS03

Acceptance Rates

PODS '03 Paper Acceptance Rate 27 of 136 submissions, 20%;
Overall Acceptance Rate 642 of 2,707 submissions, 24%

Cited By

  • (2024) Fast Comparative Analysis of Merge Trees Using Locality Sensitive Hashing. IEEE Transactions on Visualization and Computer Graphics 31:1 (141-151). DOI: 10.1109/TVCG.2024.3456383. Online publication date: 12-Sep-2024
  • (2022) JEDI: These aren't the JSON documents you're looking for... Proceedings of the 2022 International Conference on Management of Data (1584-1597). DOI: 10.1145/3514221.3517850. Online publication date: 10-Jun-2022
  • (2019) Effective Filters and Linear Time Verification for Tree Similarity Joins. 2019 IEEE 35th International Conference on Data Engineering (ICDE) (854-865). DOI: 10.1109/ICDE.2019.00081. Online publication date: Apr-2019
  • (2018) Tree2Vector: Learning a Vectorial Representation for Tree-Structured Data. IEEE Transactions on Neural Networks and Learning Systems 29:11 (5304-5318). DOI: 10.1109/TNNLS.2018.2797060. Online publication date: Nov-2018
  • (2016) Online Distance Measurement for Tree Data Event Streams. 2016 IEEE 14th Intl Conf on Dependable, Autonomic and Secure Computing, 14th Intl Conf on Pervasive Intelligence and Computing, 2nd Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress (DASC/PiCom/DataCom/CyberSciTech) (681-688). DOI: 10.1109/DASC-PICom-DataCom-CyberSciTec.2016.122. Online publication date: Aug-2016
  • (2014) A general algorithm for subtree similarity-search. 2014 IEEE 30th International Conference on Data Engineering (928-939). DOI: 10.1109/ICDE.2014.6816712. Online publication date: Mar-2014
  • (2014) A new similarity measure for subject hierarchical structures. Journal of Documentation 70:3 (364-391). DOI: 10.1108/JD-12-2012-0160. Online publication date: 6-May-2014
  • (2013) Indexing for subtree similarity-search using edit distance. Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (49-60). DOI: 10.1145/2463676.2463716. Online publication date: 22-Jun-2013
  • (2012) The address connector: noninvasive synchronization of hierarchical data sources. Knowledge and Information Systems 37:3 (639-663). DOI: 10.1007/s10115-012-0582-x. Online publication date: 11-Nov-2012
  • (2012) Windowed pq-grams for approximate joins of data-centric XML. The VLDB Journal — The International Journal on Very Large Data Bases 21:4 (463-488). DOI: 10.1007/s00778-011-0254-6. Online publication date: 1-Aug-2012
