research-article

Leva: Boosting Machine Learning Performance with Relational Embedding Data Augmentation

Authors:

Raul Castro Fernandez

SIGMOD '22: Proceedings of the 2022 International Conference on Management of Data

Pages 1504 - 1517

https://doi.org/10.1145/3514221.3517891

Published: 11 June 2022

Abstract

In this paper, we present Leva, an end-to-end system that boosts the performance of machine learning tasks over relational data. Leva builds a relational embedding by representing relational data as a graph and then using embedding methods to represent the graph as vectors. The embedding captures information from the entire database, including information useful for the downstream machine learning task. Some information in the graph will be erroneous, for example, information corresponding to incorrect inclusion dependencies. However, we show that the supervision signal from the downstream task filters out information that is not useful. The result is a boost in ML performance. This means that analysts can avoid the time-consuming effort of collecting features across multiple relations, which requires solving a data discovery and integration problem, and instead rely on these techniques to train better-performing models. We demonstrate Leva's performance on different classification and regression datasets and compare it with multiple baselines.
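As a rough illustration of the pipeline the abstract describes (relational data to graph, graph to vectors), the sketch below builds a toy graph from two hypothetical tables and derives crude node vectors from random-walk visit counts. It is not Leva's implementation: the table contents, the row-to-value graph encoding, the `build_graph` and `walk_vectors` helpers, and the use of visit counts in place of a real embedding method are all assumptions made for illustration.

```python
import math
import random
from collections import defaultdict

# Hypothetical toy database: two relations that share a "city" value domain.
customers = [
    {"id": "c1", "city": "Chicago", "plan": "gold"},
    {"id": "c2", "city": "Boston", "plan": "basic"},
    {"id": "c3", "city": "Chicago", "plan": "basic"},
]
stores = [
    {"store": "s1", "city": "Chicago", "size": "large"},
    {"store": "s2", "city": "Boston", "size": "small"},
]


def build_graph(tables):
    """Link every row node to one node per distinct cell value (assumed encoding)."""
    adj = defaultdict(set)
    for name, rows in tables.items():
        for i, row in enumerate(rows):
            row_node = f"row:{name}:{i}"
            for value in row.values():
                value_node = f"val:{value}"
                adj[row_node].add(value_node)
                adj[value_node].add(row_node)
    return dict(adj)


def walk_vectors(adj, walks_per_node=50, walk_len=6, seed=0):
    """Crude stand-in for an embedding: count the nodes visited on short random walks."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    index = {node: k for k, node in enumerate(nodes)}
    neighbors = {node: sorted(adj[node]) for node in nodes}
    vectors = {node: [0.0] * len(nodes) for node in nodes}
    for start in nodes:
        for _ in range(walks_per_node):
            current = start
            for _ in range(walk_len):
                current = rng.choice(neighbors[current])
                vectors[start][index[current]] += 1.0
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


graph = build_graph({"customers": customers, "stores": stores})
vectors = walk_vectors(graph)

# Compare a Chicago customer with the Chicago store and with the Boston store.
same_city = cosine(vectors["row:customers:0"], vectors["row:stores:0"])
other_city = cosine(vectors["row:customers:0"], vectors["row:stores:1"])
print(f"same city: {same_city:.3f}, other city: {other_city:.3f}")
```

In this toy setup, rows from different tables that share a cell value end up with more similar vectors than rows that do not, and such vectors could be appended as extra features for a downstream model. Note that the encoding links rows on any shared value, whether or not the link is meaningful, so it also produces the kind of spurious connections the abstract expects the downstream supervision signal to filter out.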

Index Terms

1. Information systems
  1. Data management systems
    1. Information integration
      1. Mediators and data integration

Published In

SIGMOD '22: Proceedings of the 2022 International Conference on Management of Data

June 2022

2597 pages

ISBN:9781450392495

DOI:10.1145/3514221

General Chair: Zachary Ives, University of Pennsylvania (USA)
Program Chairs: Angela Bonifati, Lyon 1 University (France); Amr El Abbadi, University of California, Santa Barbara (USA)

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGMOD/PODS '22: International Conference on Management of Data

Sponsor: SIGMOD

June 12 - 17, 2022

Philadelphia, PA, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
