Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.1145/3514221.3517891acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

Leva: Boosting Machine Learning Performance with Relational Embedding Data Augmentation

Published: 11 June 2022 Publication History

Abstract

In this paper, we present Leva, an end-to-end system that boosts the performance of machine learning tasks over relational data. Leva builds a relational embedding by representing relational data as a graph and then using embedding methods to represent the graph as vectors. The embedding represents information from the entire database, including useful information for the downstream machine learning task. At the same time, some information in the graph will be erroneous, for example, corresponding to incorrect inclusion dependencies. However, we show that the supervision signal from the downstream task filters out information that is not useful. The result is a boost in ML performance. This result means that it is possible for analysts to avoid the time-consuming effort of collecting features across multiple relations-which requires solving a data discovery and integration problem-and instead rely on these techniques to train better-performing models. We demonstrate Leva's performance on different classification and regression datasets and compare it with multiple other baselines.

Supplemental Material

MP4 File
Presentation video.

References

[1]
Anna Atramentov, Hector Leiva, and Vasant Honavar. 2003. A multi-relational decision tree learning algorithm--implementation and experiments. In International Conference on Inductive Logic Programming. Springer, 38--56.
[2]
Petr Berka. 1999. Workshop notes on Discovery Challenge PKDD'99. http://lisp.vse.cz/pkdd99/
[3]
Hendrik Blockeel, Savso Dvzeroski, Boris Kompare, Stefan Kramer, and Bernhard Pfahringer. 2004. Experiments In Predicting Biodegradability. Applied Artificial Intelligence, Vol. 18, 2 (2004), 157--181. https://doi.org/10.1.1.2.3797
[4]
Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K Warmuth. 1989. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM (JACM), Vol. 36, 4 (1989), 929--965.
[5]
Rajesh Bordawekar and Oded Shmueli. 2017. Using Word Embedding to Enable Semantic Queries in Relational Databases. In Proceedings of the 1st Workshop on Data Management for End-to-End Machine Learning. ACM, Chicago IL USA, 1--4. https://doi.org/10.1145/3076246.3076251
[6]
Antoine Bordes, Nicolas Usunier, et al. 2013. Translating Embeddings for Modeling Multi-relational Data. In NIPS .
[7]
Antoine Bordes, Jason Weston, et al. 2011. Learning Structured Embeddings of Knowledge Bases. In AAAI .
[8]
Shaosheng Cao, Wei Lu, and Qiongkai Xu. 2015. GraRep: Learning Graph Representations with Global Structural Information. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management (Melbourne, Australia) (CIKM '15). Association for Computing Machinery, New York, NY, USA, 891--900. https://doi.org/10.1145/2806416.2806512
[9]
Riccardo Cappuzzo, Paolo Papotti, and Saravanan Thirumuruganathan. 2020. Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. ACM, Portland OR USA, 1335--1349. https://doi.org/10.1145/3318464.3389742
[10]
R. Castro Fernandez, Z. Abedjan, F. Koko, G. Yuan, S. Madden, and M. Stonebraker. 2018. Aurum: A Data Discovery System. In 2018 IEEE 34th International Conference on Data Engineering (ICDE). 1001--1012. https://doi.org/10.1109/ICDE.2018.00094
[11]
Raul Castro Fernandez, Jisoo Min, Demitri Nava, and Samuel Madden. 2019. Lazo: A Cardinality-Based Method for Coupled Estimation of Jaccard Similarity and Containment. In 2019 IEEE 35th International Conference on Data Engineering (ICDE). 1190--1201. https://doi.org/10.1109/ICDE.2019.00109
[12]
Jie Cheng, Christos Hatzis, Hisashi Hayashi, Mark-André Krogel, Shinichi Morishita, David Page, and Jun Sese. 2002. KDD Cup 2001 report., 47 pages. https://doi.org/10.1145/507515.507523
[13]
Nadiia Chepurko, Ryan Marcus, Emanuel Zgraggen, Raul Castro Fernandez, Tim Kraska, and David Karger. 2020. ARDA: automatic relational data augmentation for machine learning. Proceedings of the VLDB Endowment, Vol. 13, 9 (May 2020), 1373--1387. https://doi.org/10.14778/3397230.3397235
[14]
Fernando Chirigati, Rémi Rampin, Aécio Santos, Aline Bessa, and Juliana Freire. 2021. Auctus: A Dataset Search Engine for Data Augmentation. arxiv: 2102.05716 [cs.IR]
[15]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 4171--4186. https://doi.org/10.18653/v1/N19--1423
[16]
Muhammad Ebraheem, Saravanan Thirumuruganathan, Shafiq Joty, Mourad Ouzzani, and Nan Tang. 2018. Distributed representations of tuples for entity resolution. Proceedings of the VLDB Endowment, Vol. 11, 11 (Jul 2018), 1454--1467. https://doi.org/10.14778/3236187.3236198
[17]
Raul Castro Fernandez and Samuel Madden. 2019. Termite: a system for tunneling through heterogeneous data. In Proceedings of the Second International Workshop on Exploiting Artificial Intelligence Techniques for Data Management - aiDM '19. ACM Press, Amsterdam, Netherlands, 1--8. https://doi.org/10.1145/3329859.3329877
[18]
Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable Feature Learning for Networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, San Francisco California USA, 855--864. https://doi.org/10.1145/2939672.2939754
[19]
Anshul Gupta, George Karypis, and Vipin Kumar. 1997. Highly scalable parallel algorithms for sparse matrix factorization. IEEE Transactions on Parallel and Distributed systems, Vol. 8, 5 (1997), 502--520.
[20]
Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. 2010. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. arxiv: 0909.4061 [math.NA]
[21]
William L Hamilton, Rex Ying, and Jure Leskovec. 2017. Representation learning on graphs: Methods and applications. arXiv preprint arXiv:1709.05584 (2017).
[22]
Shaoxiong Ji, Shirui Pan, Erik Cambria, Pekka Marttinen, and Philip S. Yu. 2021. A Survey on Knowledge Graphs: Representation, Acquisition, and Applications. IEEE Transactions on Neural Networks and Learning Systems (2021), 1--21. https://doi.org/10.1109/TNNLS.2021.3070843
[23]
Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
[24]
Denis Krompaß, Maximilian Nickel, et al. 2013. Non-negative tensor factorization with rescal. In ECML workshop .
[25]
Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu. 2016. To Join or Not to Join?: Thinking Twice about Joins before Feature Selection. In Proceedings of the 2016 International Conference on Management of Data. ACM, San Francisco California USA, 19--34. https://doi.org/10.1145/2882903.2882952
[26]
Timothée Lacroix, Nicolas Usunier, and Guillaume Obozinski. 2018. Canonical tensor decomposition for knowledge base completion. In International Conference on Machine Learning. PMLR, 2863--2872.
[27]
Alyssa Whitlock Lees, Cong Yu, Huan Sun, Will Wu, and Xiang Deng. 2020. TURL: Table Understanding through Representation Learning.
[28]
Adam Lerer, Ledell Wu, Jiajun Shen, Timothee Lacroix, Luca Wehrstedt, Abhijit Bose, and Alex Peysakhovich. 2019. PyTorch-BigGraph: A Large-scale Graph Embedding System. In Proceedings of the 2nd SysML Conference. Palo Alto, CA, USA.
[29]
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.), Vol. 26. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf
[30]
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP). 1532--1543. http://www.aclweb.org/anthology/D14--1162
[31]
Matic Perovvs ek, Anvz e Vavpetivc, Janez Kranjc, Bojan Cestnik, and Nada Lavravc. 2015. Wordification: Propositionalization by unfolding relational data into bags of words. Expert Systems with Applications, Vol. 42, 17--18 (2015), 6442--6456.
[32]
Jose Picado, John Davis, Arash Termehchy, and Ga Young Lee. 2020. Learning Over Dirty Data Without Cleaning. CoRR, Vol. abs/2004.02308 (2020). arxiv: 2004.02308 https://arxiv.org/abs/2004.02308
[33]
Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Chi Wang, Kuansan Wang, and Jie Tang. 2019. Netsmf: Large-scale network embedding as sparse matrix factorization. In The World Wide Web Conference. 1509--1520.
[34]
Ram Sagar. 2021. Big Data To Good Data: Andrew Ng Urges ML Community To Be More Data-Centric And Less Model-Centric. https://analyticsindiamag.com/big-data-to-good-data-andrew-ng-urges-ml-community-to-be-more-data-centric-and-less-model-centric/
[35]
Aécio Santos, Aline Bessa, Fernando Chirigati, Christopher Musco, and Juliana Freire. 2021. Correlation Sketches for Approximate Join-Correlation Queries. arXiv preprint arXiv:2104.03353 (2021).
[36]
Vraj Shah, Arun Kumar, and Xiaojin Zhu. 2017. Are key-foreign key joins safe to avoid when learning high-capacity classifiers? arXiv preprint arXiv:1704.00485 (2017).
[37]
Richard Socher, Danqi Chen, et al. 2013. Reasoning with Neural Tensor Networks for Knowledge Base Completion. In NIPS .
[38]
Petar Velicković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903 (2017).
[39]
Jie Zhang, Yuxiao Dong, Yan Wang, Jie Tang, and Ming Ding. 2019. ProNE: Fast and Scalable Network Representation Learning. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19. International Joint Conferences on Artificial Intelligence Organization, 4278--4284. https://doi.org/10.24963/ijcai.2019/594
[40]
Erkang Zhu, Fatemeh Nargesian, Ken Q. Pu, and Renée J. Miller. 2016. LSH Ensemble: Internet Scale Domain Search. CoRR, Vol. abs/1603.07410 (2016). arxiv: 1603.07410 http://arxiv.org/abs/1603.07410

Cited By

View all
  • (2024)Unstructured Data Fusion for Schema and Data ExtractionProceedings of the ACM on Management of Data10.1145/36549842:3(1-26)Online publication date: 30-May-2024
  • (2024)Graph Machine Learning Meets Multi-Table Relational DataProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining10.1145/3637528.3671471(6502-6512)Online publication date: 25-Aug-2024
  • (2024)Human-in-the-Loop Feature Discovery for Tabular DataProceedings of the 33rd ACM International Conference on Information and Knowledge Management10.1145/3627673.3679211(5215-5219)Online publication date: 21-Oct-2024
  • Show More Cited By

Index Terms

  1. Leva: Boosting Machine Learning Performance with Relational Embedding Data Augmentation

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    SIGMOD '22: Proceedings of the 2022 International Conference on Management of Data
    June 2022
    2597 pages
    ISBN:9781450392495
    DOI:10.1145/3514221
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 11 June 2022

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. data discovery
    2. feature engineering
    3. training data augmentation

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS '22
    Sponsor:

    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)80
    • Downloads (Last 6 weeks)11
    Reflects downloads up to 13 Jan 2025

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)Unstructured Data Fusion for Schema and Data ExtractionProceedings of the ACM on Management of Data10.1145/36549842:3(1-26)Online publication date: 30-May-2024
    • (2024)Graph Machine Learning Meets Multi-Table Relational DataProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining10.1145/3637528.3671471(6502-6512)Online publication date: 25-Aug-2024
    • (2024)Human-in-the-Loop Feature Discovery for Tabular DataProceedings of the 33rd ACM International Conference on Information and Knowledge Management10.1145/3627673.3679211(5215-5219)Online publication date: 21-Oct-2024
    • (2024)Mitigating Data Scarcity in Supervised Machine Learning Through Reinforcement Learning Guided Data Generation2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00278(3613-3626)Online publication date: 13-May-2024
    • (2024)A Critical Re-evaluation of Record Linkage Benchmarks for Learning-Based Matching Algorithms2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00265(3435-3448)Online publication date: 13-May-2024
    • (2024)IDAGEmb: An Incremental Data Alignment Based on Graph EmbeddingBig Data Analytics and Knowledge Discovery10.1007/978-3-031-68323-7_2(19-33)Online publication date: 26-Aug-2024
    • (2023)How Large Language Models Will Disrupt Data ManagementProceedings of the VLDB Endowment10.14778/3611479.361152716:11(3302-3309)Online publication date: 1-Jul-2023
    • (2023)DiffPrep: Differentiable Data Preprocessing Pipeline Search for Learning over Tabular DataProceedings of the ACM on Management of Data10.1145/35893281:2(1-26)Online publication date: 20-Jun-2023
    • (2023)Metam: Goal-Oriented Data Discovery2023 IEEE 39th International Conference on Data Engineering (ICDE)10.1109/ICDE55515.2023.00213(2780-2793)Online publication date: Apr-2023

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media