Auto-transform: learning-to-transform by patterns

Published: 01 July 2020

Abstract

Data transformation is a long-standing problem in data management. Recent work adopts a "transform-by-example" (TBE) paradigm to infer transformation programs from user-provided input/output examples, which greatly improves usability and has brought such features into mainstream software such as Microsoft Excel, Power BI, and Trifacta.
While TBE is a significant step forward, the need for users to provide paired input/output examples still limits its applicability. In this work, we study an alternative that transforms data based on input/output data patterns only (without paired examples). We term this new paradigm transform-by-patterns (TBP). Specifically, we demonstrate that there is a rich class of transformations in TBP that can be "learned" from large collections of paired table columns. We show that the proposed method can harvest such transformations across diverse domains and corpora (e.g., in different languages such as English, Chinese, and Spanish). TBP transformations so obtained can be used in scenarios such as suggesting data repairs in tables or automating transformations in ETL pipelines. Extensive experiments on real data suggest that TBP outperforms existing methods on tasks such as data repairs and is a promising direction for future research.
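
To make the TBP idea concrete, the following is a minimal sketch of what applying one pattern-triggered transformation to a column might look like. It is an illustration only and not the method proposed in the paper: the `PatternRule` class, the `ISO_TO_US` rule, and the helper functions are hypothetical names, and the date formats are assumed for the example. The point it shows is that the rule fires on the pattern of a value alone, with no paired input/output examples supplied by the user.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PatternRule:
    """Hypothetical TBP-style rule: values matching `source` are rewritten by `rewrite`."""
    source: re.Pattern
    rewrite: Callable[[re.Match], str]


# Illustrative rule (an assumption, not taken from the paper):
# values shaped like "2020-07-01" are rewritten as "07/01/2020".
ISO_TO_US = PatternRule(
    source=re.compile(r"(\d{4})-(\d{2})-(\d{2})"),
    rewrite=lambda m: f"{m.group(2)}/{m.group(3)}/{m.group(1)}",
)


def apply_rule(rule: PatternRule, value: str) -> Optional[str]:
    """Return the rewritten value, or None if the value does not match the source pattern."""
    match = rule.source.fullmatch(value)
    return rule.rewrite(match) if match else None


def suggest_repairs(column: List[str], rule: PatternRule) -> List[str]:
    """Rewrite only the cells that match the rule's source pattern; keep the others unchanged."""
    repaired = []
    for value in column:
        transformed = apply_rule(rule, value)
        repaired.append(value if transformed is None else transformed)
    return repaired


# A column in which one cell deviates from the dominant pattern.
column = ["07/01/2020", "2020-07-02", "07/03/2020"]
repaired = suggest_repairs(column, ISO_TO_US)
print(repaired)
```

In this sketch the rule is written by hand; the abstract describes learning such transformations from large collections of paired table columns, which the example does not attempt.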

Information & Contributors

Information

Published In

Proceedings of the VLDB Endowment  Volume 13, Issue 12
August 2020
1710 pages
ISSN:2150-8097
Publisher

VLDB Endowment

Qualifiers

  • Research-article

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 54
  • Downloads (Last 6 weeks): 5
Reflects downloads up to 30 Aug 2024

Citations

Cited By

  • (2024) Auto-Tables: Relationalize Tables without Using Examples. ACM SIGMOD Record 53:1 (76-85). DOI: 10.1145/3665252.3665269. Online publication date: 14-May-2024
  • (2024) Auto-Formula: Recommend Formulas in Spreadsheets using Contrastive Learning for Table Representations. Proceedings of the ACM on Management of Data 2:3 (1-27). DOI: 10.1145/3654925. Online publication date: 30-May-2024
  • (2024) DTT: An Example-Driven Tabular Transformer for Joinability by Leveraging Large Language Models. Proceedings of the ACM on Management of Data 2:1 (1-24). DOI: 10.1145/3639279. Online publication date: 26-Mar-2024
  • (2023) Auto-Tables: Synthesizing Multi-Step Transformations to Relationalize Tables without Using Examples. Proceedings of the VLDB Endowment 16:11 (3391-3403). DOI: 10.14778/3611479.3611534. Online publication date: 24-Aug-2023
  • (2023) Explaining Dataset Changes for Semantic Data Versioning with Explain-Da-V. Proceedings of the VLDB Endowment 16:6 (1587-1600). DOI: 10.14778/3583140.3583169. Online publication date: 20-Apr-2023
  • (2021) Auto-pipeline. Proceedings of the VLDB Endowment 14:11 (2563-2575). DOI: 10.14778/3476249.3476303. Online publication date: 27-Oct-2021
  • (2021) Semantic Concept Annotation for Tabular Data. Proceedings of the 30th ACM International Conference on Information & Knowledge Management (844-853). DOI: 10.1145/3459637.3482295. Online publication date: 26-Oct-2021
