research-article

Auto-transform: learning-to-transform by patterns

Authors:

Surajit Chaudhuri

Proceedings of the VLDB Endowment, Volume 13, Issue 12

Pages 2368 - 2381

https://doi.org/10.14778/3407790.3407831

Published: 01 July 2020

Abstract

Data transformation is a long-standing problem in data management. Recent work adopts a "transform-by-example" (TBE) paradigm to infer transformation programs from user-provided input/output examples, which greatly improves usability and has brought such features into mainstream software like Microsoft Excel, Power BI, and Trifacta.

While TBE is a significant step forward, the need for users to provide paired input/output examples still limits its applicability. In this work, we study an alternative that transforms data based on input/output data patterns only (without paired examples). We term this new paradigm transform-by-patterns (TBP). Specifically, we demonstrate that there is a rich class of transformations in TBP that can be "learned" from large collections of paired table columns. We show that the proposed method can harvest such transformations across diverse domains and corpora (e.g., in different languages such as English, Chinese, and Spanish). TBP transformations so obtained can be used in scenarios such as suggesting data repairs in tables or automating transformations in ETL pipelines. Extensive experiments on real data suggest that TBP outperforms existing methods on tasks such as data repairs and is a promising direction for future research.
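To make the TBP idea concrete, the following is a minimal Python sketch of what applying one pattern-based transformation to a column might look like. It is an illustration only, not the paper's algorithm: the pattern notation (`<digit>{n}`, `<letter>{n}`), the `PatternRule` class, the `apply_rule` helper, and the month-first date rule are all hypothetical names and assumptions introduced here, and the rule is hand-written rather than learned from a table corpus.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical tokenizer: runs of ASCII digits, runs of ASCII letters, or any other single character.
TOKEN = re.compile(r"(?P<digit>[0-9]+)|(?P<letter>[A-Za-z]+)|(?P<other>.)", re.DOTALL)


def to_pattern(value: str) -> str:
    """Generalize a string into a coarse pattern, e.g. '2020-07-01' -> '<digit>{4}-<digit>{2}-<digit>{2}'."""
    parts = []
    for match in TOKEN.finditer(value):
        kind, text = match.lastgroup, match.group()
        # Punctuation and other symbols are kept literally; digit/letter runs are abstracted.
        parts.append(text if kind == "other" else f"<{kind}>{{{len(text)}}}")
    return "".join(parts)


@dataclass(frozen=True)
class PatternRule:
    """Hypothetical TBP rule: values matching source_pattern are rewritten to match target_pattern."""
    source_pattern: str
    target_pattern: str
    transform: Callable[[str], str]


def apply_rule(column: List[str], rule: PatternRule) -> List[str]:
    """Rewrite only values whose pattern equals the rule's source pattern; no paired examples are needed."""
    result = []
    for value in column:
        if to_pattern(value) == rule.source_pattern:
            candidate = rule.transform(value)
            # Accept the rewrite only if it actually lands in the target pattern.
            result.append(candidate if to_pattern(candidate) == rule.target_pattern else value)
        else:
            result.append(value)
    return result


def us_date_to_iso(value: str) -> str:
    """Assumption for this example: the source format is month/day/year."""
    month, day, year = value.split("/")
    return f"{year}-{month}-{day}"


# Hand-written rule for illustration; in TBP such rules would be harvested from paired table columns.
DATE_RULE = PatternRule(
    source_pattern="<digit>{2}/<digit>{2}/<digit>{4}",
    target_pattern="<digit>{4}-<digit>{2}-<digit>{2}",
    transform=us_date_to_iso,
)

repaired = apply_rule(["2020-07-01", "07/15/2020", "n/a"], DATE_RULE)
```

In this sketch the rule is triggered purely by the pattern of each input value, which mirrors the data-repair scenario mentioned in the abstract: a value whose pattern deviates from the rest of the column is rewritten, while values that match neither pattern are left untouched.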

Published In

Proceedings of the VLDB Endowment Volume 13, Issue 12

August 2020

1710 pages

ISSN:2150-8097

Editors: Magdalena Balazinska (University of Washington), Xiaofang Zhou (University of Queensland, Australia)

Publisher

VLDB Endowment
