Scalable Tabular Metadata Location and Classification in Large-Scale Structured Datasets

Islam, Kazi; Gubanov, Michael

doi:10.1007/978-3-030-86472-9_4


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12923)

Included in the following conference series:

International Conference on Database and Expert Systems Applications

Abstract

Locating and classifying tabular metadata (i.e. attribute names) is a fundamental problem for large-scale structured corpora. Corpora such as Web tables [24] and CORD-19 [35] contain thousands to millions of tables, but the rows (or columns) that hold attribute names (e.g. Last Name) often carry missing or incorrect labels. Missing or incorrect metadata labels [19] prevent, or at least significantly complicate, fundamental data management tasks such as query processing, data integration, indexing, and many others. Different sources position metadata rows/columns differently inside a table, which makes reliable identification challenging.

In this work we describe a scalable, hybrid two-layer ensemble based on deep and machine learning, combining a Long Short-Term Memory (LSTM) network and a Naive Bayes classifier to accurately identify metadata-containing rows or columns in a table. We performed an extensive evaluation on several structured datasets, including an ultra-large-scale dataset containing more than 15 million tables from more than 26 thousand sources, to demonstrate scalability and robustness to the heterogeneity that stems from a large number of sources. We observed that this two-layer ensemble outperforms recent previous approaches, and we report 81.53% accuracy at scale.
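
The abstract does not spell out how the two layers are wired together, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a multinomial Naive Bayes layer over coarse cell "shapes" (text, number, empty) and combines its probability with a second row scorer by weighted averaging. The names `cell_shape`, `RowNaiveBayes`, `text_fraction_stub`, `ensemble_score`, and `locate_metadata_row`, the feature set, the averaging rule, and the toy data are all hypothetical. In particular, `text_fraction_stub` is a simple heuristic standing in for the LSTM layer; it is not an LSTM.

```python
import math
from collections import Counter

SHAPES = ("text", "num", "empty")  # assumed coarse cell features


def cell_shape(cell):
    """Map a cell string to one of the coarse shapes in SHAPES."""
    s = cell.strip()
    if not s:
        return "empty"
    try:
        float(s.replace(",", ""))
        return "num"
    except ValueError:
        return "text"


class RowNaiveBayes:
    """Multinomial Naive Bayes over cell shapes, with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()
        self.shape_counts = {}

    def fit(self, rows, labels):
        # labels: True for a metadata (header) row, False for a data row
        for row, label in zip(rows, labels):
            self.class_counts[label] += 1
            counts = self.shape_counts.setdefault(label, Counter())
            counts.update(cell_shape(c) for c in row)
        return self

    def prob_metadata(self, row):
        total = sum(self.class_counts.values())
        log_post = {}
        for label, n in self.class_counts.items():
            counts = self.shape_counts[label]
            denom = sum(counts.values()) + self.alpha * len(SHAPES)
            lp = math.log(n / total)
            for cell in row:
                lp += math.log((counts[cell_shape(cell)] + self.alpha) / denom)
            log_post[label] = lp
        # Normalize in log space to avoid underflow on wide rows.
        top = max(log_post.values())
        weights = {k: math.exp(v - top) for k, v in log_post.items()}
        return weights.get(True, 0.0) / sum(weights.values())


def text_fraction_stub(row):
    """Hypothetical stand-in for the sequence-model layer (NOT an LSTM):
    returns the fraction of cells that look like free text."""
    return sum(cell_shape(c) == "text" for c in row) / len(row)


def ensemble_score(row, nb, sequence_model, weight=0.5):
    """Weighted average of the two layers' metadata probabilities."""
    return weight * nb.prob_metadata(row) + (1 - weight) * sequence_model(row)


def locate_metadata_row(table, nb, sequence_model):
    """Return the index of the row with the highest ensemble score."""
    scores = [ensemble_score(r, nb, sequence_model) for r in table]
    return max(range(len(table)), key=scores.__getitem__)


# Toy training data (made up for illustration).
train_rows = [
    ["Last Name", "First Name", "Age"],
    ["City", "State", "Population"],
    ["Smith", "John", "42"],
    ["Tallahassee", "FL", "196169"],
    ["Doe", "Jane", "37"],
    ["Miami", "FL", "442241"],
]
train_labels = [True, True, False, False, False, False]
nb = RowNaiveBayes().fit(train_rows, train_labels)

# The header is not the first row here: a title row precedes it.
table = [
    ["Inventory report", "", ""],
    ["Product", "Price", "Quantity"],
    ["Widget", "9.99", "3"],
    ["Gadget", "19.50", "7"],
]
header_index = locate_metadata_row(table, nb, text_fraction_stub)
```

In this toy table the header sits below a title row, which mirrors the point that sources place metadata at different positions. A trained sequence model would be passed in place of `text_fraction_stub`, since `ensemble_score` only requires a callable that maps a row to a probability.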

References

1. Census bureau. https://www.census.gov/data/datasets.html
2. Alexe, B., et al.: Simplifying information integration: object-based flow-of-mappings framework for integration. In: Castellanos, M., Dayal, U., Sellis, T. (eds.) BIRTE 2008. LNBIP, vol. 27, pp. 108–121. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-03422-0_9
3. Braunschweig, K., Thiele, M., Lehner, W.: From web tables to concepts: a semantic normalization approach. In: Johannesson, P., Lee, M.L., Liddle, S.W., Opdahl, A.L., López, Ó.P. (eds.) ER 2015. LNCS, vol. 9381, pp. 247–260. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-25264-3_18
4. Cafarella, M.J., Halevy, A., Wang, D.Z., Wu, E., Zhang, Y.: WebTables: exploring the power of tables on the web. In: VLDB (2008)
5. Cafarella, M.J., Halevy, A., Zhang, Y., Wang, D., Wu, E.: Uncovering the relational web. In: WebDB (2008)
6. Chen, Z., Dadiomov, S., Wesley, R., Xiao, G., Cory, D., Cafarella, M., Mackinlay, J.: Spreadsheet property detection with rule-assisted active learning. In: CIKM. ACM (2017)
7. Christodoulakis, C., Munson, E.B., Gabel, M., Brown, A.D., Miller, R.J.: Pytheas: pattern-based table discovery in CSV files. In: PVLDB, July 2020
8. Codd, E.F.: A relational model of data for large shared data banks. In: CACM. vol. 13, no. 6, June 1970
9. Dong, X.L.: Challenges and innovations in building a product knowledge graph. In: KDD (2018)
10. Fang, J., Mitra, P., Tang, Z., Giles, C.L.: Table header detection and classification. In: AAAI, vol. 26, no. 1, July 2012
11. Gentile, A.L., Ristoski, P., Eckel, S., Ritze, D., Paulheim, H.: Entity matching on web tables: a table embeddings approach for blocking. In: EDBT (2017)
12. Gol, M.G., Pujara, J., Szekely, P.: Tabular cell classification using pre-trained cell embeddings. In: ICDM (2019)
13. Gubanov, M.: Hybrid: a large-scale in-memory image analytics system. In: CIDR (2017)
14. Gubanov, M.: Polyfuse: a large-scale hybrid data fusion system. In: ICDE (2017)
15. Gubanov, M., Priya, M., Podkorytov, M.: CognitiveDB: an intelligent navigator for large-scale dark structured data. In: WWW (2017)
16. Gubanov, M., Pyayt, A.: READFAST: high-relevance search-engine for big text. In: ACM CIKM (2013)
17. Gubanov, M., Pyayt, A.: Type-aware web search. In: EDBT (2014)
18. Gubanov, M.N., Popa, L., Ho, H., Pirahesh, H., Chang, J.-Y., Chen, S.-C.: IBM UFO repository: object-oriented data integration. In: VLDB (2009)
19. Hancock, B., Lee, H., Yu, C.: Generating titles for web tables. In: WWW. ACM, New York (2019)
20. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
21. Jain, L.C., Medsker, L.R.: Recurrent Neural Networks: Design and Applications, 1st edn. CRC Press Inc., Boca Raton (1999)
22. Khan, R., Gubanov, M.: WebLens: towards interactive large-scale structured data profiling. In: CIKM. ACM (2020)
23. Jiang, L., Vitagliano, G.: Structure detection in verbose CSV files. In: EDBT, March 2021
24. Lehmberg, O., Ritze, D., Meusel, R., Bizer, C.: A large public corpus of web tables containing time and context metadata. In: Bourdeau, J., Hendler, J., Nkambou, R., Horrocks, I., Zhao, B.Y. (eds.) WWW (2016)
25. Limaye, G., Sarawagi, S., Chakrabarti, S.: Annotating and searching web tables using entities, types and relationships (2010)
26. Mulwad, V., Finin, T., Joshi, A.: Generating linked data by inferring the semantics of tables. In: VLDS, CEUR Workshop. CEUR-WS.org (2011)
27. Ortiz, S., Enbatan, C., Podkorytov, M., Soderman, D., Gubanov, M.: Hybrid.json: high-velocity parallel in-memory polystore JSON ingest. In: IEEE Bigdata (2017)
28. Podkorytov, M., Soderman, D., Gubanov, M.N.: Hybrid.poly: an interactive large-scale in-memory analytical polystore. In: ICDM Workshops, pp. 43–50. IEEE Computer Society (2017)
29. Ritze, D., Bizer, C.: Matching web tables to DBpedia - a feature utility study. In: EDBT (2017)
30. Simmons, M., Armstrong, D., Soderman, D., Gubanov, M.: Hybrid.media: high velocity video ingestion in an in-memory scalable analytical polystore. In: IEEE Bigdata (2017)
31. Soderman, S., Kola, A., Podkorytov, M., Geyer, M., Gubanov, M.: Hybrid.AI: a learning search engine for large-scale structured data. In: WWW (2018)
32. Subramanian, A., Srinivasa, S.: Semantic interpretation and integration of open data tables. In: Sarda, N.L., Acharya, P.S., Sen, S. (eds.) Geospatial Infrastructure, Applications and Technologies: India Case Studies, pp. 217–233. Springer, Singapore (2018). https://doi.org/10.1007/978-981-13-2330-0_17
33. Uhrig, R.: Introduction to artificial neural networks. In: IECON, vol. 1, pp. 33–37 (1995)
34. Villasenor, S., Nguyen, T., Kola, A., Soderman, S., Gubanov, M.: Scalable spam classifier for web tables. In: IEEE Big Data (2017)
35. Wang, L.L., Lo, K., et al.: The covid-19 open research dataset. ArXiv (2020)
36. Wang, N., Ren, X.: Identifying multiple entity columns in web tables. Int. J. Softw. Eng. Knowl. Eng. 28(3), 287–310 (2018)
37. Wang, Y., Hu, J.: A machine learning based approach for table detection on the web. In: WWW 2002, pp. 242–250. ACM, New York (2002)

Author information

Authors and Affiliations

Department of Computer Science, Florida State University, Tallahassee, FL, 32306, USA
Kazi Islam & Michael Gubanov

Corresponding author

Correspondence to Michael Gubanov.

Editor information

Editors and Affiliations

University of Vienna, Vienna, Austria
Christine Strauss
Johannes Kepler University of Linz, Linz, Oberösterreich, Austria
Gabriele Kotsis
Vienna University of Technology, Vienna, Austria
A Min Tjoa
Johannes Kepler University of Linz, Linz, Austria
Ismail Khalil

Cite this paper

Islam, K., Gubanov, M. (2021). Scalable Tabular Metadata Location and Classification in Large-Scale Structured Datasets. In: Strauss, C., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Database and Expert Systems Applications. DEXA 2021. Lecture Notes in Computer Science, vol 12923. Springer, Cham. https://doi.org/10.1007/978-3-030-86472-9_4

DOI: https://doi.org/10.1007/978-3-030-86472-9_4
Published: 31 August 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86471-2
Online ISBN: 978-3-030-86472-9
eBook Packages: Computer Science, Computer Science (R0)
