Scalable Tabular Metadata Location and Classification in Large-Scale Structured Datasets

  • Conference paper
Database and Expert Systems Applications (DEXA 2021)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12923)

Abstract

Tabular metadata (i.e. attribute names) location and classification is a fundamental problem for large-scale structured corpora. Corpora such as Web tables [24] and CORD-19 [35] contain thousands to millions of tables, but often have missing or incorrect labels for the rows (or columns) that hold attribute names (e.g. Last Name). Missing or incorrect metadata labels [19] prevent, or at least significantly complicate, fundamental data management tasks such as query processing, data integration, indexing, and many others. Different sources position metadata rows or columns differently inside a table, which makes reliable identification challenging.

In this work we describe a scalable, hybrid two-layer ensemble based on deep and machine learning, combining Long Short-Term Memory (LSTM) and a Naive Bayes classifier to accurately identify metadata-containing rows or columns in a table. We performed an extensive evaluation on several structured datasets, including an ultra-large-scale dataset containing more than 15 million tables from more than 26 thousand sources, to demonstrate scalability and resistance to the heterogeneity that stems from a large number of sources. We observed that this two-layer ensemble outperforms recent previous approaches, and we report 81.53% accuracy at scale.
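
To make the row-classification task concrete, the following is a minimal sketch of the Naive Bayes side only. It is not the authors' implementation: the abstract does not say which features the classifier uses or how it is coupled to the LSTM layer, so the cell-shape features, the class labels "metadata" and "data", the toy training rows, and the helper names (cell_shape, NaiveBayesRowClassifier, locate_metadata_row) are all assumptions made for illustration. The LSTM layer is omitted entirely.

```python
import math
from collections import Counter, defaultdict

# Hypothetical coarse cell shapes; the paper's real features are not given in the abstract.
SHAPES = ("empty", "number", "words", "mixed")


def cell_shape(cell):
    """Map a cell string to one of the coarse SHAPES tokens."""
    text = cell.strip()
    if not text:
        return "empty"
    if text.replace(".", "", 1).replace("-", "", 1).isdigit():
        return "number"
    if text.replace(" ", "").replace("_", "").isalpha():
        return "words"
    return "mixed"


class NaiveBayesRowClassifier:
    """Multinomial Naive Bayes with Laplace smoothing over cell-shape tokens."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()
        self.token_counts = defaultdict(Counter)

    def fit(self, rows, labels):
        for row, label in zip(rows, labels):
            self.class_counts[label] += 1
            for cell in row:
                self.token_counts[label][cell_shape(cell)] += 1
        return self

    def log_scores(self, row):
        """Return the unnormalised log posterior of each class for one row."""
        total_rows = sum(self.class_counts.values())
        scores = {}
        for label, n_rows in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + self.alpha * len(SHAPES)
            score = math.log(n_rows / total_rows)
            for cell in row:
                score += math.log((counts[cell_shape(cell)] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, row):
        scores = self.log_scores(row)
        return max(scores, key=scores.get)


def locate_metadata_row(clf, table):
    """Return the index of the row with the highest metadata-vs-data log-odds."""
    def log_odds(row):
        scores = clf.log_scores(row)
        return scores["metadata"] - scores["data"]
    return max(range(len(table)), key=lambda i: log_odds(table[i]))


# Toy training data (invented for this sketch).
train_rows = [
    ["Last Name", "First Name", "Age"],
    ["City", "Population", "Area"],
    ["Smith", "Anna", "34"],
    ["Lee", "Bo", "27"],
    ["Boston", "675647", "232.1"],
    ["Austin", "961855", "704.0"],
]
train_labels = ["metadata", "metadata", "data", "data", "data", "data"]
clf = NaiveBayesRowClassifier().fit(train_rows, train_labels)

# A table whose attribute names are not in the first row.
table = [
    ["2020", "", ""],
    ["Country", "Capital", "GDP"],
    ["France", "Paris", "2.9"],
    ["Japan", "Tokyo", "5.0"],
]
print(clf.predict(table[1]), locate_metadata_row(clf, table))
```

On this toy table the sketch labels the second row as metadata and returns index 1. Scoring every row and taking the highest log-odds, rather than assuming the header sits in row 0, reflects the abstract's point that sources position metadata differently; shape features this coarse would not separate a textual title row from a header, which is one reason a sequence model over cell contents is a natural complement.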

References

  1. Census bureau. https://www.census.gov/data/datasets.html

  2. Alexe, B., et al.: Simplifying information integration: object-based flow-of-mappings framework for integration. In: Castellanos, M., Dayal, U., Sellis, T. (eds.) BIRTE 2008. LNBIP, vol. 27, pp. 108–121. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-03422-0_9

  3. Braunschweig, K., Thiele, M., Lehner, W.: From web tables to concepts: a semantic normalization approach. In: Johannesson, P., Lee, M.L., Liddle, S.W., Opdahl, A.L., López, Ó.P. (eds.) ER 2015. LNCS, vol. 9381, pp. 247–260. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-25264-3_18

  4. Cafarella, M.J., Halevy, A., Wang, D.Z., Wu, E., Zhang, Y.: WebTables: exploring the power of tables on the web. In: VLDB (2008)

  5. Cafarella, M.J., Halevy, A., Zhang, Y., Wang, D., Wu, E.: Uncovering the relational web. In: WebDB (2008)

  6. Chen, Z., Dadiomov, S., Wesley, R., Xiao, G., Cory, D., Cafarella, M., Mackinlay, J.: Spreadsheet property detection with rule-assisted active learning. In: CIKM. ACM (2017)

  7. Christodoulakis, C., Munson, E.B., Gabel, M., Brown, A.D., Miller, R.J.: Pytheas: pattern-based table discovery in CSV files. In: PVLDB, July 2020

  8. Codd, E.F.: A relational model of data for large shared data banks. In: CACM. vol. 13, no. 6, June 1970

  9. Dong, X.L.: Challenges and innovations in building a product knowledge graph. In: KDD (2018)

  10. Fang, J., Mitra, P., Tang, Z., Giles, C.L.: Table header detection and classification. In: AAAI, vol. 26, no. 1, July 2012

  11. Gentile, A.L., Ristoski, P., Eckel, S., Ritze, D., Paulheim, H.: Entity matching on web tables: a table embeddings approach for blocking. In: EDBT (2017)

  12. Gol, M.G., Pujara, J., Szekely, P.: Tabular cell classification using pre-trained cell embeddings. In: ICDM (2019)

  13. Gubanov, M.: Hybrid: a large-scale in-memory image analytics system. In: CIDR (2017)

  14. Gubanov, M.: Polyfuse: a large-scale hybrid data fusion system. In: ICDE (2017)

  15. Gubanov, M., Priya, M., Podkorytov, M.: CognitiveDB: an intelligent navigator for large-scale dark structured data. In: WWW (2017)

  16. Gubanov, M., Pyayt, A.: READFAST: high-relevance search-engine for big text. In: ACM CIKM (2013)

  17. Gubanov, M., Pyayt, A.: Type-aware web search. In: EDBT (2014)

  18. Gubanov, M.N., Popa, L., Ho, H., Pirahesh, H., Chang, J.-Y., Chen, S.-C.: IBM UFO repository: object-oriented data integration. In: VLDB (2009)

  19. Hancock, B., Lee, H., Yu, C.: Generating titles for web tables. In: WWW. ACM, New York (2019)

  20. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)

  21. Jain, L.C., Medsker, L.R.: Recurrent Neural Networks: Design and Applications, 1st edn. CRC Press Inc., Boca Raton (1999)

  22. Khan, R., Gubanov, M.: WebLens: towards interactive large-scale structured data profiling. In: CIKM. ACM (2020)

  23. Jiang, L., Vitagliano, G.: Structure detection in verbose CSV files. In: EDBT, March 2021

  24. Lehmberg, O., Ritze, D., Meusel, R., Bizer, C.: A large public corpus of web tables containing time and context metadata. In: Bourdeau, J., Hendler, J., Nkambou, R., Horrocks, I., Zhao, B.Y. (eds.) WWW (2016)

  25. Limaye, G., Sarawagi, S., Chakrabarti, S.: Annotating and searching web tables using entities, types and relationships (2010)

  26. Mulwad, V., Finin, T., Joshi, A.: Generating linked data by inferring the semantics of tables. In: VLDS, CEUR Workshop. CEUR-WS.org (2011)

  27. Ortiz, S., Enbatan, C., Podkorytov, M., Soderman, D., Gubanov, M.: Hybrid.json: high-velocity parallel in-memory polystore JSON ingest. In: IEEE Bigdata (2017)

  28. Podkorytov, M., Soderman, D., Gubanov, M.N.: Hybrid.poly: an interactive large-scale in-memory analytical polystore. In: ICDM Workshops, pp. 43–50. IEEE Computer Society (2017)

  29. Ritze, D., Bizer, C.: Matching web tables to DBpedia - a feature utility study. In: EDBT (2017)

  30. Simmons, M., Armstrong, D., Soderman, D., Gubanov, M.: Hybrid.media: high velocity video ingestion in an in-memory scalable analytical polystore. In: IEEE Bigdata (2017)

  31. Soderman, S., Kola, A., Podkorytov, M., Geyer, M., Gubanov, M.: Hybrid.AI: a learning search engine for large-scale structured data. In: WWW (2018)

  32. Subramanian, A., Srinivasa, S.: Semantic interpretation and integration of open data tables. In: Sarda, N.L., Acharya, P.S., Sen, S. (eds.) Geospatial Infrastructure, Applications and Technologies: India Case Studies, pp. 217–233. Springer, Singapore (2018). https://doi.org/10.1007/978-981-13-2330-0_17

  33. Uhrig, R.: Introduction to artificial neural networks. In: IECON, vol. 1, pp. 33–37 (1995)

  34. Villasenor, S., Nguyen, T., Kola, A., Soderman, S., Gubanov, M.: Scalable spam classifier for web tables. In: IEEE Big Data (2017)

  35. Wang, L.L., Lo, K., et al.: The covid-19 open research dataset. ArXiv (2020)

  36. Wang, N., Ren, X.: Identifying multiple entity columns in web tables. Int. J. Softw. Eng. Knowl. Eng. 28(3), 287–310 (2018)

  37. Wang, Y., Hu, J.: A machine learning based approach for table detection on the web. In: WWW 2002, pp. 242–250. ACM, New York (2002)

Author information

Corresponding author

Correspondence to Michael Gubanov.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Islam, K., Gubanov, M. (2021). Scalable Tabular Metadata Location and Classification in Large-Scale Structured Datasets. In: Strauss, C., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Database and Expert Systems Applications. DEXA 2021. Lecture Notes in Computer Science, vol 12923. Springer, Cham. https://doi.org/10.1007/978-3-030-86472-9_4

  • DOI: https://doi.org/10.1007/978-3-030-86472-9_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86471-2

  • Online ISBN: 978-3-030-86472-9

  • eBook Packages: Computer Science, Computer Science (R0)
