Abstract
Tables in documents are a rich source of information, but they are not yet well utilised computationally because their structure and data are difficult to extract automatically. In this paper, we advance the state of the art in automatic table extraction by identifying common patterns in table headers and using them to develop rules and heuristics for determining table structure. We describe and evaluate a table understanding system built on these patterns and rules.
Notes
- 1.
- 2. XSLT, http://www.w3.org/TR/xslt.
- 3.
- 4. We are grateful to the authors of work for sharing the dataset.
- 5.
- 6.
- 7. The ICDAR dataset only has ground truth for table extraction (locating and segmenting).
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Rastan, R., Paik, H.-Y., Shepherd, J., Haller, A. (2016). Automated Table Understanding Using Stub Patterns. In: Navathe, S., Wu, W., Shekhar, S., Du, X., Wang, X., Xiong, H. (eds) Database Systems for Advanced Applications. DASFAA 2016. Lecture Notes in Computer Science, vol 9642. Springer, Cham. https://doi.org/10.1007/978-3-319-32025-0_33
DOI: https://doi.org/10.1007/978-3-319-32025-0_33
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-32024-3
Online ISBN: 978-3-319-32025-0
eBook Packages: Computer Science, Computer Science (R0)