Abstract
Automatic interpretation of web tables can enable database-like semantic search over the plethora of information stored in tables on the web. The table interpretation method presented here converts the two-dimensional hierarchy of table headers, which provides a visual means of assimilating complex data, into a set of strings that is more amenable to algorithmic analysis of table structure. We show that Header Paths, a new purely syntactic representation of visual tables, can be readily transformed (“factored”) into several existing representations of structured data, including category trees and relational tables. Detailed examination of over 100 tables reveals which table features require further work.
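To make the idea of header paths and factoring concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: the sample header labels, the sum-of-products notation (`*` within a path, `+` between paths), and the function names `paths_to_string`, `factor`, and `tree_to_string` are all hypothetical. It only shows how a set of header-path strings could be grouped by shared prefixes into a category tree.

```python
from typing import Dict, List

# Hypothetical column header paths for a small table. Each path lists the
# header labels from the top of the header hierarchy down to one column.
column_paths: List[List[str]] = [
    ["Population", "Male"],
    ["Population", "Female"],
    ["Area", "Land"],
    ["Area", "Water"],
]


def paths_to_string(paths: List[List[str]]) -> str:
    """Write the paths as a flat sum-of-products string (assumed notation)."""
    return " + ".join("*".join(path) for path in paths)


def factor(paths: List[List[str]]) -> Dict[str, dict]:
    """Group paths by their first label, recursively, yielding a nested
    dict that plays the role of a category tree."""
    groups: Dict[str, List[List[str]]] = {}
    for path in paths:
        if not path:
            continue  # an exhausted path contributes no further children
        head, rest = path[0], path[1:]
        groups.setdefault(head, []).append(rest)
    return {head: factor(rests) for head, rests in groups.items()}


def tree_to_string(tree: Dict[str, dict]) -> str:
    """Render the category tree as a factored expression."""
    terms = []
    for label, children in tree.items():
        if not children:
            terms.append(label)
        elif len(children) == 1:
            terms.append(f"{label}*{tree_to_string(children)}")
        else:
            terms.append(f"{label}*({tree_to_string(children)})")
    return " + ".join(terms)


flat = paths_to_string(column_paths)
tree = factor(column_paths)
factored = tree_to_string(tree)
print(flat)
print(factored)
```

In this sketch the flat string enumerates every column separately, while the factored form pulls out each shared parent label once, which mirrors the nesting of a category tree. Real web tables would also need row header paths and handling of irregular headers, which this example does not attempt.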
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Embley, D.W., Krishnamoorthy, M., Nagy, G., Seth, S. (2011). Factoring Web Tables. In: Mehrotra, K.G., Mohan, C.K., Oh, J.C., Varshney, P.K., Ali, M. (eds) Modern Approaches in Applied Intelligence. IEA/AIE 2011. Lecture Notes in Computer Science, vol 6703. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21822-4_26
DOI: https://doi.org/10.1007/978-3-642-21822-4_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-21821-7
Online ISBN: 978-3-642-21822-4
eBook Packages: Computer Science (R0)