Table-processing paradigms: a research survey

  • Original Paper
International Journal of Document Analysis and Recognition (IJDAR)

Abstract

Tables are a ubiquitous form of communication. While everyone seems to know what a table is, a precise, analytical definition of “tabularity” remains elusive because some bureaucratic forms, multicolumn text layouts, and schematic drawings share many characteristics of tables. There are significant differences between typeset tables, electronic files designed for display of tables, and tables in symbolic form intended for information retrieval. Most past research has addressed the extraction of low-level geometric information from raster images of tables scanned from printed documents, although there is growing interest in the processing of tables in electronic form as well. Recent research on table composition and table analysis has improved our understanding of the distinction between the logical and physical structures of tables, and has led to improved formalisms for modeling tables. This review, which is structured in terms of generalized paradigms for table processing, indicates that progress on half-a-dozen specific research issues would open the door to using existing paper and electronic tables for database update, tabular browsing, structured information retrieval through graphical and audio interfaces, multimedia table editing, and platform-independent display.

About this article

Cite this article

Embley, D.W., Hurst, M., Lopresti, D. et al. Table-processing paradigms: a research survey. IJDAR 8, 66–86 (2006). https://doi.org/10.1007/s10032-006-0017-x

  • DOI: https://doi.org/10.1007/s10032-006-0017-x
