Automatically extracting structure and data from business reports

Published: 01 November 1999

Abstract

    A considerable amount of clean semistructured data is internally available to companies in the form of business reports. However, business reports are untapped for data mining, data warehousing, and querying because they are not in relational form. Business reports have a regular structure that can be reconstructed. We present algorithms that automatically infer the regular structure underlying business reports and automatically generate wrappers to extract relational data.
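
    The abstract does not spell out the algorithms, so the following is only a rough sketch of the general idea (infer a repeating line structure, then generate a regular-expression wrapper for it), not the authors' method. It assumes whitespace-separated fields and treats the most frequent line signature as the detail record. The sample report and the names token_class, signature, build_wrapper, and extract are hypothetical.

```python
import re
from collections import Counter

# Hypothetical sample report; layout and values are invented for illustration.
REPORT = """\
SALES REPORT            PAGE 1
REGION   ITEM      QTY   AMOUNT
EAST     WIDGET     12   340.50
EAST     GADGET      3    99.00
WEST     WIDGET      7   198.45
TOTAL               22   637.95
"""

# One capturing regex fragment per coarse token class.
PATTERNS = {"I": r"(\d+)", "F": r"(\d+\.\d+)", "W": r"(\S+)"}


def token_class(token):
    """Map a token to a coarse class: I (integer), F (decimal), W (anything else)."""
    if re.fullmatch(r"\d+", token):
        return "I"
    if re.fullmatch(r"\d+\.\d+", token):
        return "F"
    return "W"


def signature(line):
    """Abstract a line into a string of token classes, e.g. 'WWIF'."""
    return "".join(token_class(t) for t in line.split())


def build_wrapper(sig):
    """Turn a signature into a compiled regex with one group per field."""
    body = r"\s+".join(PATTERNS[c] for c in sig)
    return re.compile(r"^\s*" + body + r"\s*$")


def extract(report):
    """Return the rows (tuples of strings) matching the most frequent signature."""
    lines = [ln for ln in report.splitlines() if ln.strip()]
    # Assumption: the detail record is the line shape that repeats most often.
    record_sig, _ = Counter(signature(ln) for ln in lines).most_common(1)[0]
    wrapper = build_wrapper(record_sig)
    rows = []
    for ln in lines:
        if signature(ln) != record_sig:
            continue  # headers, totals, and other non-record lines
        match = wrapper.match(ln)
        if match:
            rows.append(match.groups())
    return rows


if __name__ == "__main__":
    for row in extract(REPORT):
        print(row)
```

    On the sample report, the title, column-header, and total lines have different signatures from the three detail lines, so only the detail lines are returned as tuples ready for loading into a relational table. A real report would also need fixed-width columns, multi-line records, and nested groups, none of which this sketch handles.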

    Cited By

    • (2005) Extract List Data from Semi-structured Document Using Clustering. 2005 International Conference on Natural Language Processing and Knowledge Engineering, pages 559-564. DOI: 10.1109/NLPKE.2005.1598800. Online publication date: 2005.
    • (2004) Constraint-based wrapper specification and verification for cooperative information systems. Information Systems, 29(7):617-636. DOI: 10.1016/j.is.2003.12.006. Online publication date: 1-Sep-2004.

    Published In

    CIKM '99: Proceedings of the eighth international conference on Information and knowledge management
    November 1999
    564 pages
    ISBN: 1581131461
    DOI: 10.1145/319950
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. automatic wrapper generation
    2. business reports
    3. data and information extraction
    4. regular expressions
    5. report structure

    Qualifiers

    • Article

    Conference

    CIKM99
    CIKM99: Conference on Information and Knowledge Management
    November 2 - 6, 1999
    Kansas City, Missouri, USA

    Acceptance Rates

    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
