DOI: 10.1145/1816123.1816127
research-article

ProcessTron: efficient semi-automated markup generation for scientific documents

Published: 21 June 2010

Abstract

Digitizing legacy documents and marking them up with XML is important for many scientific domains. However, creating comprehensive semantic markup of high quality is challenging. The markup process consists of many steps that alternate automated markup generation with manual correction, and these corrections are extremely laborious. To reduce this effort, this paper makes two contributions. First, it proposes ProcessTron, a lightweight markup-process-control mechanism. ProcessTron assists users in two ways: it ensures that the steps are executed in the appropriate order, and it points the user to possible errors during manual correction. Second, the paper reports on our experiences from deploying ProcessTron in real-world projects. A core observation is that ProcessTron more than halves the time users need to mark up a document. Results from laboratory experiments that we also conducted confirm this finding.
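
The two kinds of assistance named in the abstract, step ordering and error pointing, can be illustrated with a minimal sketch. It is not ProcessTron's actual implementation, which the abstract does not describe. It assumes that each step comes with a test returning the elements that still look wrong, and that the next step to run is the first one whose test finds something. The element names (`paragraph`, `taxonName`), the attributes (`page`, `rank`), and the names `Step`, `STEPS`, and `next_step` are all hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    """One markup step plus a test that finds elements still needing work."""
    name: str
    find_errors: Callable[[ET.Element], List[ET.Element]]


# Hypothetical tests; a real deployment would define its own per schema.
def paragraphs_without_page(root: ET.Element) -> List[ET.Element]:
    return [p for p in root.iter("paragraph") if "page" not in p.attrib]


def taxon_names_without_rank(root: ET.Element) -> List[ET.Element]:
    return [t for t in root.iter("taxonName") if "rank" not in t.attrib]


# Steps are listed in the order in which they should be completed.
STEPS = [
    Step("assign page numbers", paragraphs_without_page),
    Step("assign taxon name ranks", taxon_names_without_rank),
]


def next_step(
    root: ET.Element, steps: List[Step]
) -> Optional[Tuple[Step, List[ET.Element]]]:
    """Return the first step whose test reports suspect elements,
    together with those elements, or None when every test passes."""
    for step in steps:
        errors = step.find_errors(root)
        if errors:
            return step, errors
    return None


doc = ET.fromstring(
    '<doc><paragraph page="1">A <taxonName>Apis</taxonName></paragraph>'
    "<paragraph>B</paragraph></doc>"
)
pending = next_step(doc, STEPS)
if pending is not None:
    step, errors = pending
    print(f"next step: {step.name} ({len(errors)} element(s) to check)")
```

In this sketch the document's own state decides what happens next: a later step is not offered until the tests of all earlier steps pass, and the elements returned by a failing test are the ones a user would be pointed to during manual correction.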


Published In

JCDL '10: Proceedings of the 10th annual joint conference on Digital libraries
June 2010
424 pages
ISBN: 9781450300858
DOI: 10.1145/1816123
Sponsors

In-Cooperation

  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data-driven markup process control
  2. semantic XML markup

Qualifiers

  • Research-article

Conference

JCDL '10: Joint Conference on Digital Libraries
June 21-25, 2010
Gold Coast, Queensland, Australia

Acceptance Rates

Overall Acceptance Rate 415 of 1,482 submissions, 28%
