DOI: 10.1145/2897022.2897029
Research article

Cost-efficient quality assurance of natural language processing tools through continuous monitoring with continuous integration

Published: 14 May 2016

Abstract

More and more modern applications, e.g., Information Extraction (IE) or Question Answering (QA) systems, make use of natural language data. These applications require preprocessing through Natural Language Processing (NLP) pipelines, and their output quality depends on the output quality of those pipelines. If NLP pipelines are applied in different domains, the output quality decreases, and the application requires domain-specific NLP training to improve it.
Adapting NLP tools to specific domains is a time-consuming and expensive task that raises two key questions: a) how many documents need to be annotated to reach good output quality, and b) which NLP tools form the best-performing NLP pipeline? In this paper, we demonstrate a monitoring system based on the principles of Continuous Integration that addresses these questions and guides IE or QA application developers in building high-quality NLP pipelines in a cost-efficient way. The monitoring system builds on common tools that are used in many software engineering projects.



Published In

SER&IP '16: Proceedings of the 3rd International Workshop on Software Engineering Research and Industrial Practice
May 2016
69 pages
ISBN: 9781450341707
DOI: 10.1145/2897022
© 2016 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery, New York, NY, United States

Conference

ICSE '16