Cost-efficient quality assurance of natural language processing tools through continuous monitoring with continuous integration

Authors:

Marc Schreiber,

Albert Zündorf

SER&IP '16: Proceedings of the 3rd International Workshop on Software Engineering Research and Industrial Practice

Pages 46 - 52

https://doi.org/10.1145/2897022.2897029

Published: 14 May 2016

Abstract

More and more modern applications, such as Information Extraction (IE) and Question Answering (QA) systems, make use of natural language data. These applications require preprocessing through Natural Language Processing (NLP) pipelines, and their output quality depends on the output quality of those pipelines. When NLP pipelines are applied to a different domain, the output quality decreases, and the application requires domain-specific NLP training to improve it.

Adapting NLP tools to specific domains is a time-consuming and expensive task, which raises two key questions: a) how many documents need to be annotated to reach good output quality, and b) which NLP tools build the best-performing NLP pipeline? In this paper we demonstrate a monitoring system based on the principles of Continuous Integration that addresses these questions and guides IE or QA application developers in building high-quality NLP pipelines in a cost-efficient way. The monitoring system is built on common tools that are used in many software engineering projects.
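The two questions above can be illustrated with a small sketch. It is not the monitoring system described in the paper; it only assumes that each CI build records an evaluation score (here F1) for every candidate pipeline together with the number of annotated documents available at that build. The function names, the `min_gain` threshold, and the `window` size are hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple


def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Standard F1 measure: harmonic mean of precision and recall."""
    denominator = 2 * true_positives + false_positives + false_negatives
    return 2 * true_positives / denominator if denominator else 0.0


def has_plateaued(curve: List[Tuple[int, float]],
                  min_gain: float = 0.005,
                  window: int = 3) -> bool:
    """Question a): has annotating more documents stopped paying off?

    `curve` holds (annotated_documents, score) pairs, one per CI build,
    oldest first. Returns True when each of the last `window` builds
    improved the score by less than `min_gain` (both values are
    illustrative assumptions, not taken from the paper).
    """
    if len(curve) <= window:
        return False  # not enough builds to judge a trend
    recent = curve[-(window + 1):]
    gains = [later[1] - earlier[1] for earlier, later in zip(recent, recent[1:])]
    return all(gain < min_gain for gain in gains)


def best_pipeline(scores: Dict[str, float]) -> str:
    """Question b): name of the candidate pipeline with the highest score."""
    return max(scores, key=scores.get)
```

A CI job could call `has_plateaued` after every build to signal when further annotation brings little benefit, and `best_pipeline` to report which candidate currently leads. In practice the threshold would need tuning per task and metric.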

Cost-efficient quality assurance of natural language processing tools through continuous monitoring with continuous integration
1. Computing methodologies
  1. Artificial intelligence

Published In

SER&IP '16: Proceedings of the 3rd International Workshop on Software Engineering Research and Industrial Practice

May 2016

69 pages

ISBN: 9781450341707

DOI: 10.1145/2897022

Copyright © 2016 ACM.

© 2016 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Sponsors

ACM: Association for Computing Machinery
SIGSOFT: ACM Special Interest Group on Software Engineering
IEEE-CS\DATC: IEEE Computer Society
TCSE: IEEE Computer Society's Tech. Council on Software Engin.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Bundesministerium für Bildung und Forschung

Conference

ICSE '16

Sponsor:

ACM
SIGSOFT
IEEE-CS\DATC
TCSE

ICSE '16: 38th International Conference on Software Engineering

May 14 - 22, 2016

Austin, Texas

Bibliometrics

Total Citations: 4
Total Downloads: 129
Downloads (Last 12 months): 8
Downloads (Last 6 weeks): 1

Reflects downloads up to 08 Feb 2025

Citations

Cited By

P. Kohl, O. Schmidts, L. Klöser, H. Werth, B. Kraft, and A. Zündorf. "STAMP 4 NLP – An Agile Framework for Rapid Quality-Driven NLP Applications Development". In: Quality of Information and Communications Technology (2021), pp. 156--166. Online publication date: 25-Aug-2021.
https://doi.org/10.1007/978-3-030-85347-1_12
Y. Chen. "A Survey on Industrial Information Integration 2016–2019". In: Journal of Industrial Integration and Management 05.01 (2020), pp. 33--163. Online publication date: 20-Feb-2020.
https://doi.org/10.1142/S2424862219500167
H. Takeuchi, S. Masuda, K. Miyamoto, and S. Akihara. "Obtaining Exhaustive Answer Set for Q&A-based Inquiry System using Customer Behavior and Service Function Modeling". In: Procedia Computer Science 126 (2018), pp. 986--995. Online publication date: 2018.
https://doi.org/10.1016/j.procs.2018.08.033
M. Schreiber, B. Kraft, A. Zündorf, R. Shukla, S. Sen, J. Bishop, and K. Breitman. "Metrics driven research collaboration". In: Proceedings of the 4th International Workshop on Software Engineering Research and Industrial Practice (2017), pp. 41--47. Online publication date: 20-May-2017.
https://dl.acm.org/doi/10.1109/SER-IP.2017..6
