Abstract
In this paper, we present an ontology of disease-related concepts designed for detecting disease incidence in tweets. Unlike previous keyword-based systems and topic-modeling approaches, our ontological approach lets us apply more stringent criteria, such as spatial and temporal characteristics, when determining which messages are relevant, while giving a stronger guarantee that the resulting models will perform well on new data that may be lexically divergent. We achieve this by training supervised learners on concepts rather than individual words. In effect, we map every possible word to a fixed-length lexicon, thereby eliminating lexical divergence between training data and new data. For training, we use a dataset containing mentions of influenza, the common cold, and Listeria, and we use the learned models to classify datasets containing mentions of an arbitrary selection of other diseases. We show that, compared with a word-level unigram bag-of-words baseline, our ontological approach yields models whose performance is not only good but also stable on lexically divergent data. We also show that word vectors can be learned directly from our concepts to achieve even better results.
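The idea of mapping words to a fixed-length lexicon of concepts can be illustrated with a minimal sketch. The concept names, the word-to-concept table, and the helper functions below are hypothetical placeholders chosen for illustration; they are not the ontology or the pipeline described in the paper.

```python
from collections import Counter

# Hypothetical toy concept lexicon (an assumption, not the paper's ontology).
# Its length is fixed, so every tweet maps to a vector of the same size.
CONCEPTS = ["DISEASE", "SYMPTOM", "SELF_REFERENCE", "TEMPORAL", "OTHER"]

# Hypothetical word-to-concept lookup; a real ontology would be far richer.
WORD_TO_CONCEPT = {
    "flu": "DISEASE", "influenza": "DISEASE",
    "listeria": "DISEASE", "measles": "DISEASE",
    "fever": "SYMPTOM", "cough": "SYMPTOM", "rash": "SYMPTOM",
    "i": "SELF_REFERENCE", "my": "SELF_REFERENCE",
    "today": "TEMPORAL", "yesterday": "TEMPORAL",
}


def to_concepts(text):
    """Replace each whitespace-separated token with its concept label.

    Words missing from the lookup fall back to the catch-all OTHER
    concept, so no token is ever outside the lexicon.
    """
    return [WORD_TO_CONCEPT.get(tok, "OTHER") for tok in text.lower().split()]


def concept_vector(text):
    """Return concept counts in the fixed order given by CONCEPTS."""
    counts = Counter(to_concepts(text))
    return [counts[c] for c in CONCEPTS]


# A tweet about a training disease and a tweet about an unseen disease.
train_tweet = "I have flu and fever today"
new_tweet = "my measles rash started yesterday"

train_vec = concept_vector(train_tweet)
new_vec = concept_vector(new_tweet)
```

Although the two tweets share no surface words, both are expressed over the same five concept dimensions, so a supervised learner trained on vectors like `train_vec` can be applied to `new_vec` without encountering out-of-vocabulary features.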
Notes
- 1.
- 2.
General architecture for text engineering.
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Magumba, M.A., Nabende, P. (2017). An Ontology for Generalized Disease Incidence Detection on Twitter. In: Martínez de Pisón, F., Urraca, R., Quintián, H., Corchado, E. (eds) Hybrid Artificial Intelligent Systems. HAIS 2017. Lecture Notes in Computer Science, vol 10334. Springer, Cham. https://doi.org/10.1007/978-3-319-59650-1_4
DOI: https://doi.org/10.1007/978-3-319-59650-1_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-59649-5
Online ISBN: 978-3-319-59650-1
eBook Packages: Computer Science, Computer Science (R0)