Abstract
Building a collection of electronic documents, i.e. a corpus, is a cornerstone of research in information retrieval, text mining and knowledge management. Few papers in the literature have discussed the concerns involved in building a corpus or explained the building process systematically. In this paper, we describe our work in building an enterprise corpus, called Manufacturing Corpus Version 1 (MCV1), for corporate knowledge management purposes. We discuss relevant issues, e.g. input texts, category labels and policies, as well as the parallel coding process and quality measurements. Real-world automated text classification experiments based on MCV1 show the soundness of its coding process. Finally, we suggest how the proposed approach can be implemented more economically.
© 2007 Springer-Verlag Berlin Heidelberg
Cite this paper
Liu, Y., Loh, H.T. (2007). Corpus Building for Corporate Knowledge Discovery and Management: A Case Study of Manufacturing. In: Apolloni, B., Howlett, R.J., Jain, L. (eds) Knowledge-Based Intelligent Information and Engineering Systems. KES 2007. Lecture Notes in Computer Science, vol 4692. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74819-9_67
DOI: https://doi.org/10.1007/978-3-540-74819-9_67
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-74817-5
Online ISBN: 978-3-540-74819-9
eBook Packages: Computer Science, Computer Science (R0)