
Corpus Building for Corporate Knowledge Discovery and Management: A Case Study of Manufacturing

  • Conference paper
Knowledge-Based Intelligent Information and Engineering Systems (KES 2007)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4692)

Abstract

Building a collection of electronic documents, i.e. a corpus, is a cornerstone of research in information retrieval, text mining and knowledge management. Very few papers in the literature have discussed the concerns involved in building a corpus or explained the building process systematically. In this paper, we describe our work on building an enterprise corpus called manufacturing corpus version 1 (MCV1) for corporate knowledge management purposes. We discuss the relevant issues, e.g. input texts, category labels and policies, as well as the parallel coding process and quality measurements. Real-world automated text classification experiments based on MCV1 show the soundness of its coding process. Finally, we suggest how the proposed approach can be implemented more economically.



Editor information

Bruno Apolloni Robert J. Howlett Lakhmi Jain

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Liu, Y., Loh, H.T. (2007). Corpus Building for Corporate Knowledge Discovery and Management: A Case Study of Manufacturing. In: Apolloni, B., Howlett, R.J., Jain, L. (eds) Knowledge-Based Intelligent Information and Engineering Systems. KES 2007. Lecture Notes in Computer Science, vol 4692. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74819-9_67


  • DOI: https://doi.org/10.1007/978-3-540-74819-9_67

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-74817-5

  • Online ISBN: 978-3-540-74819-9

  • eBook Packages: Computer Science (R0)
