Abstract
Document classification is challenging because it must handle voluminous, highly non-linear data that is being generated at an exponential rate in the era of digitization. Proper representation of documents improves the efficiency and performance of classification, which serves the ultimate goal of retrieving information from a large corpus. Deep neural network models learn features for document classification, unlike engineered-feature-based approaches, in which features are extracted or selected from the data. In this paper we investigate the performance of different classifiers based on features obtained using the two approaches. We apply a deep autoencoder to learn features, while engineered features are extracted by exploiting semantic associations among the terms of the documents. In our experiments, classification based on learned features always performed better than the proposed engineered-feature-based classifiers.
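To make the idea of learned features concrete, the following is a minimal sketch of an autoencoder compressing bag-of-words vectors into a short feature vector. It is not the authors' model: it assumes a single hidden layer rather than a deep network, binary term-presence inputs, sigmoid units, squared-error loss, and plain stochastic gradient descent. The toy corpus and the names `bag_of_words` and `TinyAutoencoder` are hypothetical and exist only for illustration.

```python
import math
import random


def bag_of_words(docs):
    """Map each document to a binary term-presence vector over the corpus vocabulary."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    vectors = [[1.0 if w in d.lower().split() else 0.0 for w in vocab] for d in docs]
    return vocab, vectors


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyAutoencoder:
    """Assumed toy model: one hidden layer, sigmoid units, squared error, plain SGD."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        self.W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.W2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]
        self.b2 = [0.0] * n_in

    def encode(self, x):
        # The hidden activations are the learned features.
        return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(self.W1, self.b1)]

    def decode(self, h):
        return [sigmoid(sum(w * hi for w, hi in zip(row, h)) + b)
                for row, b in zip(self.W2, self.b2)]

    def loss(self, data):
        # Mean over documents of the summed squared reconstruction error.
        total = 0.0
        for x in data:
            y = self.decode(self.encode(x))
            total += sum((yi - xi) ** 2 for yi, xi in zip(y, x))
        return total / len(data)

    def train_step(self, x, lr):
        h = self.encode(x)
        y = self.decode(h)
        # Error signal at the output pre-activations for 0.5 * sum((y - x)^2).
        d_out = [(yi - xi) * yi * (1.0 - yi) for yi, xi in zip(y, x)]
        # Back-propagate to the hidden layer before W2 is modified.
        d_hid = [hj * (1.0 - hj) * sum(d_out[i] * self.W2[i][j] for i in range(len(y)))
                 for j, hj in enumerate(h)]
        for i, di in enumerate(d_out):
            for j, hj in enumerate(h):
                self.W2[i][j] -= lr * di * hj
            self.b2[i] -= lr * di
        for j, dj in enumerate(d_hid):
            for k, xk in enumerate(x):
                self.W1[j][k] -= lr * dj * xk
            self.b1[j] -= lr * dj


# Hypothetical four-document corpus with two loose topics.
docs = ["stock market trading", "market shares stock",
        "football match goal", "goal scored match"]
vocab, X = bag_of_words(docs)

ae = TinyAutoencoder(n_in=len(vocab), n_hidden=2)
loss_before = ae.loss(X)
for _ in range(2000):
    for x in X:
        ae.train_step(x, lr=0.5)
loss_after = ae.loss(X)

# Two learned features per document, usable as input to any classifier.
features = [ae.encode(x) for x in X]
```

In this sketch the two-dimensional `features` play the role of learned features, while the raw term-presence vectors in `X` stand in for a simple hand-built representation. A deep autoencoder as described in the abstract would stack several such encoding layers.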
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Sen, A., Ghosh, S., Kundu, D., Sarkar, D., Sil, J. (2017). Study of Engineered Features and Learning Features in Machine Learning - A Case Study in Document Classification. In: Basu, A., Das, S., Horain, P., Bhattacharya, S. (eds) Intelligent Human Computer Interaction. IHCI 2016. Lecture Notes in Computer Science, vol 10127. Springer, Cham. https://doi.org/10.1007/978-3-319-52503-7_13
DOI: https://doi.org/10.1007/978-3-319-52503-7_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-52502-0
Online ISBN: 978-3-319-52503-7
eBook Packages: Computer Science (R0)