Document Representation Based on Maximal Frequent Sequence Sets

Hernández-Reyes, Edith; Martínez-Trinidad, J. Fco.; Carrasco-Ochoa, J. A.; García-Hernández, René A.

doi:10.1007/11892755_88

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 4225)

Included in the following conference series:

Iberoamerican Congress on Pattern Recognition

Abstract

In document clustering, documents are commonly represented with the vector space model as word vectors whose features correspond to the words of the documents. However, a document set contains a large number of distinct words, so the vectors can be enormous. The vector space model also ignores word order, which could be useful for grouping similar documents. To reduce these disadvantages, we propose a new document representation in which each document is represented as the set of its maximal frequent sequences. The proposed representation is applied to document clustering, the quality of the clustering is evaluated through internal and external measures, and the results are compared with those obtained with the vector space model.
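
The idea of representing each document by a set of maximal frequent sequences can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes that sequences are contiguous word runs, that a sequence is frequent when it occurs in at least `min_support` documents, and that document similarity is measured with the Jaccard coefficient. The function names, the brute-force search, and the toy documents are all hypothetical and chosen only for illustration.

```python
def ngrams(words, n):
    """Set of contiguous word sequences of length n in one document."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def frequent_sequences(docs, min_support):
    """All contiguous sequences that occur in at least min_support documents.

    Brute-force, level by level. A longer sequence can only be frequent if
    its shorter prefixes are, so the search stops at the first empty level.
    """
    frequent = set()
    n = 1
    while True:
        counts = {}
        for words in docs:
            for gram in ngrams(words, n):  # counted once per document
                counts[gram] = counts.get(gram, 0) + 1
        level = {g for g, c in counts.items() if c >= min_support}
        if not level:
            return frequent
        frequent |= level
        n += 1


def contains(long_seq, short_seq):
    """True if short_seq occurs as a contiguous run inside long_seq."""
    k = len(short_seq)
    return any(long_seq[i:i + k] == short_seq
               for i in range(len(long_seq) - k + 1))


def maximal_frequent_sequences(docs, min_support):
    """Frequent sequences not contained in any longer frequent sequence."""
    frequent = frequent_sequences(docs, min_support)
    return {s for s in frequent
            if not any(len(t) > len(s) and contains(t, s) for t in frequent)}


def represent(docs, min_support):
    """Map each document to the set of maximal frequent sequences it contains."""
    mfs = maximal_frequent_sequences(docs, min_support)
    return [{s for s in mfs if contains(tuple(words), s)} for words in docs]


def jaccard(a, b):
    """Jaccard similarity between two sets (assumed measure, not from the paper)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Toy corpus, invented for this example.
docs = [
    "the vector space model ignores word order".split(),
    "the vector space model is large".split(),
    "word order helps clustering".split(),
]
reps = represent(docs, min_support=2)
print(reps[0])                    # sequences shared with other documents
print(jaccard(reps[0], reps[1]))  # similarity of the first two documents
```

With this toy corpus and `min_support=2`, each document is described by at most two sequences instead of a vector over all eleven distinct words, and the sequences retain word order. Any clustering method that accepts a pairwise similarity could then be applied to these sets.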

Author information

Authors and Affiliations

National Institute for Astrophysics, Optics and Electronics, Luis Enrique Erro No.1 Sta. Ma. Tonantzintla, Puebla, México
Edith Hernández-Reyes, J. Fco. Martínez-Trinidad, J. A. Carrasco-Ochoa & René A. García-Hernández

Editor information

Editors and Affiliations

Computer Science Department, National Institute of Astrophysics, Optics and Electronics (INAOE), Luis Enrique Erro No. 1, 72840 Sta. Maria Tonantzintla, Puebla, Mexico
José Francisco Martínez-Trinidad
Computer Science Department, National Institute of Astrophysics, Optics and Electronics (INAOE), Luis Enrique Erro No. 1, 72840 Sta. Maria Tonantzintla, Puebla, Mexico
Jesús Ariel Carrasco Ochoa
Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, GU2 7XH, UK
Josef Kittler

About this paper

Cite this paper

Hernández-Reyes, E., Martínez-Trinidad, J.F., Carrasco-Ochoa, J.A., García-Hernández, R.A. (2006). Document Representation Based on Maximal Frequent Sequence Sets. In: Martínez-Trinidad, J.F., Carrasco Ochoa, J.A., Kittler, J. (eds) Progress in Pattern Recognition, Image Analysis and Applications. CIARP 2006. Lecture Notes in Computer Science, vol 4225. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11892755_88

DOI: https://doi.org/10.1007/11892755_88
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-46556-0
Online ISBN: 978-3-540-46557-7
eBook Packages: Computer Science, Computer Science (R0)
