Article

Mining Text Data: Special Features and Patterns

Authors:

Miguel Delgado,

Maria J. Martín-Bautista,

Daniel Sánchez,

María Amparo Vila Miranda

Proceedings of the ESF Exploratory Workshop on Pattern Detection and Discovery

Pages 140 - 153

Published: 16 September 2002

Abstract

Text mining is an increasingly important research field because of the need to obtain knowledge from the enormous number of text documents available, especially on the Web. Text mining and data mining, both part of the field of information mining, are similar in some respects, so it may seem that data mining techniques can be adapted to text in a straightforward way. However, data mining deals with structured data, whereas text has special characteristics and is essentially unstructured. In this context, this paper has three aims:
- To study the particular features of text.
- To identify the patterns we may look for in text.
- To discuss the tools we may use for that purpose.
Regarding the third point, we review existing proposals as well as some new tools we are developing by adapting data mining tools previously developed by our research group.

Published In

Proceedings of the ESF Exploratory Workshop on Pattern Detection and Discovery

September 2002

226 pages

ISBN:3540441484

Publisher

Springer-Verlag

Berlin, Heidelberg
