DOI: 10.3115/1034678.1034679

Untangling text data mining

Published: 20 June 1999


The possibilities for data mining from large text collections are virtually untapped. Text expresses a vast, rich range of information, but encodes this information in a form that is difficult to decipher automatically. Perhaps for this reason, there has been little work in text data mining to date, and most people who have talked about it have either conflated it with information access or have not made use of text directly to discover heretofore unknown information.
In this paper I will first define data mining, information access, and corpus-based computational linguistics, and then discuss the relationship of these to text data mining. The intent behind these contrasts is to draw attention to exciting new kinds of problems for computational linguists. I describe examples of what I consider to be real text data mining efforts and briefly outline recent ideas about how to pursue exploratory data analysis over text.






Published In

ACL '99: Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics
June 1999
642 pages

Publisher: Association for Computational Linguistics, United States


Acceptance Rates

Overall Acceptance Rate 85 of 443 submissions, 19%

