Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
Skip to main content

Efficient discovery of optimal word-association patterns in large text databases

  • Special Feature
  • Published:
New Generation Computing Aims and scope Submit manuscript

Abstract

We study efficient discovery of proximity word-association patterns, defined by a sequence of strings and a proximity gap, from a collection of texts with the positive and the negative labels. We present an algorithm that finds alld-stringsk-proximity word-association patterns that maximize the number of texts whose matching agree with their labels. It runs in expected time complexityO(k d−1n logd n) and spaceO(k d−1n) with the total lengthn of texts, if texts are uniformly random strings. We also show that the problem to find one of the best word-association patterns with arbitrarily many strings in MAX SNP-hard.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

Explore related subjects

Discover the latest articles, news and stories from top researchers in related subjects.

References

  1. Agrawal, R., Imielinski, T. and Swami, A., “Mining Association Rules between Sets of Items in Large Databases,” inProc. 1993 SIGMOD, pp. 207–216, 1993.

  2. Ausiello, G., Crescenzi, P. and Protasi, M., “Approximate Solution of NP Optimization Problems,”Theor. Comput. Sc., 150, pp. 1–55, 1995.

    Article  MATH  MathSciNet  Google Scholar 

  3. Arimura, H., Wataki, A., Fujino, R. and Arikawa, S., “A Fast Algorithm for Discovering Optimal String Patterns in Large Text Databases,” inProc. the 9th Int. Workshop on Algorithmic Learning Theory, LNAI 1501, pp. 247–261, 1998.

  4. Arimura, H., Kasai, T., Wataki, A., Fujino, R., Shimozono, S. and Arikawa, S., “An Efficient Tool for Discovering Simple Combinatorial Patterns from Large Text Databases,” inProc. the 1st Discovery Science, LNAI 1532, pp. 393–394, 1998.

  5. Devroye, L., Szpankowski, W. and Rais, B., “A Note on the Height of the Suffix Trees,”SIAM J. Comput.,21, pp. 48–53, 1992.

    Article  MATH  MathSciNet  Google Scholar 

  6. Dobkin, D. P. Gunopulos, D. and Maass, W., “Computing the Maximum Bichromatic Discrepancy, with Applications to Computer Graphics and Machine Learning,”J. Comut. Sys. Sci., 52, pp. 453–470, 1996.

    Article  MATH  MathSciNet  Google Scholar 

  7. Fukuda, T. Morimoto, Y., Morishita, S. and Tokuyama, T., “Data Mining Using Two-Dimensional Optimized Association Rules,” inProc. 1996 SIGMOD, pp. 13–23, 1996.

  8. Johnson, D. S., “Approximation Algorithms for Combinatorial Problems,”J. Comut. Sys. Sci., 9, pp. 256–278, 1974.

    Article  MATH  Google Scholar 

  9. Kearns, M. J., Shapire, R. E. and Sellie, L. M., “Toward Efficient Agnostic Learning,”Machine Learning, 17, pp. 115–141, 1994.

    MATH  Google Scholar 

  10. Manber, U. and Baeza-Yates, R., “An Algorithm for String Matching with a Sequence of Don’t Cares,”Inf. Procc. Lett., 37, pp. 133–136, 1991.

    Article  MATH  MathSciNet  Google Scholar 

  11. McCreight, E. M., “A Space-Economical Suffix Tree Construction Algorithm,”J. ACM, 23, pp. 262–272, 1976.

    Article  MATH  MathSciNet  Google Scholar 

  12. Miyano, S., Shinohara, A. and Shinohara, T., “Which Classes of Elementary Formal Systems are Polynomial-Time Learnable,” inProc. 2nd Workshop on Algorithmic Learning Theory, pp. 139–150, 1991.

  13. Papadimitriou, C. H. and Yannakakis, M., “Optimization, Approximation, and Complexity Classes,”J. Comput. Sys. Sci., 43, pp. 425–440, 1991.

    Article  MATH  MathSciNet  Google Scholar 

  14. Wang, J. T.-L., Chirn, G.-W., Marr, T. G., Shapiro, B., Shasha, D. and Zhang., K., “Combinatorial Pattern Discovery for Scientific Data: Some preliminary results,” inProc. 1994 SIGMOD, pp. 115–125, 1994.

Download references

Author information

Authors and Affiliations

Authors

Additional information

Shinichi Shimozono, Ph.D.: He is an Associate Professor of the Department of Artificial Intelligence at Kyushu Institute of Technology Iizuka, Japan. He obtained the B.S. degree in Physics from Kyushu University, awarded M.S. degree from Graduate School of Information Science in Kyushu University, and his Dr. Sci. degree in 1996 from Kyushu University. His research interests are primarily in the design and analysis of algorithms for intractable problems.

Hiroki Arimura, Ph.D.: He is an Associate Professor of the Department of Informatics at Kyushu University, Fukuoka, Japan. He is also a researcher with Precursory Research for Embryonic Science and Technology, Japan Science and Technology Corporation (JST) since 1999. He received the B.S. degree in 1988 in Physics, the M.S. degree in 1979 and the Dr.Sci. degree in 1994 in Information Systems from Kyushu University. His research interests include data mining, computational learning theory, and inductive logic programming.

Setsuo Arikawa, Ph.D.: He is a Professor of the Department of Informatics and the Director of University Library at Kyushu University, Fukuoka, Japan. He received the B.S. degree in 1964, the M.S. degree in 1966 and the Dr.Sci. degree in 1969 all in Mathematics from Kyushu University. His research interests include Discovery Science, Algorithmic Learning Theory, Logic and Inference/Reasoning in AI, Pattern Matching Algorithms and Library Science. He is the principal investigator of the Discovery Science Project sponsored by the Grant-in Aid for Scientific Research on Priority Area from the Ministry of ESSC, Japan.

About this article

Cite this article

Shimozono, S., Arimura, H. & Arikawa, S. Efficient discovery of optimal word-association patterns in large text databases. New Gener Comput 18, 49–60 (2000). https://doi.org/10.1007/BF03037568

Download citation

  • Received:

  • Issue Date:

  • DOI: https://doi.org/10.1007/BF03037568

Keywords