Efficient Term Set Prediction Using the Bell-Wigner Inequality

Melucci, Massimo

doi:10.1007/978-3-319-23826-5_5

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9309)

Included in the following conference series:

International Symposium on String Processing and Information Retrieval

Abstract

Measuring the dependence between terms is computationally expensive for information retrieval (IR) systems, which have to deal with large and sparse datasets. Current approaches to mining frequent term sets enumerate the term sets found in a set of documents and exploit monotonicity, the property that a term set is frequent only if all its subsets are frequent, as implemented by Apriori. However, the computational time can be very large. An alternative is to store the dataset in a frequent pattern tree (FPT) and to visit and prune the tree recursively, as implemented by FPGrowth. However, the storage space can still be very large. We introduce the Bell-Wigner inequality (BWI) as a conceptual enhancement of monotonicity that predicts with certainty when an itemset is frequent and when it is infrequent. We describe an empirical validation showing that the BWI can significantly reduce both the computational time of Apriori and the storage space of pattern-tree-based algorithms such as FPGrowth. The validation was performed using runs produced by IR systems on the TIPSTER test collection.
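
To make the two ideas in the abstract concrete, here is a minimal Python sketch. It is not the chapter's algorithm, whose details are not available on this page. It assumes that the inequalities in question are the classical three-event inequalities that follow from inclusion-exclusion, and it uses them to bound the support count of a three-term set from the counts of its single terms and pairs. The function names, the toy documents and the threshold `minsup` are all hypothetical.

```python
from itertools import combinations


def count(itemset, docs):
    """Number of documents that contain every term of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= d for d in docs)


def survives_monotonicity(candidate, frequent_smaller):
    """Apriori-style pruning: a k-set can be frequent only if
    all of its (k-1)-subsets are already known to be frequent."""
    k = len(candidate)
    return all(frozenset(s) in frequent_smaller
               for s in combinations(candidate, k - 1))


def triple_count_bounds(n, n1, n2, n3, n12, n13, n23):
    """Bounds on the count of {1,2,3} implied by inclusion-exclusion
    (assumed here to be the Bell-Wigner-type inequalities)."""
    lower = max(0, n12 + n13 - n1, n12 + n23 - n2, n13 + n23 - n3)
    upper = min(n12, n13, n23, n - n1 - n2 - n3 + n12 + n13 + n23)
    return lower, upper


def predict(bounds, minsup):
    """Decide from the bounds alone, without counting the triple."""
    lower, upper = bounds
    if lower >= minsup:
        return "frequent"
    if upper < minsup:
        return "infrequent"
    return "undecided"  # the triple still has to be counted


# Toy collection of five documents (hypothetical data).
docs = [frozenset(d) for d in
        ({"a", "b", "c"}, {"a", "b", "c"}, {"a", "b", "c"},
         {"a", "b"}, {"b", "c"})]

bounds = triple_count_bounds(
    len(docs),
    count("a", docs), count("b", docs), count("c", docs),
    count("ab", docs), count("ac", docs), count("bc", docs))
```

On this toy collection the lower and upper bounds coincide, so the status of `{a, b, c}` is decided without scanning the documents for the triple. In general the bounds leave an undecided interval, and only candidates that fall inside it would need an explicit count.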

Author information

Authors and Affiliations

Department of Information Engineering, University of Padua, Padua, Italy
Massimo Melucci

Corresponding author

Correspondence to Massimo Melucci.

Editor information

Editors and Affiliations

King's College London, London, United Kingdom
Costas Iliopoulos
University of Helsinki, Helsinki, Finland
Simon Puglisi
University College London, London, United Kingdom
Emine Yilmaz

Cite this paper

Melucci, M. (2015). Efficient Term Set Prediction Using the Bell-Wigner Inequality. In: Iliopoulos, C., Puglisi, S., Yilmaz, E. (eds) String Processing and Information Retrieval. SPIRE 2015. Lecture Notes in Computer Science, vol 9309. Springer, Cham. https://doi.org/10.1007/978-3-319-23826-5_5

DOI: https://doi.org/10.1007/978-3-319-23826-5_5
Published: 05 September 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23825-8
Online ISBN: 978-3-319-23826-5
eBook Packages: Computer Science, Computer Science (R0)
