Discovering all most specific sentences

Authors:

Dimitrios Gunopulos,

Heikki Mannila,

Sanjeev Saluja,

Hannu Toivonen,

Ram Sewak Sharma

ACM Transactions on Database Systems (TODS), Volume 28, Issue 2

Pages 140 - 174

https://doi.org/10.1145/777943.777945

Published: 01 June 2003

Abstract

Data mining can be viewed, in many instances, as the task of computing a representation of a theory of a model or a database, in particular by finding a set of maximally specific sentences satisfying some property. We prove some hardness results that rule out simple approaches to solving the problem.

The a priori algorithm has been successfully applied to many instances of the problem. We analyze this algorithm and prove that it is optimal when the maximally specific sentences are "small". We also point out its limitations.

We then present a new algorithm, the Dualize and Advance algorithm, and prove worst-case complexity bounds that are favorable in the general case. Our results use the concept of hypergraph transversals. Our analysis shows that the a priori algorithm can solve the problem of enumerating the transversals of a hypergraph, improving on previously known results in a special case. On the other hand, using results for the general case of the hypergraph transversal enumeration problem, we can show that the Dualize and Advance algorithm has worst-case running time that is sub-exponential in the output size (i.e., the number of maximally specific sentences).

We further show that the problem of finding maximally specific sentences is closely related to the problem of exact learning with membership queries studied in computational learning theory.
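To make the link between maximally specific sentences and hypergraph transversals concrete, here is a minimal Python sketch of a dualize-and-advance style loop for the frequent-itemset instance of the problem. It is an illustration under stated assumptions, not the authors' implementation: the function names (`minimal_transversals`, `dualize_and_advance`, `frequent`), the toy transactions, and the support threshold are all hypothetical, and the transversal step uses Berge's simple incremental method rather than the algorithm behind the sub-exponential bound mentioned in the abstract. The only assumption about the property is that it is monotone: every subset of an interesting set is itself interesting.

```python
def minimal_transversals(edges):
    """Minimal hitting sets of a hypergraph, via Berge's incremental method."""
    transversals = {frozenset()}  # with no edges, the empty set hits everything
    for edge in edges:
        candidates = set()
        for t in transversals:
            if t & edge:
                candidates.add(t)  # t already hits the new edge
            else:
                # extend t by each vertex of the edge it misses
                candidates.update(t | {v} for v in edge)
        # keep only the inclusion-minimal candidates
        transversals = {t for t in candidates
                        if not any(other < t for other in candidates)}
    return transversals


def dualize_and_advance(universe, is_interesting):
    """Collect all maximal interesting sets using membership queries only."""
    universe = sorted(universe)
    maximal = []
    while True:
        # Dualize: the minimal transversals of the complements are exactly
        # the minimal sets not contained in any maximal set found so far.
        complements = [frozenset(universe) - m for m in maximal]
        seed = next((t for t in minimal_transversals(complements)
                     if is_interesting(t)), None)
        if seed is None:
            return maximal  # every candidate is uninteresting: we are done
        # Advance: greedily grow the interesting seed into a maximal set.
        for item in universe:
            if item not in seed and is_interesting(seed | {item}):
                seed = seed | {item}
        maximal.append(seed)


# Hypothetical toy database: an itemset is interesting if it is frequent.
transactions = [set("abc"), set("ab"), set("ac"), set("bc"), set("d")]


def frequent(itemset, min_support=2):
    return sum(itemset <= t for t in transactions) >= min_support


result = dualize_and_advance("abcd", frequent)
print(sorted("".join(sorted(m)) for m in result))  # ['ab', 'ac', 'bc']
```

In this toy run the loop stops when the minimal transversals of the complements are `{d}` and `{a, b, c}`, neither of which is frequent. Each iteration either certifies that the collection is complete or produces a maximal set not seen before, so the number of dualization steps is tied to the number of maximal sets rather than to the number of all frequent sets.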

Index Terms

Discovering all most specific sentences
1. Information systems
  1. Information retrieval
    1. Information retrieval query processing

Published In

ACM Transactions on Database Systems Volume 28, Issue 2

June 2003

108 pages

ISSN:0362-5915

EISSN:1557-4644

DOI:10.1145/777943

Copyright © 2003 ACM.

Publisher

Association for Computing Machinery

New York, NY, United States
