Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
article

Discovering all most specific sentences

Published: 01 June 2003 Publication History

Abstract

Data mining can be viewed, in many instances, as the task of computing a representation of a theory of a model or a database, in particular by finding a set of maximally specific sentences satisfying some property. We prove some hardness results that rule out simple approaches to solving the problem.The a priori algorithm is an algorithm that has been successfully applied to many instances of the problem. We analyze this algorithm, and prove that is optimal when the maximally specific sentences are "small". We also point out its limitations.We then present a new algorithm, the Dualize and Advance algorithm, and prove worst-case complexity bounds that are favorable in the general case. Our results use the concept of hypergraph transversals. Our analysis shows that the a priori algorithm can solve the problem of enumerating the transversals of a hypergraph, improving on previously known results in a special case. On the other hand, using results for the general case of the hypergraph transversal enumeration problem, we can show that the Dualize and Advance algorithm has worst-case running time that is sub-exponential to the output size (i.e., the number of maximally specific sentences).We further show that the problem of finding maximally specific sentences is closely related to the problem of exact learning with membership queries studied in computational learning theory.

References

[1]
Agrawal, R. C., Aggarwal, C. C., and Prasad, V. V. V. 2000. Depth first generation of long patterns. In Knowledge Discovery and Data Mining, pp. 108--118.]]
[2]
Agrawal, R., Imielinski, T., and Swami, A. 1993. Mining association rules between sets of items in large databases. In Proceedings of ACM SIGMOD Conference on Management of Data (SIGMOD'93) (May). ACM, New York, pp. 207--216.]]
[3]
Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., and Verkamo, A. I. 1996. Fast discovery of association rules. In Advances in Knowledge Discovery and Data Mining, U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Eds. AAAI Press, Menlo Park, Calif., pp. 307--328.]]
[4]
Agrawal, R. and Srikant, R. 1994. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB'94) (Sept.). pp. 487--499.]]
[5]
Agrawal, R. and Srikant, R. 1995. Mining sequential patterns. In Proceedings of the 11th International Conference on Data Engineering (ICDE'95) (Taipei, Taiwan, Mar.). pp. 3--14.]]
[6]
Angluin, D. 1988. Queries and concept learning. Mach. Learn. 2, 4, (Apr.), 319--342.]]
[7]
Bayardo, R. J. 1998. Efficiently mining long patterns from databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York.]]
[8]
Bell, S. 2003. Deciding distinctness of query results by discovered constraints. Manuscript.]]
[9]
Bell, S. and Brockhausen, P. 1995. Discovery of data dependencies in relational databases. Tech. Rep. LS-8 14. Universität Dortmund, Fachbereich Informatik, Lehrstuhl VIII, Künstliche Intelligenz.]]
[10]
Berge, C. 1973. Hypergraphs. Combinatorics of Finite Sets, 3rd ed. North-Holland, Amsterdam, The Netherlands.]]
[11]
Bioch, J. and Ibaraki, T. 1995. Complexity of identification and dualization of positive Boolean functions. Inf. Comput. 123, 1, 50--63.]]
[12]
Bshouty, N. H. 1996. The monotone theory for the pac-model. In Proceedings of the ACM Symposium on Theory of Computing (STOC). ACM, New York.]]
[13]
Bshouty, N. H., Cleve, R., Gavalda, R., Kannan, S., and Tamon, C. 1996. Oracles and queries that are sufficient for exact learning. J. Comput. Syst. Sci. 52, 421--433.]]
[14]
Burdick, D., Calimlim, M., and Gehrke, J. 2001. Mafia: A maximal frequent itemset algorithm for transactional databases. In Proceedings of the International Conference on Data Engineering.]]
[15]
Eiter, T. and Gottlob, G. 1995. Identifying the minimal transversals of a hypergraph and related problems. SIAM J. Comput. 24, 6 (Dec.), 1278--1304.]]
[16]
Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R., Eds. 1996. Advances in Knowledge Discovery and Data Mining. AAAI Press, Menlo Park, Calif.]]
[17]
Fredman, M. L. and Khachiyan, L. 1996. On the complexity of dualization of monotone disjunctive normal forms. J. Algorithms 21, 3 (Nov.), 618--628.]]
[18]
Garey, M. and Johnson, D. 1979. Computers and Intractability---A Guide to the Theory of NP-Completeness. W. H. Freeman, New York.]]
[19]
Gouda, K. and Zaki, M. J. 2001. Efficiently mining maximal frequent itemsets. In ICDM, pp. 163--170.]]
[20]
Gunopulos, D., Khardon, R., Mannila, H., and Toivonen, H. 1997a. Data mining, hypergraph transversals, and machine learning. In Proceedings of the 16th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS'97). ACM, New York.]]
[21]
Gunopulos, D., Mannila, H., and Saluja, S. 1997b. Discovering all most specific sentences by randomized algorithms. In Proceedings of the International Conference on Database Theory (ICDT'97).]]
[22]
Han, J., Pei, J., and Yin, Y. 2000. Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. ACM, New York, pp. 1--12.]]
[23]
Kavvadias, D. and Stavropoulos, E. C. 1999. Evaluation of an algorithm for the transversal hypergraph problem. In Algorithm Engineering, pp. 72--84.]]
[24]
Khardon, R. 1995. Translating between Horn representations and their characteristic models. J. Artif. Intel. Res. 3, 349--372.]]
[25]
Knobbe, A. J. and Adriaans, P. W. 1995. Discovering foreign key relations in relational databases. In Workshop Notes of the ECML-95 Workshop on Statistics, Machine Learning, and Knowledge Discovery in Databases (Heraklion, Crete, Greece, Apr.). pp. 94--99.]]
[26]
Langley, P. 1995. Elements of Machine Learning. Morgan-Kaufmann, San Mateo, Calif.]]
[27]
Lin, D.-I. and Kedem, Z. M. 1998. Pincer search: A new algorithm for discovering the maximum frequent set. In Extending Database Technology, pp. 105--119.]]
[28]
Mannila, H. 1995. Aspects of data mining. In Workshop Notes of the ECML-95 Workshop on Statistics, Machine Learning, and Knowledge Discovery in Databases (Heraklion, Crete, Greece, Apr.). pp. 1--6.]]
[29]
Mannila, H. 1996. Data mining: Machine learning, statistics, and databases. In Proceedings of the 8th International Conference on Scientific and Statistical Database Management (Stockholm, Sweden). pp, 2--9.]]
[30]
Mannila, H. and Räihä, K.-J. 1986. Design by example: An application of Armstrong relations. J. Comput. Syst. Sci. 33, 2, 126--141.]]
[31]
Mannila, H. and Räihä, K.-J. 1992. Design of Relational Databases. Addison-Wesley, Wokingham, U.K.]]
[32]
Mannila, H. and Räihä, K.-J. 1994. Algorithms for inferring functional dependencies. Data Knowl. Eng. 12, 1 (Feb.), 83--99.]]
[33]
Mannila, H. and Toivonen, H. 1997. Levelwise search and borders of theories in knowledge discovery. Data Mining Knowl. Disc. 1, 3, 241--258.]]
[34]
Mannila, H., Toivonen, H., and Verkamo, A. I. 1994. Efficient algorithms for discovering association rules. In Knowledge Discovery in Databases, Papers from the 1994 AAAI Workshop (KDD'94) (Seattle, Wash., July), pp. 181--192.]]
[35]
Mannila, H., Toivonen, H., and Verkamo, A. I. 1995. Discovering frequent episodes in sequences. In Proceedings of the 1st International Conference on Knowledge Discovery and Data Mining (KDD'95) (Montreal, Ont., Canada, Aug.). AAAI Press, pp. 210--215.]]
[36]
Mishra, N. and Pitt, L. 1997. Generating all maximal independent sets of bounded-degree hypergraphs. In Proceedings of the Conference on Computational Learning Theory (COLT). pp. 211--217.]]
[37]
Mitchell, T. M. 1982. Generalization as search. Artif. Intel. 18, 203--226.]]
[38]
Rymon, R. 1992. Search through systematic set enumeration. In Proceedings of the International Conference on Principles of Knowledge Representation and Reasoning.]]
[39]
Schlimmer, J. 1993. Using learned dependencies to automatically construct sufficient and sensible editing views. In Knowledge Discovery in Databases, Papers from the 1993 AAAI Workshop (KDD'93) (Washington, D.C.). AAAI Press, pp. 186--196.]]
[40]
Ullman, J. D. 1988. Principles of Database and Knowledge-Base Systems, vol. I. Computer Science Press, Rockville, Md.]]
[41]
Valiant, L. G. 1979. The complexity of enumeration and reliability problems. SIAM J. Comput. 8, 3, 410--421.]]
[42]
Zheng, Z., Kohavi, R., and Mason, L. 2001. Real world performance of association rule algorithms. In Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York.]]

Cited By

View all
  • (2023)Equitable Top-k Results for Long Tail DataProceedings of the ACM on Management of Data10.1145/36267271:4(1-24)Online publication date: 12-Dec-2023
  • (2023)Composite Object Normal Forms: Parameterizing Boyce-Codd Normal Form by the Number of Minimal KeysProceedings of the ACM on Management of Data10.1145/35886931:1(1-25)Online publication date: 30-May-2023
  • (2023)Selecting Sub-tables for Data Exploration2023 IEEE 39th International Conference on Data Engineering (ICDE)10.1109/ICDE55515.2023.00192(2496-2509)Online publication date: Apr-2023
  • Show More Cited By

Index Terms

  1. Discovering all most specific sentences

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Transactions on Database Systems
    ACM Transactions on Database Systems  Volume 28, Issue 2
    June 2003
    108 pages
    ISSN:0362-5915
    EISSN:1557-4644
    DOI:10.1145/777943
    Issue’s Table of Contents

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 June 2003
    Published in TODS Volume 28, Issue 2

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. Data mining
    2. association rules
    3. learning with membership queries
    4. maximal frequent sets
    5. minimal keys

    Qualifiers

    • Article

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)19
    • Downloads (Last 6 weeks)1
    Reflects downloads up to 30 Aug 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2023)Equitable Top-k Results for Long Tail DataProceedings of the ACM on Management of Data10.1145/36267271:4(1-24)Online publication date: 12-Dec-2023
    • (2023)Composite Object Normal Forms: Parameterizing Boyce-Codd Normal Form by the Number of Minimal KeysProceedings of the ACM on Management of Data10.1145/35886931:1(1-25)Online publication date: 30-May-2023
    • (2023)Selecting Sub-tables for Data Exploration2023 IEEE 39th International Conference on Data Engineering (ICDE)10.1109/ICDE55515.2023.00192(2496-2509)Online publication date: Apr-2023
    • (2023)Frequent PatternEncyclopedia of Machine Learning and Data Science10.1007/978-1-4899-7502-7_106-1(1-6)Online publication date: 12-Apr-2023
    • (2022)Interpretable Reinforcement Learning with Multilevel Subgoal Discovery2022 21st IEEE International Conference on Machine Learning and Applications (ICMLA)10.1109/ICMLA55696.2022.00043(251-258)Online publication date: Dec-2022
    • (2022)Introducing the closure structure and the GDPM algorithm for mining and understanding a tabular datasetInternational Journal of Approximate Reasoning10.1016/j.ijar.2021.12.012145:C(75-90)Online publication date: 1-Jun-2022
    • (2022)Fast Estimation of the Pattern Frequency SpectrumMachine Learning and Knowledge Discovery in Databases10.1007/978-3-662-44851-9_8(114-129)Online publication date: 10-Mar-2022
    • (2021)Cache-based GNN System for Dynamic GraphsProceedings of the 30th ACM International Conference on Information & Knowledge Management10.1145/3459637.3482237(937-946)Online publication date: 26-Oct-2021
    • (2021)Finding Route Hotspots in Large Labeled NetworksIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2019.295692433:6(2479-2492)Online publication date: 1-Jun-2021
    • (2021)Tactical information aggregation2021 International Conference on Military Communication and Information Systems (ICMCIS)10.1109/ICMCIS52405.2021.9486412(1-7)Online publication date: 4-May-2021
    • Show More Cited By

    View Options

    Get Access

    Login options

    Full Access

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media