column

The Data Problem in Data Mining

Author:

Albrecht ZimmermannAuthors Info & Claims

ACM SIGKDD Explorations Newsletter, Volume 16, Issue 2

Pages 38 - 45

https://doi.org/10.1145/2783702.2783706

Published: 21 May 2015 Publication History

Abstract

Computer science is essentially an applied or engineering science, creating tools. In Data Mining, those tools are supposed to help humans understand large amounts of data. In this position paper, I argue that for all the progress that has been made in Data Mining, in particular Pattern Mining, we are lacking insight into three key aspects: 1) How pattern mining algorithms perform quantitatively, 2) How to choose parameter settings, and 3) How to relate found patterns to the processes that generated the data. I illustrate the issue by surveying existing work in light of these concerns and pointing to the (relatively few) papers that have attempted to fill in the gaps. I argue further that progress regarding those questions is held back by a lack of data with varying, controlled properties, and that this lack is unlikely to be remedied by the ever increasing collection of real-life data. Instead, I am convinced that we will need to make a science of digital data generation, and use it to develop guidance to data practitioners.

References

[1]

Corsika - an air shower simulation program, https://web.ikp.kit.edu/corsika/.

[2]

I. Adá and M. R. Berthold. The new iris data: modular data generators. In B. Rao, B. Krishnapuram, A. Tomkins, and Q. Yang, editors, KDD, pages 413--422. ACM, 2010.

Digital Library

[3]

R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Databases, pages 487--499, Santiago de Chile, Chile, Sept. 1994. Morgan Kaufmann.

Digital Library

[4]

C. Anderson. The end of theory: The data deluge makes the scienti_c method obsolete. http://archive.wired.com/science/discoveries/ magazine/16-07/pb_theory. Accessed 08/21/2014.

[5]

R. J. Bayardo Jr., B. Goethals, and M. J. Zaki, editors. FIMI '04, Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations, Brighton, UK, November 1, 2004, 2004.

[6]

J. Besson, C. Rigotti, I. Mitasiunaite, and J.-F. Boulicaut. Parameter tuning for di_erential mining of string patterns. In ICDM Workshops, pages 77--86. IEEE Computer Society, 2008.

Digital Library

[7]

T. D. Bie. Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Min. Knowl. Discov., 23(3):407--446, 2011.

Digital Library

[8]

C. Blake and C. Merz. UCI repository of machine learning databases, 1998.

[9]

M. Boley, T. Gärtner, and H. Grosskreutz. Formal concept sampling for counting and threshold-free local pattern mining. In SDM, pages 177--188. SIAM, 2010.

[10]

M. Boley and H. Grosskreutz. A randomized approach for approximating the number of frequent sets. In ICDM, pages 43--52. IEEE Computer Society, 2008.

Digital Library

[11]

B. Bringmann and A. Zimmermann. Tree2 - Decision trees for tree structured data. In A. Jorge, L. Torgo, P. Brazdil, R. Camacho, and J. Gama, editors, 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, pages 46--58. Springer, 2005.

[12]

D. Chakrabarti and C. Faloutsos. Graph mining: Laws, generators, and algorithms. ACM Comput. Surv., 38(1), 2006.

Digital Library

[13]

C. Cooper and M. Zito. Realistic synthetic data for testing association rule mining algorithms for market basket databases. In J. N. Kok, J. Koronacki, R. L. de M_antaras, S. Matwin, D. Mladenic, and A. Skowron, editors, PKDD, volume 4702 of Lecture Notes in Computer Science, pages 398--405. Springer, 2007.

Digital Library

[14]

F. Flouvat, F. D. Marchi, and J.-M. Petit. A new classification of datasets for frequent itemsets. J. Intell. Inf.Syst., 34(1):1--19, 2010.

Digital Library

[15]

A. Freno, M. Keller, and M. Tommasi. Fiedler random fields: A large-scale spectral approach to statistical network modeling. In P. L. Bartlett, F. C. N. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, NIPS, pages 1871--1879, 2012.

[16]

F. Geerts, B. Goethals, and J. V. den Bussche. Tight upper bounds on the number of candidate patterns. ACM Trans. Database Syst., 30(2):333--363, 2005.

Digital Library

[17]

S. Ghazizadeh and S. S. Chawathe. Seus: Structure extraction using summaries. In S. Lange, K. Satoh, and C. H. Smith, editors, Discovery Science, volume 2534 of Lecture Notes in Computer Science, pages 71--85. Springer, 2002.

Digital Library

[18]

A. Gionis, H.Mannila, T. Mielikäinen, and P. Tsaparas. Assessing data mining results via swap randomization. TKDD, 1(3), 2007.

Digital Library

[19]

B. Goethals and M. J. Zaki, editors. FIMI '03, Frequent Itemset Mining Implementations, Proceedings of the ICDM 2003 Workshop on Frequent Itemset Mining Implementations, 19 December 2003, Melbourne, Florida, USA, volume 90 of CEUR Workshop Proceedings. CEUR-WS.org, 2003.

[20]

K. Gouda and M. J. Zaki. Genmax: An e_cient algorithm for mining maximal frequent itemsets. Data Min. Knowl. Discov., 11(3):223--242, 2005.

Digital Library

[21]

J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M. Hsu. Freespan: frequent pattern-projected sequential pattern mining. In R. Ramakrishnan, S. J. Stolfo, R. J. Bayardo, and I. Parsa, editors, KDD, pages 355--359. ACM, 2000.

Digital Library

[22]

J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In W. Chen, J. F. Naughton, and P. A. Bernstein, editors, SIGMOD Conference, pages 1--12. ACM, 2000.

Digital Library

[23]

J. Han, B. W. Wah, V. Raghavan, X. Wu, and R. Rastogi, editors. Fifth IEEE International Conference on Data Mining, Houston, Texas, USA, Nov. 2005. IEEE.

[24]

J. Huan, W. Wang, and J. Prins. E_cient mining of frequent subgraphs in the presence of isomorphism. In ICDM, pages 549--552. IEEE Computer Society, 2003.

Digital Library

[25]

A. Inokuchi, T. Washio, and H. Motoda. An aprioribased algorithm for mining frequent substructures from graph data. In D. A. Zighed, H. J. Komorowski, and J. M. Zytkow, editors, PKDD, volume 1910 of Lecture Notes in Computer Science, pages 13--23. Springer, 2000.

Digital Library

[26]

A. Inokuchi, T. Washio, K. Nishimura, and H. Motoda. A fast algorithm for mining frequent connected subgraphs. Technical report, IBM Research, 2002.

[27]

E. F. V. J. J. Down. A plant-wide industrial process control problem. Computers & Chemical Engineering, 17(3):245--255, March 1993.

[28]

U. Kang, C. E. Tsourakakis, and C. Faloutsos. Pegasus: A peta-scale graph mining system. In W. Wang, H. Kargupta, S. Ranka, P. S. Yu, and X. Wu, editors, ICDM, pages 229--238. IEEE Computer Society, 2009.

Digital Library

[29]

E. Keogh, Q. Zhu, B. Hu, Y. Hao, X. Xi, L. Wei, and C. A. Ratanamahatana. The UCR time series classification/clustering homepage, 2011.

[30]

M. Kuramochi and G. Karypis. Frequent subgraph discovery. In N. Cercone, T. Y. Lin, and X. Wu, editors, ICDM, pages 313--320. IEEE Computer Society, 2001.

Digital Library

[31]

M. Kuramochi and G. Karypis. Finding frequent patterns in a large sparse graph*. Data Min. Knowl. Discov., 11(3):243--271, 2005.

Digital Library

[32]

S. Laxman, P. S. Sastry, and K. P. Unnikrishnan. Discovering frequent episodes and learning hidden markov models: A formal connection. IEEE Trans. Knowl. Data Eng., 17(11):1505--1517, 2005.

Digital Library

[33]

J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney. Statistical properties of community structure in large social and information networks. In J. Huai, R. Chen, H.-W. Hon, Y. Liu, W.-Y. Ma, A. Tomkins, and X. Zhang, editors, WWW, pages 695--704. ACM, 2008.

Digital Library

[34]

L. Lhote, F. Rioult, and A. Soulet. Average number of frequent (closed) patterns in bernouilli and markovian databases. In Han et al. {23}, pages 713--716.

Digital Library

[35]

M. Mampaey and J. Vreeken. Summarizing categorical data by clustering attributes. Data Min. Knowl. Discov., 26(1):130--173, 2013.

Digital Library

[36]

M. Mampaey, J. Vreeken, and N. Tatti. Summarizing data succinctly with the most informative itemsets. TKDD, 6(4):16, 2012.

Digital Library

[37]

H. Mannila and H. Toivonen. Discovering frequent episodes in sequences. In Proceedings of the First Inter- national Conference on Knowledge Discovery and Data Mining (KDD'95), pages 210--215. AAAI Press, 1995.

[38]

A. U. Matthijs van Leeuwen. Fast estimation of the pattern frequency spectrum.

[39]

R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon. Network motifs: simple building blocks of complex networks. Science, 298(5594):824--827, 2002.

[40]

S. Nijssen and J. Kok. Frequent subgraph miners: runtimes don't say everything. In T. Gärtner, G. Garriga, and T. Meinl, editors, Proceedings of the Workshop on Mining and Learning with Graphs, pages 173--180, 2006.

[41]

G. K. Orman, V. Labatut, and H. Cheri_. Qualitative comparison of community detection algorithms. In H. Cheri_, J. M. Zain, and E. El-Qawasmeh, editors, DICTAP (2), volume 167 of Communications in Computer and Information Science, pages 265--279. Springer, 2011.

[42]

G. K. Orman, V. Labatut, and H. Cheri_. Towards realistic arti_cial benchmark for community detection algorithms evaluation. IJWBC, 9(3):349--370, 2013.

Digital Library

[43]

P. Palmerini, S. Orlando, and R. Perego. Statistical properties of transactional databases. In H. Haddad, A. Omicini, R. L. Wainwright, and L. M. Liebrock, editors, SAC, pages 515--519. ACM, 2004.

Digital Library

[44]

J. Pei, J. Han, and R. Mao. Closet: An e_cient algorithm for mining frequent closed itemsets. In ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pages 21--30, 2000.

[45]

J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M. Hsu. Prefixspan: Mining sequential patterns by pre_x-projected growth. In D. Georgakopoulos and A. Buchmann, editors, ICDE, pages 215--224. IEEE Computer Society, 2001.

Digital Library

[46]

Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, 2006.

[47]

D. M. Pennock and Q. F. Stout. Exploiting a theory of phase transitions in three-satis_ability problems. In AAAI/IAAI, Vol. 1, pages 253--258, 1996.

Digital Library

[48]

B. A. Prakash, J. Vreeken, and C. Faloutsos. E_ciently spotting the starting points of an epidemic in a large graph. Knowl. Inf. Syst., 38(1):35--59, 2014.

[49]

G. Ramesh, W. Maniatty, and M. J. Zaki. Feasible itemset distributions in data mining: theory and application. In PODS, pages 284--295. ACM, 2003.

Digital Library

[50]

G. Ramesh, M. J. Zaki, and W. Maniatty. Distributionbased synthetic database generation techniques for itemset mining. In IDEAS, pages 307--316. IEEE Computer Society, 2005.

Digital Library

[51]

R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In P. M. G. Apers, M. Bouzeghoub, and G. Gardarin, editors, EDBT, volume 1057 of Lecture Notes in Computer Science, pages 3--17. Springer, 1996.

Digital Library

[52]

N. Tatti and J. Vreeken. Discovering descriptive tile trees - by mining optimal geometric subtiles. In P. A. Flach, T. D. Bie, and N. Cristianini, editors, Machine Learning and Knowledge Discovery in Databases - Eu- ropean Conference, ECML PKDD 2012, Bristol, UK, September 24-28, 2012. Proceedings, Part I, volume 7523 of Lecture Notes in Computer Science, pages 9--24. Springer, 2012.

[53]

N. Tatti and J. Vreeken. The long and the short of it: summarising event sequences with serial episodes. In Q. Yang, D. Agarwal, and J. Pei, editors, The 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '12, Beijing, China, August 12-16, 2012, pages 462--470. ACM, 2012.

Digital Library

[54]

J. Vanschoren, H. Blockeel, B. Pfahringer, and G. Holmes. Experiment databases - A new way to share, organize and learn from experiments. Machine Learning, 87(2):127--158, 2012.

Digital Library

[55]

J. Vreeken, M. van Leeuwen, and A. Siebes. Preserving privacy through data generation. In N. Ramakrishnan and O. Zaane, editors, ICDM, pages 685--690. IEEE Computer Society, 2007.

Digital Library

[56]

J. Vreeken, M. van Leeuwen, and A. Siebes. Krimp: mining itemsets that compress. Data Min. Knowl. Discov., 23(1):169--214, 2011.

Digital Library

[57]

G. I. Webb. Discovering signi_cant patterns. Machine Learning, 68(1):1--33, 2007.

Digital Library

[58]

G. I. Webb and J. Vreeken. E_cient discovery of the most interesting associations. Transactions on Knowl- edge Discovery from Data, 8(3):15--1, 2014.

Digital Library

[59]

M. Wörlein, T. Meinl, I. Fischer, and M. Philippsen. A quantitative comparison of the subgraph miners mofa, gspan, _sm, and gaston. In A. Jorge, L. Torgo, P. Brazdil, R. Camacho, and J. Gama, editors, PKDD, pages 392--403. Springer, 2005.

Digital Library

[60]

X. Yan and J. Han. gspan: Graph-based substructure pattern mining. In ICDM, pages 721--724. IEEE Computer Society, 2002.

Digital Library

[61]

M. J. Zaki. Scalable algorithms for association mining. IEEE Trans. Knowl. Data Eng., 12(3):372--390, 2000.

Digital Library

[62]

M. J. Zaki. Spade: An e_cient algorithm for mining frequent sequences. Machine Learning, 42(1/2):31--60, 2001.

Digital Library

[63]

M. J. Zaki. E_ciently mining frequent trees in a forest. In KDD, pages 71--80. ACM, 2002.

Digital Library

[64]

M. J. Zaki and C.-J. Hsiao. Charm: An e_cient algorithm for closed itemset mining. In R. L. Grossman, J. Han, V. Kumar, H. Mannila, and R. Motwani, editors, SDM. SIAM, 2002.

[65]

Z. Zheng, R. Kohavi, and L. Mason. Real world performance of association rule algorithms. In KDD, pages 401--406, 2001.

Digital Library

[66]

A. Zimmermann. Objectively evaluating condensed representations and interestingness measures for frequent itemset mining. Journal of Intelligent Information Systems, pages 1--19, 2013.

[67]

A. Zimmermann. Understanding episode mining techniques: Benchmarking on diverse, realistic, arti_cial data. Intell. Data Anal., 18(5):761--791, 2014.

Digital Library

[68]

A. Zimmermann and B. Bringmann. Ctc - correlating tree patterns for classification. In Han et al. {23}, pages 833--836.

Digital Library

Cited By

Riondato M(2023)Statistically-Sound Knowledge Discovery from Data: Challenges and Directions2023 IEEE 5th International Conference on Cognitive Machine Intelligence (CogMI)10.1109/CogMI58952.2023.00023(97-102)Online publication date: 1-Nov-2023
https://doi.org/10.1109/CogMI58952.2023.00023
Abuissa MLee ARiondato M(2023)ROhAN: Row-order agnostic null models for statistically-sound knowledge discoveryData Mining and Knowledge Discovery10.1007/s10618-023-00938-437:4(1692-1718)Online publication date: 6-May-2023
https://dl.acm.org/doi/10.1007/s10618-023-00938-4
Medenou Choumanof RLlopis Sanchez SCalzado Mayo VGarcia Balufo MPáramo Castrillo MGonzález Garrido FLuis Martinez ANevado Catalán DHu ARodríguez-Bermejo DPasqual de Riquelme GSotelo Monge MBerardi ADe Santis PTorelli FMaestre Vidal J(2022)Introducing the CYSAS-S3 Dataset for Operationalizing a Mission-Oriented Cyber Situational AwarenessSensors10.3390/s2214510422:14(5104)Online publication date: 7-Jul-2022
https://doi.org/10.3390/s22145104
Show More Cited By

Index Terms

The Data Problem in Data Mining

Recommendations

Mining uncertain data

As an important data mining and knowledge discovery task, association rule mining searches for implicit, previously unknown, and potentially useful pieces of information—in the form of rules revealing associative relationships—that are embedded in the ...
Mining fuzzy specific rare itemsets for education data

Association rule mining is an important data analysis method for the discovery of associations within data. There have been many studies focused on finding fuzzy association rules from transaction databases. Unfortunately, in the real world, one may ...
Discovery of unexpected patterns in data mining applications

Comments

Information & Contributors

Information

Published In

cover image ACM SIGKDD Explorations Newsletter

ACM SIGKDD Explorations Newsletter Volume 16, Issue 2

December 2014

49 pages

ISSN:1931-0145

EISSN:1931-0153

DOI:10.1145/2783702

Editors:
Charu Aggarwal
IBM T.J. Watson
,
Haixun Wang
Google
,
Ankur M. Teredesai
University of Washington Tacoma
,
Hanghang Tong
Arizona State University

Issue’s Table of Contents

Copyright © 2015 Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 21 May 2015

Published in SIGKDD Volume 16, Issue 2

Check for updates

Qualifiers

Column

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

8
Total Citations
View Citations
297
Total Downloads

Downloads (Last 12 months)6
Downloads (Last 6 weeks)0

Reflects downloads up to 24 Jan 2025

Other Metrics

View Author Metrics

Citations

Cited By

Riondato M(2023)Statistically-Sound Knowledge Discovery from Data: Challenges and Directions2023 IEEE 5th International Conference on Cognitive Machine Intelligence (CogMI)10.1109/CogMI58952.2023.00023(97-102)Online publication date: 1-Nov-2023
https://doi.org/10.1109/CogMI58952.2023.00023
Abuissa MLee ARiondato M(2023)ROhAN: Row-order agnostic null models for statistically-sound knowledge discoveryData Mining and Knowledge Discovery10.1007/s10618-023-00938-437:4(1692-1718)Online publication date: 6-May-2023
https://dl.acm.org/doi/10.1007/s10618-023-00938-4
Medenou Choumanof RLlopis Sanchez SCalzado Mayo VGarcia Balufo MPáramo Castrillo MGonzález Garrido FLuis Martinez ANevado Catalán DHu ARodríguez-Bermejo DPasqual de Riquelme GSotelo Monge MBerardi ADe Santis PTorelli FMaestre Vidal J(2022)Introducing the CYSAS-S3 Dataset for Operationalizing a Mission-Oriented Cyber Situational AwarenessSensors10.3390/s2214510422:14(5104)Online publication date: 7-Jul-2022
https://doi.org/10.3390/s22145104
Vidal JMonge MMonterrubio S(2020)Anomaly-Based Intrusion DetectionHandbook of Research on Machine and Deep Learning Applications for Cyber Security10.4018/978-1-5225-9611-0.ch010(195-218)Online publication date: 2020
https://doi.org/10.4018/978-1-5225-9611-0.ch010
Maestre Vidal JMonge MMonterrubio S(2019)EsPADA: Enhanced Payload Analyzer for malware Detection robust against Adversarial threatsFuture Generation Computer Systems10.1016/j.future.2019.10.022Online publication date: Oct-2019
https://doi.org/10.1016/j.future.2019.10.022
Zimmermann A(2019)Method evaluation, parameterization, and result validation in unsupervised data mining: A critical surveyWIREs Data Mining and Knowledge Discovery10.1002/widm.133010:2Online publication date: 29-Jul-2019
https://doi.org/10.1002/widm.1330
Campos GZimek ASander JCampello RMicenková BSchubert EAssent IHoule M(2016)On the evaluation of unsupervised outlier detectionData Mining and Knowledge Discovery10.1007/s10618-015-0444-830:4(891-927)Online publication date: 1-Jul-2016
https://dl.acm.org/doi/10.1007/s10618-015-0444-8
Schubert EWeiler MZimek A(2015)Outlier Detection and Trend DetectionProceedings of the 2015 IEEE International Conference on Data Mining Workshop (ICDMW)10.1109/ICDMW.2015.79(40-46)Online publication date: 14-Nov-2015
https://dl.acm.org/doi/10.1109/ICDMW.2015.79

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Issue’s Table of Contents