Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
column

The Data Problem in Data Mining

Published: 21 May 2015 Publication History

Abstract

Computer science is essentially an applied or engineering science, creating tools. In Data Mining, those tools are supposed to help humans understand large amounts of data. In this position paper, I argue that for all the progress that has been made in Data Mining, in particular Pattern Mining, we are lacking insight into three key aspects: 1) How pattern mining algorithms perform quantitatively, 2) How to choose parameter settings, and 3) How to relate found patterns to the processes that generated the data. I illustrate the issue by surveying existing work in light of these concerns and pointing to the (relatively few) papers that have attempted to fill in the gaps. I argue further that progress regarding those questions is held back by a lack of data with varying, controlled properties, and that this lack is unlikely to be remedied by the ever increasing collection of real-life data. Instead, I am convinced that we will need to make a science of digital data generation, and use it to develop guidance to data practitioners.

References

[1]
Corsika - an air shower simulation program, https://web.ikp.kit.edu/corsika/.
[2]
I. Adá and M. R. Berthold. The new iris data: modular data generators. In B. Rao, B. Krishnapuram, A. Tomkins, and Q. Yang, editors, KDD, pages 413--422. ACM, 2010.
[3]
R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Databases, pages 487--499, Santiago de Chile, Chile, Sept. 1994. Morgan Kaufmann.
[4]
C. Anderson. The end of theory: The data deluge makes the scienti_c method obsolete. http://archive.wired.com/science/discoveries/ magazine/16-07/pb_theory. Accessed 08/21/2014.
[5]
R. J. Bayardo Jr., B. Goethals, and M. J. Zaki, editors. FIMI '04, Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations, Brighton, UK, November 1, 2004, 2004.
[6]
J. Besson, C. Rigotti, I. Mitasiunaite, and J.-F. Boulicaut. Parameter tuning for di_erential mining of string patterns. In ICDM Workshops, pages 77--86. IEEE Computer Society, 2008.
[7]
T. D. Bie. Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Min. Knowl. Discov., 23(3):407--446, 2011.
[8]
C. Blake and C. Merz. UCI repository of machine learning databases, 1998.
[9]
M. Boley, T. Gärtner, and H. Grosskreutz. Formal concept sampling for counting and threshold-free local pattern mining. In SDM, pages 177--188. SIAM, 2010.
[10]
M. Boley and H. Grosskreutz. A randomized approach for approximating the number of frequent sets. In ICDM, pages 43--52. IEEE Computer Society, 2008.
[11]
B. Bringmann and A. Zimmermann. Tree2 - Decision trees for tree structured data. In A. Jorge, L. Torgo, P. Brazdil, R. Camacho, and J. Gama, editors, 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, pages 46--58. Springer, 2005.
[12]
D. Chakrabarti and C. Faloutsos. Graph mining: Laws, generators, and algorithms. ACM Comput. Surv., 38(1), 2006.
[13]
C. Cooper and M. Zito. Realistic synthetic data for testing association rule mining algorithms for market basket databases. In J. N. Kok, J. Koronacki, R. L. de M_antaras, S. Matwin, D. Mladenic, and A. Skowron, editors, PKDD, volume 4702 of Lecture Notes in Computer Science, pages 398--405. Springer, 2007.
[14]
F. Flouvat, F. D. Marchi, and J.-M. Petit. A new classification of datasets for frequent itemsets. J. Intell. Inf.Syst., 34(1):1--19, 2010.
[15]
A. Freno, M. Keller, and M. Tommasi. Fiedler random fields: A large-scale spectral approach to statistical network modeling. In P. L. Bartlett, F. C. N. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, NIPS, pages 1871--1879, 2012.
[16]
F. Geerts, B. Goethals, and J. V. den Bussche. Tight upper bounds on the number of candidate patterns. ACM Trans. Database Syst., 30(2):333--363, 2005.
[17]
S. Ghazizadeh and S. S. Chawathe. Seus: Structure extraction using summaries. In S. Lange, K. Satoh, and C. H. Smith, editors, Discovery Science, volume 2534 of Lecture Notes in Computer Science, pages 71--85. Springer, 2002.
[18]
A. Gionis, H.Mannila, T. Mielikäinen, and P. Tsaparas. Assessing data mining results via swap randomization. TKDD, 1(3), 2007.
[19]
B. Goethals and M. J. Zaki, editors. FIMI '03, Frequent Itemset Mining Implementations, Proceedings of the ICDM 2003 Workshop on Frequent Itemset Mining Implementations, 19 December 2003, Melbourne, Florida, USA, volume 90 of CEUR Workshop Proceedings. CEUR-WS.org, 2003.
[20]
K. Gouda and M. J. Zaki. Genmax: An e_cient algorithm for mining maximal frequent itemsets. Data Min. Knowl. Discov., 11(3):223--242, 2005.
[21]
J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M. Hsu. Freespan: frequent pattern-projected sequential pattern mining. In R. Ramakrishnan, S. J. Stolfo, R. J. Bayardo, and I. Parsa, editors, KDD, pages 355--359. ACM, 2000.
[22]
J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In W. Chen, J. F. Naughton, and P. A. Bernstein, editors, SIGMOD Conference, pages 1--12. ACM, 2000.
[23]
J. Han, B. W. Wah, V. Raghavan, X. Wu, and R. Rastogi, editors. Fifth IEEE International Conference on Data Mining, Houston, Texas, USA, Nov. 2005. IEEE.
[24]
J. Huan, W. Wang, and J. Prins. E_cient mining of frequent subgraphs in the presence of isomorphism. In ICDM, pages 549--552. IEEE Computer Society, 2003.
[25]
A. Inokuchi, T. Washio, and H. Motoda. An aprioribased algorithm for mining frequent substructures from graph data. In D. A. Zighed, H. J. Komorowski, and J. M. Zytkow, editors, PKDD, volume 1910 of Lecture Notes in Computer Science, pages 13--23. Springer, 2000.
[26]
A. Inokuchi, T. Washio, K. Nishimura, and H. Motoda. A fast algorithm for mining frequent connected subgraphs. Technical report, IBM Research, 2002.
[27]
E. F. V. J. J. Down. A plant-wide industrial process control problem. Computers & Chemical Engineering, 17(3):245--255, March 1993.
[28]
U. Kang, C. E. Tsourakakis, and C. Faloutsos. Pegasus: A peta-scale graph mining system. In W. Wang, H. Kargupta, S. Ranka, P. S. Yu, and X. Wu, editors, ICDM, pages 229--238. IEEE Computer Society, 2009.
[29]
E. Keogh, Q. Zhu, B. Hu, Y. Hao, X. Xi, L. Wei, and C. A. Ratanamahatana. The UCR time series classification/clustering homepage, 2011.
[30]
M. Kuramochi and G. Karypis. Frequent subgraph discovery. In N. Cercone, T. Y. Lin, and X. Wu, editors, ICDM, pages 313--320. IEEE Computer Society, 2001.
[31]
M. Kuramochi and G. Karypis. Finding frequent patterns in a large sparse graph*. Data Min. Knowl. Discov., 11(3):243--271, 2005.
[32]
S. Laxman, P. S. Sastry, and K. P. Unnikrishnan. Discovering frequent episodes and learning hidden markov models: A formal connection. IEEE Trans. Knowl. Data Eng., 17(11):1505--1517, 2005.
[33]
J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney. Statistical properties of community structure in large social and information networks. In J. Huai, R. Chen, H.-W. Hon, Y. Liu, W.-Y. Ma, A. Tomkins, and X. Zhang, editors, WWW, pages 695--704. ACM, 2008.
[34]
L. Lhote, F. Rioult, and A. Soulet. Average number of frequent (closed) patterns in bernouilli and markovian databases. In Han et al. {23}, pages 713--716.
[35]
M. Mampaey and J. Vreeken. Summarizing categorical data by clustering attributes. Data Min. Knowl. Discov., 26(1):130--173, 2013.
[36]
M. Mampaey, J. Vreeken, and N. Tatti. Summarizing data succinctly with the most informative itemsets. TKDD, 6(4):16, 2012.
[37]
H. Mannila and H. Toivonen. Discovering frequent episodes in sequences. In Proceedings of the First Inter- national Conference on Knowledge Discovery and Data Mining (KDD'95), pages 210--215. AAAI Press, 1995.
[38]
A. U. Matthijs van Leeuwen. Fast estimation of the pattern frequency spectrum.
[39]
R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon. Network motifs: simple building blocks of complex networks. Science, 298(5594):824--827, 2002.
[40]
S. Nijssen and J. Kok. Frequent subgraph miners: runtimes don't say everything. In T. Gärtner, G. Garriga, and T. Meinl, editors, Proceedings of the Workshop on Mining and Learning with Graphs, pages 173--180, 2006.
[41]
G. K. Orman, V. Labatut, and H. Cheri_. Qualitative comparison of community detection algorithms. In H. Cheri_, J. M. Zain, and E. El-Qawasmeh, editors, DICTAP (2), volume 167 of Communications in Computer and Information Science, pages 265--279. Springer, 2011.
[42]
G. K. Orman, V. Labatut, and H. Cheri_. Towards realistic arti_cial benchmark for community detection algorithms evaluation. IJWBC, 9(3):349--370, 2013.
[43]
P. Palmerini, S. Orlando, and R. Perego. Statistical properties of transactional databases. In H. Haddad, A. Omicini, R. L. Wainwright, and L. M. Liebrock, editors, SAC, pages 515--519. ACM, 2004.
[44]
J. Pei, J. Han, and R. Mao. Closet: An e_cient algorithm for mining frequent closed itemsets. In ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pages 21--30, 2000.
[45]
J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M. Hsu. Prefixspan: Mining sequential patterns by pre_x-projected growth. In D. Georgakopoulos and A. Buchmann, editors, ICDE, pages 215--224. IEEE Computer Society, 2001.
[46]
Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, 2006.
[47]
D. M. Pennock and Q. F. Stout. Exploiting a theory of phase transitions in three-satis_ability problems. In AAAI/IAAI, Vol. 1, pages 253--258, 1996.
[48]
B. A. Prakash, J. Vreeken, and C. Faloutsos. E_ciently spotting the starting points of an epidemic in a large graph. Knowl. Inf. Syst., 38(1):35--59, 2014.
[49]
G. Ramesh, W. Maniatty, and M. J. Zaki. Feasible itemset distributions in data mining: theory and application. In PODS, pages 284--295. ACM, 2003.
[50]
G. Ramesh, M. J. Zaki, and W. Maniatty. Distributionbased synthetic database generation techniques for itemset mining. In IDEAS, pages 307--316. IEEE Computer Society, 2005.
[51]
R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In P. M. G. Apers, M. Bouzeghoub, and G. Gardarin, editors, EDBT, volume 1057 of Lecture Notes in Computer Science, pages 3--17. Springer, 1996.
[52]
N. Tatti and J. Vreeken. Discovering descriptive tile trees - by mining optimal geometric subtiles. In P. A. Flach, T. D. Bie, and N. Cristianini, editors, Machine Learning and Knowledge Discovery in Databases - Eu- ropean Conference, ECML PKDD 2012, Bristol, UK, September 24-28, 2012. Proceedings, Part I, volume 7523 of Lecture Notes in Computer Science, pages 9--24. Springer, 2012.
[53]
N. Tatti and J. Vreeken. The long and the short of it: summarising event sequences with serial episodes. In Q. Yang, D. Agarwal, and J. Pei, editors, The 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '12, Beijing, China, August 12-16, 2012, pages 462--470. ACM, 2012.
[54]
J. Vanschoren, H. Blockeel, B. Pfahringer, and G. Holmes. Experiment databases - A new way to share, organize and learn from experiments. Machine Learning, 87(2):127--158, 2012.
[55]
J. Vreeken, M. van Leeuwen, and A. Siebes. Preserving privacy through data generation. In N. Ramakrishnan and O. Zaane, editors, ICDM, pages 685--690. IEEE Computer Society, 2007.
[56]
J. Vreeken, M. van Leeuwen, and A. Siebes. Krimp: mining itemsets that compress. Data Min. Knowl. Discov., 23(1):169--214, 2011.
[57]
G. I. Webb. Discovering signi_cant patterns. Machine Learning, 68(1):1--33, 2007.
[58]
G. I. Webb and J. Vreeken. E_cient discovery of the most interesting associations. Transactions on Knowl- edge Discovery from Data, 8(3):15--1, 2014.
[59]
M. Wörlein, T. Meinl, I. Fischer, and M. Philippsen. A quantitative comparison of the subgraph miners mofa, gspan, _sm, and gaston. In A. Jorge, L. Torgo, P. Brazdil, R. Camacho, and J. Gama, editors, PKDD, pages 392--403. Springer, 2005.
[60]
X. Yan and J. Han. gspan: Graph-based substructure pattern mining. In ICDM, pages 721--724. IEEE Computer Society, 2002.
[61]
M. J. Zaki. Scalable algorithms for association mining. IEEE Trans. Knowl. Data Eng., 12(3):372--390, 2000.
[62]
M. J. Zaki. Spade: An e_cient algorithm for mining frequent sequences. Machine Learning, 42(1/2):31--60, 2001.
[63]
M. J. Zaki. E_ciently mining frequent trees in a forest. In KDD, pages 71--80. ACM, 2002.
[64]
M. J. Zaki and C.-J. Hsiao. Charm: An e_cient algorithm for closed itemset mining. In R. L. Grossman, J. Han, V. Kumar, H. Mannila, and R. Motwani, editors, SDM. SIAM, 2002.
[65]
Z. Zheng, R. Kohavi, and L. Mason. Real world performance of association rule algorithms. In KDD, pages 401--406, 2001.
[66]
A. Zimmermann. Objectively evaluating condensed representations and interestingness measures for frequent itemset mining. Journal of Intelligent Information Systems, pages 1--19, 2013.
[67]
A. Zimmermann. Understanding episode mining techniques: Benchmarking on diverse, realistic, arti_cial data. Intell. Data Anal., 18(5):761--791, 2014.
[68]
A. Zimmermann and B. Bringmann. Ctc - correlating tree patterns for classification. In Han et al. {23}, pages 833--836.

Cited By

View all
  • (2023)Statistically-Sound Knowledge Discovery from Data: Challenges and Directions2023 IEEE 5th International Conference on Cognitive Machine Intelligence (CogMI)10.1109/CogMI58952.2023.00023(97-102)Online publication date: 1-Nov-2023
  • (2023)ROhAN: Row-order agnostic null models for statistically-sound knowledge discoveryData Mining and Knowledge Discovery10.1007/s10618-023-00938-437:4(1692-1718)Online publication date: 6-May-2023
  • (2022)Introducing the CYSAS-S3 Dataset for Operationalizing a Mission-Oriented Cyber Situational AwarenessSensors10.3390/s2214510422:14(5104)Online publication date: 7-Jul-2022
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM SIGKDD Explorations Newsletter
ACM SIGKDD Explorations Newsletter  Volume 16, Issue 2
December 2014
49 pages
ISSN:1931-0145
EISSN:1931-0153
DOI:10.1145/2783702
Issue’s Table of Contents

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 21 May 2015
Published in SIGKDD Volume 16, Issue 2

Check for updates

Qualifiers

  • Column

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)6
  • Downloads (Last 6 weeks)0
Reflects downloads up to 24 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2023)Statistically-Sound Knowledge Discovery from Data: Challenges and Directions2023 IEEE 5th International Conference on Cognitive Machine Intelligence (CogMI)10.1109/CogMI58952.2023.00023(97-102)Online publication date: 1-Nov-2023
  • (2023)ROhAN: Row-order agnostic null models for statistically-sound knowledge discoveryData Mining and Knowledge Discovery10.1007/s10618-023-00938-437:4(1692-1718)Online publication date: 6-May-2023
  • (2022)Introducing the CYSAS-S3 Dataset for Operationalizing a Mission-Oriented Cyber Situational AwarenessSensors10.3390/s2214510422:14(5104)Online publication date: 7-Jul-2022
  • (2020)Anomaly-Based Intrusion DetectionHandbook of Research on Machine and Deep Learning Applications for Cyber Security10.4018/978-1-5225-9611-0.ch010(195-218)Online publication date: 2020
  • (2019)EsPADA: Enhanced Payload Analyzer for malware Detection robust against Adversarial threatsFuture Generation Computer Systems10.1016/j.future.2019.10.022Online publication date: Oct-2019
  • (2019)Method evaluation, parameterization, and result validation in unsupervised data mining: A critical surveyWIREs Data Mining and Knowledge Discovery10.1002/widm.133010:2Online publication date: 29-Jul-2019
  • (2016)On the evaluation of unsupervised outlier detectionData Mining and Knowledge Discovery10.1007/s10618-015-0444-830:4(891-927)Online publication date: 1-Jul-2016
  • (2015)Outlier Detection and Trend DetectionProceedings of the 2015 IEEE International Conference on Data Mining Workshop (ICDMW)10.1109/ICDMW.2015.79(40-46)Online publication date: 14-Nov-2015

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media