DOI: 10.1145/3688671.3688767 (SETN Conference Proceedings)

Automatic Dataset Type Recognition for Association Rule Mining

Published: 27 December 2024

Abstract

Association Rule Mining is an important subfield of data mining that extracts interesting associations between items co-occurring in database transactions. A transactions dataset may come in different forms: (a) a market basket list, where each line represents one transaction; (b) invoice details, derived directly from a company's ERP printouts; (c) a sparse matrix with as many columns as there are distinct item types considered for mining; and (d) nominal attributes, consisting mainly of categorical features. Classifying a given input into the correct dataset type is crucial for automated machine learning tasks. In this paper, we report on the development of an automatic dataset type recognition mechanism. A specialized "Dataset of Datasets" is compiled from a variety of datasets distributed by well-known repositories. Ultimately, we build a hybrid classification model consisting of a procedural programming component and a pre-trained supervised machine learning model based on the Random Forest algorithm. The classification accuracy achieved is approximately 98%. Random Forest was chosen after evaluating a number of popular machine learning algorithms, including Naïve Bayes, Decision Tree, k-Nearest Neighbors (k-NN), and SVM, as well as their variants.
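The abstract's core idea, deriving structural features from a candidate dataset and feeding them to a Random Forest classifier, can be sketched as follows. This is a minimal illustration assuming pandas and scikit-learn; the feature set in `extract_features` and the two toy tables are hypothetical, not the paper's actual features or "Dataset of Datasets".

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def extract_features(df):
    """Toy structural features for guessing a dataset's type (illustrative only)."""
    n_cols = len(df.columns)
    # Share of columns holding only 0/1 values (high for a sparse item matrix).
    binary = sum(set(df[c].dropna().unique()) <= {0, 1} for c in df.columns)
    # Share of numeric columns (low for a table of nominal attributes).
    numeric = df.select_dtypes("number").shape[1]
    return [n_cols, binary / n_cols, numeric / n_cols]

# Two toy inputs: a sparse item matrix vs. a nominal-attribute table.
sparse = pd.DataFrame({"milk": [1, 0, 1], "bread": [0, 1, 1]})
nominal = pd.DataFrame({"color": ["red", "blue", "red"],
                        "size": ["S", "M", "L"]})

X = [extract_features(sparse), extract_features(nominal)]
y = ["sparse_matrix", "nominal"]

# bootstrap=False so every tree sees both training examples in this tiny demo.
clf = RandomForestClassifier(n_estimators=10, bootstrap=False,
                             random_state=0).fit(X, y)
print(clf.predict([extract_features(sparse)])[0])  # prints "sparse_matrix"
```

In the paper's hybrid design, a procedural component would precede a model like this one, handling the cases that simple rules can decide deterministically.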


        Published In

        SETN '24: Proceedings of the 13th Hellenic Conference on Artificial Intelligence
        September 2024
        437 pages
        ISBN:9798400709821
        DOI:10.1145/3688671

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Author Tags

        1. Association Rules
        2. AutoML
        3. Dataset Types
        4. Feature Extraction
        5. Supervised Machine Learning

        Qualifiers

        • Short-paper

        Conference

        SETN 2024

