DOI: 10.1145/3688671.3688767 (SETN Conference Proceedings)

Automatic Dataset Type Recognition for Association Rule Mining

Published: 27 December 2024

Abstract

Association Rule Mining is an important subfield of data mining that extracts interesting associations between items co-occurring in database transactions. A transactions dataset may come in different forms: (a) a market basket list, where each line represents one transaction; (b) invoice details, derived directly from a company's ERP printouts; (c) a sparse matrix with as many columns as there are distinct item types considered for mining; and (d) nominal attributes, consisting mainly of categorical features. Classifying a given input into the correct dataset type is crucial for automated machine learning tasks. In this paper, we report on the development of an automatic dataset type recognition mechanism. A specialized "Dataset of Datasets" is compiled from a variety of datasets distributed by well-known repositories. Ultimately, we build a hybrid classification model consisting of a procedural programming component and a pre-trained supervised machine learning model based on the Random Forest algorithm. The classification accuracy achieved is approximately 98%. Random Forest was chosen after evaluating a number of popular machine learning algorithms, including Naïve Bayes, Decision Tree, k-Nearest Neighbors (k-NN), and SVM, as well as their variants.
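The abstract's core idea, deriving structural features from a candidate dataset and feeding them to a Random Forest classifier, can be sketched as follows. This is a minimal illustration assuming pandas and scikit-learn; the feature set in `extract_features` and the two toy tables are hypothetical, not the paper's actual features or "Dataset of Datasets".

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def extract_features(df):
    """Toy structural features for guessing a dataset's type (illustrative only)."""
    n_cols = len(df.columns)
    # Share of columns holding only 0/1 values (high for a sparse item matrix).
    binary = sum(set(df[c].dropna().unique()) <= {0, 1} for c in df.columns)
    # Share of numeric columns (low for a table of nominal attributes).
    numeric = df.select_dtypes("number").shape[1]
    return [n_cols, binary / n_cols, numeric / n_cols]

# Two toy inputs: a sparse item matrix vs. a nominal-attribute table.
sparse = pd.DataFrame({"milk": [1, 0, 1], "bread": [0, 1, 1]})
nominal = pd.DataFrame({"color": ["red", "blue", "red"],
                        "size": ["S", "M", "L"]})

X = [extract_features(sparse), extract_features(nominal)]
y = ["sparse_matrix", "nominal"]

# bootstrap=False so every tree sees both training examples in this tiny demo.
clf = RandomForestClassifier(n_estimators=10, bootstrap=False,
                             random_state=0).fit(X, y)
print(clf.predict([extract_features(sparse)])[0])  # prints "sparse_matrix"
```

In the paper's hybrid design, a procedural component would precede a model like this one, handling the cases that simple rules can decide deterministically.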


        Published In

        SETN '24: Proceedings of the 13th Hellenic Conference on Artificial Intelligence
        September 2024
        437 pages
        ISBN:9798400709821
        DOI:10.1145/3688671

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Author Tags

        1. Association Rules
        2. AutoML
        3. Dataset Types
        4. Feature Extraction
        5. Supervised Machine Learning

        Qualifiers

        • Short-paper

        Conference

        SETN 2024

