DOI: 10.1145/1570256.1570424
tutorial

Large scale data mining using genetics-based machine learning

Published: 08 July 2009

Abstract

We are living in the petabyte era. We have ever larger data sets to analyze, process, and transform into useful answers for domain experts. Robust data mining tools that can cope with petascale volumes and/or high dimensionality while producing human-understandable solutions are key in several domain areas. Genetics-based machine learning (GBML) techniques are perfect candidates for this task, among others, thanks to recent advances in representations, learning paradigms, and theoretical modeling. If evolutionary learning techniques aspire to be relevant players in this context, they need the capacity to process these vast amounts of data, and they need to do so within reasonable time. Moreover, massive computation cycles are getting cheaper every day, giving researchers access to unprecedented degrees of parallelism. Several topics are interlaced in these two requirements, among them: (1) having the proper learning paradigms and knowledge representations, (2) understanding them and knowing when they are suitable for the problem at hand, (3) using efficiency enhancement techniques, and (4) transforming and visualizing the produced solutions to give back as much insight as possible to the domain experts.
This tutorial will try to address these requirements, following a roadmap that starts with the questions of what "large" means and why large is a challenge for data mining methods. Afterwards, we will discuss the different facets through which this challenge can be overcome: efficiency enhancement techniques, representations able to cope with high-dimensional spaces, scalability of learning paradigms, hardware solutions, parallel models, and data-intensive computing. The roadmap continues with examples of real applications of GBML systems and finishes with an analysis of further directions.

Published In

GECCO '09: Proceedings of the 11th Annual Conference Companion on Genetic and Evolutionary Computation Conference: Late Breaking Papers
July 2009
1760 pages
ISBN: 9781605585055
DOI: 10.1145/1570256

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data-intensive computing
  2. efficiency enhancement
  3. evolutionary algorithms
  4. genetics-based machine learning
  5. large-scale datasets
  6. parallel computing

Qualifiers

  • Tutorial

Conference

GECCO09: Genetic and Evolutionary Computation Conference
July 8-12, 2009
Montréal, Québec, Canada

Acceptance Rates

Overall Acceptance Rate 1,669 of 4,410 submissions, 38%

Cited By

  • (2022) Automated Hyperparameter Optimization of Gradient Boosting Decision Tree Approach for Gold Mineral Prospectivity Mapping in the Xiong’ershan Area. Minerals, 12:12 (1621). DOI: 10.3390/min12121621. Online publication date: 16-Dec-2022
  • (2020) A comparative study of Distributed Large Scale Data Mining Algorithms. BSSS Journal of Computer. DOI: 10.51767/jc1102. Online publication date: 25-May-2020
  • (2020) Classification of DNA Microarrays Using Deep Learning to identify Cell Cycle Regulated Genes. 2020 5th International Conference on Advanced Technologies for Signal and Image Processing (ATSIP), (1-5). DOI: 10.1109/ATSIP49331.2020.9231888. Online publication date: Sep-2020
  • (2019) DNA Microarray Analysis Using Machine Learning to Recognize Cell Cycle Regulated Genes. 2019 International Conference on Control, Automation and Diagnosis (ICCAD), (1-5). DOI: 10.1109/ICCAD46983.2019.9037868. Online publication date: Jul-2019
  • (2014) Large-Scale Experimental Evaluation of Cluster Representations for Multiobjective Evolutionary Clustering. IEEE Transactions on Evolutionary Computation, 18:1 (36-53). DOI: 10.1109/TEVC.2013.2281513. Online publication date: Feb-2014
  • (2013) Meta-learning for large scale machine learning with MapReduce. 2013 IEEE International Conference on Big Data, (105-110). DOI: 10.1109/BigData.2013.6691741. Online publication date: Oct-2013
