research-article

Exploiting domain knowledge to address class imbalance and a heterogeneous feature space in multi-class classification

Published: 27 February 2023

Abstract

Real-world data in multi-class classification tasks often exhibit complex characteristics that reduce classification performance. Two major analytical challenges are a high degree of multi-class imbalance in the data and a heterogeneous feature space, which increases the number and complexity of class patterns. Existing classification and data pre-processing solutions address only one of these two challenges in isolation. We propose a novel classification approach that explicitly addresses both challenges, multi-class imbalance and a heterogeneous feature space, together. As its main contribution, this approach exploits domain knowledge in the form of a taxonomy to systematically prepare the training data. Based on an experimental evaluation on both real-world data and several synthetically generated data sets, we show that our approach outperforms any other classification technique in terms of accuracy. Furthermore, it entails considerable practical benefits in real-world use cases, e.g., it reduces the rework required in product quality control.

Published In

The VLDB Journal — The International Journal on Very Large Data Bases  Volume 32, Issue 5
Sep 2023
222 pages

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 27 February 2023
Accepted: 06 January 2023
Revision received: 21 December 2022
Received: 16 February 2022

Author Tags

  1. Classification
  2. Domain knowledge
  3. Multi-class imbalance
  4. Heterogeneous feature space

Qualifiers

  • Research-article
