Data engineering for fraud detection

Authors: Sebastiaan Höppner, Tim Verdonck

Volume 150, Issue C

https://doi.org/10.1016/j.dss.2021.113492

Published: 01 November 2021

Abstract

Financial institutions increasingly rely on data-driven methods to develop fraud detection systems that automatically detect and block fraudulent transactions. From a machine learning perspective, detecting suspicious transactions is a binary classification problem, so many techniques can be applied. Interpretability is, however, of utmost importance, both for management to have confidence in the model and for designing fraud prevention strategies. Moreover, models that let fraud experts understand why a case is flagged as suspicious greatly facilitate their job of investigating suspicious transactions. We therefore propose several data engineering techniques that improve the performance of an analytical model while retaining its interpretability. Our data engineering process is decomposed into several feature engineering and instance engineering steps. We illustrate the performance improvement from these steps for popular analytical models on a real payment transactions data set.

Highlights

• Companies increasingly rely on data-driven methods for detecting fraud.

• Data engineering is of utmost importance for improving the performance of most machine learning models.

• Our data engineering process is decomposed into several feature and instance engineering steps.

• The benefits of data engineering are illustrated on a payment transactions data set from a large European bank.

Published In

Decision Support Systems Volume 150, Issue C

Nov 2021

137 pages

ISSN:0167-9236

Copyright © 2021.

Publisher

Elsevier Science Publishers B. V.

Netherlands
