research-article

Data engineering for fraud detection

Published: 01 November 2021

    Abstract

    Financial institutions increasingly rely on data-driven methods to develop fraud detection systems that automatically detect and block fraudulent transactions. From a machine learning perspective, detecting suspicious transactions is a binary classification problem, so many techniques can be applied. Interpretability is, however, of utmost importance, both for management to have confidence in the model and for designing fraud prevention strategies. Moreover, models that let fraud experts understand why a case is flagged as suspicious greatly facilitate their job of investigating suspicious transactions. We therefore propose several data engineering techniques that improve the performance of an analytical model while retaining its interpretability. Our data engineering process is decomposed into several feature and instance engineering steps. We illustrate the performance improvement from these data engineering steps for popular analytical models on a real payment transactions data set.

    Highlights

    Companies increasingly rely upon data-driven methods for detecting fraud.
    Data engineering is of utmost importance to improve the performance of most machine learning models.
    Our data engineering process is decomposed into several feature and instance engineering steps.
    The benefits of data engineering are illustrated on a payment transactions data set from a large European bank.

          Published In

          Decision Support Systems  Volume 150, Issue C
          Nov 2021
          137 pages

          Publisher

          Elsevier Science Publishers B. V.

          Netherlands

          Author Tags

          1. Decision analysis
          2. Payment transactions fraud
          3. Instance engineering
          4. Feature engineering
          5. Cost-based model evaluation
