1. Introduction
Financial fraud, broadly defined as the intentional act of deceiving people or entities for monetary gain, is a grave issue that poses threats to individuals, corporations, and the global economic system [1]. Such deceitful actions often leverage intricate financial terminology and language to hide fraudulent activities. These terms, although crafted for specificity and legal clarity in financial domains, can be twisted to generate ambiguity, complicating fraud detection processes. Given the variety of financial instruments and the multifaceted nature of economic activities, fraud can manifest in numerous ways. This variability demands a flexible and encompassing approach to detection. Specifically, the challenge lies in designing mathematical models that can accurately parse, classify, and interpret the nuances of financial language [2,3,4]. However, current models, as indicated by [5], exhibit shortcomings in accurately classifying these terms, leading to potential blind spots in fraud detection efforts. It is essential to understand that financial fraud detection is primarily a data-driven exercise. Efficient detection relies on mathematical models that can sift through vast amounts of data, recognize patterns, and flag potential anomalies indicative of fraud. The integrity of financial markets hinges on the capacity to improve these models [6], ensuring the safety of investments [7,8]. As such, the central aim of this paper is to propose a more refined model for the precise classification of financial terms, leading to enhanced fraud detection capabilities.
Natural language processing (NLP) models, such as those noted in [9,10,11,12], have demonstrated significant utility in various domains. For instance, models employed for sentiment analysis [13,14,15,16] have achieved breakthroughs in deriving insights from subjective information in text, while those used for text summarization [17] have demonstrated capabilities to distill complex narratives into concise summaries. However, when these powerful models are applied to the domain of finance, their performance tends to be less impressive due to the specialized and complex nature of financial language. One key issue lies in their inconsistent ability to correctly interpret and classify financial terminologies, which often possess unique connotations that differ from their ordinary language counterparts. Additionally, these models often face challenges in dealing with the subtlety of deceptive wording in financial fraud, largely due to the sophisticated tactics used to disguise fraudulent intent.
Delving into specifics, key challenges with current NLP models significantly hinder their effectiveness in financial term classification and fraud detection. First, existing loss function optimization strategies often result in a slow convergence rate during model training, extending the model training duration and reducing the process's overall efficiency. Secondly, the extraction precision of key financial terms is a fundamental issue. Current NLP models frequently struggle with correctly identifying and emphasizing these critical terms due to their intricate and domain-specific meanings, leading to potential inaccuracies in detecting fraudulent information. Lastly, the considerable size and complexity of these models pose a significant constraint on their deployability and efficiency in resource-limited environments. These challenges underscore the need for an improved model optimized for financial term classification with enhanced loss function convergence, more accurate extraction of key financial terms, and reduced model complexity for efficient deployment. This research is aimed at addressing these pressing needs in the field of NLP applied to finance.

In this work, we present a cutting-edge NLP model tailored for the intricate nuances of the financial domain, aiming to identify and discern potential fraudulent activities embedded within financial texts. At the core of our contributions is FinChain-BERT, a model crafted explicitly for pinpointing deceptive financial terminologies. The distinguishing features and contributions of this research are as follows:
- 1. FinChain-BERT: An avant-garde model uniquely positioned to recognize financial fraud terms, underscoring our commitment to advancing the precision in the realm of fraud detection.
- 2. Advanced Optimization: By integrating the Stable-Momentum Adam Optimizer, we have significantly accelerated the convergence of the loss function, enhancing the model's learning efficiency.
- 3. Fraud Focus Filter: This specially curated filter zeroes in on vital financial terms, ensuring that the model's attention is consistently directed towards potentially deceptive indicators.
- 4. Keywords Loss Function: A novel loss calculation approach that attributes heightened significance to essential financial terms, ensuring the model is finely attuned to subtleties that might otherwise be overlooked.
- 5. Efficient Model Lightening with Int-Distillation: Through meticulous integer computation and strategic pruning of network layers, we have streamlined the model, bolstering its adaptability and scalability without compromising on performance.
- 6. Custom-Built Dataset Contribution: Drawing from our meticulous data collection methodology, we have supplemented our research with a high-quality, self-curated dataset, reinforcing the model's understanding of real-world financial intricacies and scenarios.
Together, these innovations position FinChain-BERT as a formidable asset in the ongoing battle against financial fraud. By elevating the precision of fraud detection, we hope to substantially mitigate the inherent risks tied to financial malpractices, fostering a safer and more transparent financial landscape. The rest of this paper is organized as follows: Section 2 provides a review of the existing literature on NLP techniques for financial data comprehension. Section 3 describes the materials used in this paper and the processes employed. Section 4 details the design of our improved NLP model. Section 5 presents a comprehensive evaluation of the model's performance and discusses the experimental results. Finally, Section 6 concludes the paper and suggests directions for future research.
2. Related Work
Financial fraud detection is an ever-evolving challenge, intricately woven with the increasing complexity of financial systems and the expanding magnitude of data these systems produce. This growing realm of financial data, ranging from transactional records to textual information such as financial reports, necessitates sophisticated tools for analysis and comprehension.
Natural language processing has risen as a beacon in this storm of data, offering the promise of extracting meaningful insights from raw, unstructured text. In the early days of NLP, the predominant methodologies were rooted in statistical learning and rule-based systems. These approaches were grounded in manually crafted rules or simple statistical measures that would identify patterns in data. While these methods provided foundational insights, their static nature often fell short in detecting sophisticated financial fraud tactics that evolved over time. The limitations of these traditional approaches paved the way for more dynamic and adaptive models based on deep learning and neural networks. These models brought the advantage of learning representations directly from data, negating the need for manual feature engineering. At the forefront of these models are Recurrent Neural Networks (RNN) [18], Long Short-Term Memory (LSTM) networks [19], and more recently, the Transformer models, which have garnered significant attention for their effectiveness in various NLP tasks.
- 1. RNNs offered the first glimpse into processing sequences in data, an essential feature for understanding time-bound financial transactions. By retaining memory from previous inputs, RNNs provided a way to establish continuity in data, a crucial aspect for tracking fraudulent activities over a period.
- 2. LSTMs, an evolution over RNNs, tackled some of the RNN's inherent limitations, specifically their struggle with long-term dependencies. In the vast temporal landscapes of financial transactions, the ability to remember events from the distant past (such as a suspicious transaction from months ago) can be critical in spotting fraudulent activities.
- 3. Transformer models, on the other hand, introduced a paradigm shift by eliminating the sequential nature of processing data. Their emphasis on attention mechanisms enabled them to capture relationships in data irrespective of the distance between elements, offering a robust model for detecting intricate fraud patterns spread across vast datasets.
In this section, we will delve deeper into the mathematical intricacies of these models, exploring their working principles. More importantly, we will contextualize their applications in the realm of financial fraud detection, highlighting both their successes and the challenges that persist.
2.1. Recurrent Neural Network and Its Relevance in Financial Fraud Detection
RNNs, with their unique capability to handle sequential data, found rapid adoption in various financial tasks. Their innate ability to memorize past data sequences made them particularly relevant for tasks such as predicting stock prices, analyzing market trends, and crucially, fraud detection. Financial transactions, by their very nature, follow a sequence. A suspicious transaction often is not an isolated event; it is typically the culmination of a series of prior transactions. RNNs, with their memory function, can trace back these transactions, allowing analysts to identify patterns that might point to fraudulent activities. For instance, a sudden surge in transaction volume, when observed in the context of past activities, can be a flag indicating potential fraud.
The capability of RNNs to process sequential data is accomplished through the following recursive formulas:

$$h_t = f(W_{xh} x_t + W_{hh} h_{t-1} + b_h), \qquad y_t = W_{hy} h_t + b_y$$

Here, $h_t$ represents the hidden state at time step $t$, $x_t$ is the input at time step $t$, $y_t$ is the output, the $W$ and $b$ terms are model parameters, and $f$ is an activation function such as sigmoid or tanh.
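To make the recurrence concrete, here is a minimal NumPy sketch of one forward pass; the layer sizes, random initialization, and the tanh activation are illustrative assumptions rather than a configuration from the literature.

```python
# A minimal sketch of the RNN recurrence defined above; all sizes are assumed.
import numpy as np

hidden, inp, out = 64, 32, 2                     # assumed dimensions
W_xh = np.random.randn(hidden, inp) * 0.01       # input-to-hidden weights
W_hh = np.random.randn(hidden, hidden) * 0.01    # hidden-to-hidden weights
W_hy = np.random.randn(out, hidden) * 0.01       # hidden-to-output weights
b_h, b_y = np.zeros(hidden), np.zeros(out)

def rnn_forward(xs):
    """Apply h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h) and y_t = W_hy h_t + b_y."""
    h = np.zeros(hidden)
    ys = []
    for x in xs:                                 # one input vector per time step
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)   # hidden state carries memory
        ys.append(W_hy @ h + b_y)                # per-step output
    return ys, h

outputs, final_state = rnn_forward([np.random.randn(inp) for _ in range(5)])
```

Because each $h_t$ depends on $h_{t-1}$, gradients must flow back through every time step, which is the root of the vanishing and exploding gradient issues discussed below.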
In the realm of fraud detection, RNNs have been successfully deployed in monitoring banking transactions in real time. They have shown a marked improvement in detection accuracy over traditional rule-based systems, especially when dealing with sophisticated fraud tactics that slowly siphon money over an extended period. The model's capability to remember past sequences, and consequently, recognize anomalous behavior patterns, has been instrumental in these successes.

However, the same memory function of RNNs, while being a strength, is also a source of their primary challenge. As sequences grow longer, RNNs suffer from vanishing and exploding gradient problems. In financial contexts, where data sequences span months or even years, this becomes a tangible concern. This means that in scenarios where recognizing a long-term pattern is critical to detecting fraud, RNNs might fall short. Furthermore, their sequential nature means that processing vast datasets can be time-consuming, which is less than ideal in real-time fraud detection scenarios where timely intervention is paramount.
In conclusion, while RNNs have significantly advanced the capabilities of fraud detection systems, their inherent limitations necessitate further improvements or hybrid models that combine the strengths of various neural network architectures to better cater to the dynamic landscape of financial fraud.
3. Materials
3.1. Data Collection
Data for this research were primarily derived from two sources. The first component of the dataset utilized in this research originates from a competition on the determination of negative financial information and its corresponding entities, co-hosted by the China Computer Federation (CCF) and the National Internet Emergency Center [21], as shown in Figure 1. The competition primarily consists of two subtasks:
- 1. Negative Information Determination: Given a piece of financial text and a list of financial entities appearing in the text, the first task is to ascertain whether the text contains negative information regarding the financial entities. If the text does not encapsulate any negative details, or if it contains negativity that does not pertain to any financial entities, the outcome is determined as 0.
- 2. Entity Determination: Upon the detection of negative financial information concerning an entity in the first subtask, this task further discerns which entities from the list are the subjects of the negative information.
To ensure the reliability and precision of the data, the organizers of the competition invited professionals for standardized data annotation. This annotation process rigorously followed high-quality standards, aligning with principles of data ethics to guarantee annotation consistency, accuracy, and reliability. Additionally, this dataset has been reviewed and meets ethical requirements, including participant privacy protection, data transparency, and fairness. The data sourced for this research are openly accessible and have been used in strict adherence to copyright norms. We have made certain that all procedures of data procurement and processing are within ethical permissions and respect the rights of the original data providers.
In Figure 1, we present examples of the complaint texts that serve as the input data, $X$, for training our model. Each complaint text is paired with an output label, $Y$, indicating the category of the complaint, such as "fabrication", "remittance", or "realization". The specific mapping rule of the given example in Figure 1 is shown in Table 1.
Our model can be described by the function $Y = f(X) + e$, where $f$ is the machine learning algorithm trained to map complaint texts ($X$) to their respective categories ($Y$), and $e$ is the error term capturing any deviations from this mapping.
The dataset provided is split into a training set and a test set, with each set containing 10,000 instances, making a total of 20,000 data points. Each data instance in this Chinese natural language text dataset represents financial network text, comprising a title and content. Associated with each text are the entities that appear within it. All training data are given in CSV format. Each record includes the fields ID, title, text, entity, negative, and key entity. Different fields are separated by English commas, and entities within the same field are separated by English semicolons. Either the title or the text field must be filled; some of the Weibo-sourced data have the same "title" and "text". The value of "negative" is 0 or 1, where 1 stands for negative and 0 for non-negative.

Leveraging this public competition dataset, which targets event subject extraction in the financial sector, carries significant advantages for this research. First, the dataset is specifically extracted for the financial sector and is thus highly relevant to the subject and content of the research. It encompasses numerous significant events in the financial sector and contains many practical cases, aiding the comprehension and mastery of event features and patterns in the financial sector. Furthermore, professional personnel have curated and annotated this dataset, ensuring its high quality and consistency, which is vital for model training and verification.
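As a concrete illustration of this record layout, the following sketch loads and normalizes such a file with pandas; the file name train.csv is hypothetical, and the column names simply mirror the field list above rather than the competition's exact schema.

```python
# Hedged sketch: parse records shaped like the described CSV
# (ID, title, text, entity, negative, key entity); the file name is hypothetical.
import pandas as pd

df = pd.read_csv("train.csv")  # fields are separated by English commas

# Entities within a single field are separated by English semicolons.
df["entity"] = df["entity"].fillna("").str.split(";")
df["key_entity"] = df["key_entity"].fillna("").str.split(";")

# "negative" is 1 for negative texts and 0 for non-negative ones.
negatives = df[df["negative"] == 1]

# Either title or text must be filled; fall back to the title when text is empty
# (some Weibo-sourced rows repeat the same string in both fields).
df["content"] = df["text"].fillna(df["title"])
```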
Apart from the open dataset, a large volume of text data was collected from the internet, primarily consisting of news reports, blog articles, social media posts, etc., encompassing numerous themes and directions in the financial sector. The content, source, and annotation standards and methods of this self-built data were aligned with the aforementioned competition’s open dataset to ensure consistency and high quality across all data points. Data were obtained from various news websites, financial information sites, and social media platforms using web scraping techniques. Subsequent to the collection, the data underwent a thorough cleaning and pre-processing phase, which entailed the removal of irrelevant content, duplicate data, and the normalization of text. This ensured the cleanliness and quality of the data were maintained. Necessary data annotation was conducted in line with the standards set by the open dataset, facilitating seamless model training and testing. It is crucial to note that throughout the data collection from the internet, strict adherence was maintained to data and privacy protection laws and regulations. Respecting each data source website’s user agreements was paramount, ensuring that all data collection practices were legal and compliant.
In conclusion, by integrating the public competition dataset and internet-collected data, a comprehensive dataset was constructed with both professional, standard data, and real-time, multi-perspective data, significantly propelling this research. During the model’s training and testing process, the effectiveness and quality of this dataset have been fully validated.
4. Proposed Method
4.2. The FinChain-BERT Model
In the realm of financial fraud detection, the BERT model, recognized for its powerful capabilities in natural language processing, is widely employed. However, as the BERT model primarily focuses on the linear relationship between words when processing input text, it often fails to fully capture complex semantic relationships. Consequently, a novel model, the FinChain-BERT model, is introduced herein.
The core innovation of the FinChain-BERT model is the dissection of input text into a tree-like structure, named "Chains", for information propagation and learning, as shown in Figure 2. This structure allows the model to comprehend complex semantic relationships in the input text more accurately, thereby enhancing the model's precision in financial fraud detection. In the FinChain-BERT model, input text is initially segmented into several "Chains", with each "Chain" being a sequence of words representing a clause or a semantic unit from the text. A BERT model is run separately on each "Chain", acquiring the context representation for each word. Following this, the context representations of all "Chains" are amalgamated to form a comprehensive context representation. In this process, the tree-like structure of the "Chains" ensures the interchange of information between different "Chains", aiding the model's understanding of intricate semantic relationships.
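To make this pipeline concrete, the sketch below encodes each "Chain" with a shared BERT encoder; splitting on clause punctuation and the bert-base-chinese checkpoint are illustrative assumptions, not the paper's exact segmentation rule or configuration.

```python
# Hedged sketch of per-"Chain" encoding: split the input into clause-level
# "Chains", then run a shared BERT over each one for contextual representations.
import re
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
encoder = BertModel.from_pretrained("bert-base-chinese")

def encode_chains(text: str) -> list[torch.Tensor]:
    # Illustrative segmentation: split on Chinese clause punctuation.
    chains = [c for c in re.split(r"[，。；！？]", text) if c.strip()]
    representations = []
    for chain in chains:
        inputs = tokenizer(chain, return_tensors="pt", truncation=True)
        with torch.no_grad():
            output = encoder(**inputs)
        # One (L_i, hidden) matrix of contextual word representations per "Chain".
        representations.append(output.last_hidden_state.squeeze(0))
    return representations
```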
Specifically, suppose the input text is segmented into $N$ "Chains", with each "Chain" denoted as a sequence of words $C_i = (w_{i,1}, w_{i,2}, \ldots, w_{i,L_i})$, where $L_i$ represents the length of the $i$th "Chain". For each "Chain", a BERT model is employed to obtain the context representation of each word, $H_i = (h_{i,1}, h_{i,2}, \ldots, h_{i,L_i})$, where $h_{i,j}$ represents the context of word $w_{i,j}$. Following this, the context representations of all "Chains" are amalgamated into a comprehensive context representation $H$. During the consolidation of all "Chains" context representations, a special attention mechanism is incorporated to calculate the weight of each "Chain". This mechanism, termed "Chains Attention", differs from traditional attention mechanisms as it considers not only the relationship between "Chains" but also their relevance to the specific task. The calculation formula for "Chains Attention" is as follows:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

where $Q$, $K$, and $V$ are the query matrix, key matrix, and value matrix, respectively, while $d_k$ denotes the dimension of the key vector. The role of the attention mechanism is to calculate a weighted sum of the input value matrix, with the weights derived from the dot product of the query and key matrices.
In the proposed model, the query matrix $Q$, key matrix $K$, and value matrix $V$ are all generated from the context representations of all "Chains", $H$. The query and key matrices calculate the weight of each "Chain", while the value matrix generates the final context representation. The generation process is detailed below:

$$Q = H W_Q + b_Q, \qquad K = H W_K + b_K, \qquad V = H W_V + b_V$$

where $W_Q$, $W_K$, and $W_V$ are weight matrices, while $b_Q$, $b_K$, and $b_V$ are bias terms, all learned by the model. The concept behind the FinChain-BERT model is rooted in a profound understanding of the complex semantic relationships within the text. It is acknowledged that in financial fraud detection tasks, not all information is crucial. Instead, it is these complex semantic relationships that are key determinants of model performance. Hence, a mechanism allowing the model to grasp these intricate semantic relationships is required.
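Under the equations above, a minimal PyTorch sketch of the "Chains Attention" consolidation might look as follows; representing each "Chain" by a single pooled vector is an illustrative assumption.

```python
# Hedged sketch of "Chains Attention": Q, K, and V are learned linear
# projections of the stacked chain representations H, combined by scaled
# dot-product attention as in the formula above.
import torch
import torch.nn as nn

class ChainsAttention(nn.Module):
    def __init__(self, hidden: int):
        super().__init__()
        self.proj_q = nn.Linear(hidden, hidden)  # learns W_Q and b_Q
        self.proj_k = nn.Linear(hidden, hidden)  # learns W_K and b_K
        self.proj_v = nn.Linear(hidden, hidden)  # learns W_V and b_V

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        # H: (num_chains, hidden), e.g., one mean-pooled vector per "Chain".
        Q, K, V = self.proj_q(H), self.proj_k(H), self.proj_v(H)
        d_k = K.size(-1)
        weights = torch.softmax(Q @ K.transpose(0, 1) / d_k ** 0.5, dim=-1)
        return weights @ V  # weighted sum over the "Chains"
```

Inspecting the rows of `weights` is also what makes the interpretability analysis described below possible: each row shows how strongly one "Chain" attends to the others.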
In practical application, the FinChain-BERT model demonstrates significant advantages. Firstly, it effectively improves the model's recognition accuracy. By processing the tree-like "Chains" structure, the model gains a more accurate understanding of complex semantic relationships in the input text, thereby more accurately identifying fraudulent financial activities. Secondly, it enhances the model's interpretability. By analyzing the weights of "Chains Attention", it becomes clear which "Chains" the model pays more attention to, greatly assisting in understanding its decision-making process. Finally, the FinChain-BERT model also improves the model's stability. As the model's attention is more focused, the output becomes more robust to minor variations in the input.
5. Results and Discussion
5.1. Fraud Detection Results
The primary objective of this experimental design is to validate the superiority of the proposed FinChain-BERT model through a comparative performance evaluation on the fraud detection task. Precision, recall, and accuracy serve as the key metrics in this experiment.
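For clarity about the metric definitions, here is a minimal scikit-learn sketch; the label arrays are placeholders, not the experimental data.

```python
# Hedged sketch: the three reported metrics, computed with scikit-learn.
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0]  # placeholder gold labels (1 = fraudulent/negative)
y_pred = [1, 0, 1, 0, 0]  # placeholder model predictions

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("accuracy: ", accuracy_score(y_true, y_pred))
```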
From Table 2, it is observed that the FinChain-BERT model achieved superior performance on all metrics. Specifically, the precision, recall, and accuracy of FinChain-BERT reached 0.97, 0.96, and 0.97, respectively, surpassing the performance of the other models. This indicates the excellent capability of FinChain-BERT in fraud detection tasks. Subsequent analysis is conducted from the perspectives of model features and mathematical theory.
Initial observations showed that the performance of the RNN and LSTM models was relatively unsatisfactory. This is mainly due to the issues of vanishing and exploding gradients when these models process long sequence data. Despite LSTM's ability to alleviate these problems to some extent through its gating mechanism, its performance still lags behind Transformer-based models when dealing with complex text data. Additionally, the inability of RNN and LSTM to effectively utilize global textual information, as they can only process text sequentially, also limited their performance.

On the other hand, Transformer-based models such as BERT, RoBERTa, ALBERT, and DistilBERT outperformed RNN and LSTM. This can be attributed to the Transformer's ability to process all inputs in parallel, capturing global textual information, and enhancing the model's ability to grasp key information through its self-attention mechanism. Specifically, BERT, by employing the masked language model and next sentence prediction as pre-training tasks, enables a more comprehensive understanding of text context and semantic relationships, thereby improving model performance. However, the BERT, RoBERTa, ALBERT, and DistilBERT models, while processing text, rely predominantly on local lexical information and global information from the entire text, but insufficiently handle complex relationships between words (such as synonyms, antonyms, and hyponyms), which might limit their performance in certain tasks.

Ultimately, the FinChain-BERT model demonstrated the best results. This can be mainly attributed to its amalgamation of BERT's strengths and the advantages of the chain structure. The chain structure allows for an explicit representation and utilization of complex relationships between words, providing the model with more semantic information and a better understanding and processing of the text. Furthermore, the chain structure's design allows the model to more effectively handle complex language phenomena, such as ambiguity and metaphor. From a mathematical perspective, the chain structure, by explicitly illustrating the relationships between words, adds additional constraints to the model input. This helps the model learn the data distribution more effectively, thereby enhancing model performance.
In conclusion, the experiment results validate the superior performance of FinChain-BERT in fraud detection tasks, primarily attributable to its unique chain structure and Transformer-based model design. This underscores the significant value of utilizing structural textual information and complex relationships between words when processing intricate text tasks.