IT in Industry, vol. 2, no. 3, 2014
Published online 27-Oct-2014
Detecting Fraud Using Transaction Frequency Data
Roheena Khan, Andrew Clark, George Mohay, Suriadi Suriadi
Science and Engineering Faculty
Queensland University of Technology
Brisbane 4001, Australia
Abstract—Despite all attempts to prevent fraud, it continues
to be a major threat to industry and government. In this paper,
we present a fraud detection method which detects irregular
frequency of transaction usage in an Enterprise Resource
Planning (ERP) system. We discuss the design, development and
empirical evaluation of outlier detection and distance measuring
techniques to detect frequency-based anomalies within an
individual user’s profile, relative to other similar users.
Primarily, we propose three automated techniques for
detecting transaction frequency anomalies within each
transaction profile: a univariate method, the boxplot, which is
based on the sample’s median; and two multivariate methods,
which use Euclidean distance. The two multivariate approaches detect
potentially fraudulent activities by identifying: (1) users where
the Euclidean distance between their transaction-type set is above
a certain threshold and (2) users/data points that lie far apart
from other users/clusters or represent a small cluster size, using
k-means clustering. The proposed methodology allows an auditor
to investigate the transaction frequency anomalies and adjust the
different parameters, such as the outlier threshold and the
Euclidean distance threshold values to tune the number of alerts.
The novelty of the proposed technique lies in its ability to
automatically trigger alerts from transaction profiles, based on
transaction usage performed over a period of time. Experiments
were conducted using a real dataset obtained from the
production client of a large organization using SAP R/3
(presently the most predominant ERP system), to run its
business. The results of this empirical research demonstrate the
effectiveness of the proposed approach.
Keywords—anomaly detection; enterprise resource planning
systems; fraud detection
I. INTRODUCTION
Enterprise Resource Planning (ERP) systems are one of the
most important IT developments to emerge in the 1990s. More
and more organizations are now adopting ERP systems, with
most of the Fortune 1000 firms having installed ERP systems
to run their businesses [2]. An ERP system is a packaged
software solution that aims to automate and integrate the core
business processes of an organization. Whilst ERP systems
provide numerous benefits to organizations, due to their
complexity they are vulnerable to many internal and external
threats [1].
Since the advent of ERP systems, researchers have
typically focused on fraud prevention rather than detection.
Many publications have discussed fraud prevention approaches
such as role based access control, segregation of duties,
usernames and passwords, etc., in different systems [1], [3] and
Copyright © Authors
[4]. Although many organizations employ fraud prevention
techniques, they only prevent simple kinds of fraud from
occurring and are not enough on their own [5]. Complex fraud
schemes built over time, involving various applications, are
difficult to prevent. Nevertheless, only a few publications deal
with fraud detection approaches in ERP systems [6], [7].
Another driver for better fraud detection, particularly in ERP
systems, is the shift towards service oriented architectures.
These architectures allow a higher degree of automation of
business processes, which may lead to more cases of fraud as
the number of human checks is reduced and the number of
entry points into the system is increased [8].
The audits conducted by auditors and fraud examiners to
detect fraud in ERP systems are generally very labour intensive,
requiring time, effort and resources [9]. Auditors need a
good understanding of the business, the ERP software and its
features to conduct effective audits. As audits are conducted
periodically, generally once every financial year, fraud is only
detected towards the end of the year. According to the KPMG
fraud survey [23], the average time to detect fraud is 18
months. Automated fraud detection approaches offer the
possibility of real-time detection that can be conducted
continuously, thereby identifying frauds as soon as they are
perpetrated and reducing both the overall financial losses and
the time to detect fraud.
An important analysis carried out by auditors is the
investigation of outliers or anomalies in the types of transaction
performed by users, their frequency and the transaction
amounts. In this paper, we propose the use of continuous and
automated outlier detection and distance measuring techniques
to detect frequency-based anomalous behavior. The intention is
to identify activity which may be indicative of financial fraud.
We use the term, transaction type to represent a single activity
in the system, and a transaction profile (tp) to denote a set of
distinct transaction types that one or more users have
performed [10]. A transaction profile may be associated with
one or many users and each user is associated with exactly one
transaction profile (as discussed in [10]). In particular, we
detect per-user anomalies in the frequency of each transaction
type within a transaction profile. We identify such univariate
outliers for each transaction type, using boxplots, a common
graphical outlier detection technique. We also detect per-user
anomalies using a set of transaction types within a transaction
profile - taking into account the entire set of transaction types.
We identify such cases using the Euclidean distance (ED), a
prevalent distance measuring technique, and a clustering
algorithm, k-means. The rationale here is to detect cases where
ISSN (Print): 2204-0595
ISSN (Online): 2203-1731
a combination of individual transaction type frequencies in a
transaction profile may constitute an outlier.
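The definitions of transaction type and transaction profile given above can be illustrated with a minimal sketch. It assumes the log is available as (user, transaction type) pairs; the function name `build_profiles` and the user ids are hypothetical, and only the transaction type names are taken from the dataset described later in the paper.

```python
from collections import Counter, defaultdict


def build_profiles(records):
    """Group users by the set of distinct transaction types they performed.

    records: iterable of (user, transaction_type) pairs, one per executed
    transaction (an assumed input layout, not the paper's actual schema).
    Returns (profiles, frequencies):
      profiles    maps frozenset of transaction types -> sorted list of users
      frequencies maps user -> Counter of per-type transaction counts
    """
    frequencies = defaultdict(Counter)
    for user, ttype in records:
        frequencies[user][ttype] += 1

    # Each user lands in exactly one profile: the set of types they used.
    profiles = defaultdict(list)
    for user, counts in frequencies.items():
        profiles[frozenset(counts)].append(user)

    return {tp: sorted(users) for tp, users in profiles.items()}, dict(frequencies)


# Hypothetical log entries for three anonymized users.
records = [
    ("u1", "requisition"), ("u1", "requisition"), ("u1", "goods_received"),
    ("u2", "goods_received"), ("u2", "requisition"),
    ("u3", "XK01"),
]
profiles, frequencies = build_profiles(records)
```

Here `u1` and `u2` share one profile because they used the same two transaction types, even though their frequencies differ; those per-type frequencies are what the detection techniques compare.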
The next section describes the related work in the field. The
paper follows with a discussion of the proposed approach,
using transaction profiles, in Section III. The methodology of
detection of univariate anomalies using Boxplots and
multivariate anomalies using Euclidean distance is presented in
Section III. The dataset, experiments and a discussion of the
results are presented in Section IV. The paper concludes with a
brief discussion on the current work and future directions
presented in Section V.
II. RELATED WORK
Typically, outliers are considered noise or errors in data
that may need to be discarded, usually in a preliminary step
before carrying out further data analysis. In our case, outliers
may signify users that behave in a suspicious or irregular
manner, and these rare and suspicious events are more
interesting than the frequently occurring ones. Barnett and Lewis’s
[11] classical definition of an outlier is “an observation that
appears to deviate markedly from other members of the sample
in which it occurs”. Ngai et al. [12] in their work argue that
there is a lack of research on the application of outlier detection
techniques to fraud detection, perhaps due to the complexity of
detecting outliers. The authors suggest that in the field of fraud
detection, outlier detection is highly suitable for distinguishing
fraudulent data from authentic data, and thus deserves more
investigation [12].
Outlier detection methods are also referred to as anomaly or
novelty detection methods, and have been employed to identify
credit card [13] and telecommunications fraud [14]. For
example, Fawcett et al. [14] propose a rule-based method for
detecting fraudulent usage of cellular phones based on profiling
customer behaviour. Their system, called DC-1, constructs a
fraud detection tool in three stages:
• rules are generated to distinguish fraudulent calls from
legitimate ones;
• these rules, along with a set of templates, are used to
build a collection of profiling monitors. Each monitor
examines behaviour based on one learned rule. In other
words, these monitors profile the typical behaviour of
each account in accordance with a rule, and describe
any deviations from the typical behaviour;
• the system finally weighs the monitor outputs and
generates an alarm if there is sufficient evidence of
fraudulent activity [14].
Clustering algorithms too have been applied to detect
anomalous behavior. Oh and Lee [25] detect anomalies in audit
trail data by profiling the transactions executed by the users.
They propose a method of clustering the activities of
transactions generated by a user and detect anomalies based on
each user’s profiles.
Outlier detection methods have been categorized into
univariate and multivariate techniques. In the next section, we
investigate related univariate statistical and graph-based
methods.
A. Univariate Outlier Detection Techniques
Perhaps one of the most accepted statistical outlier
detection techniques is the use of standard scores, also known
as Z-scores. These rescale raw data into its
equivalent standard score, that is, in accordance with a measure
of the overall data spread (the distribution’s standard deviation)
[11]. For example, given a sample of n observations, let $\bar{x}$ be the mean and
$s$ the standard deviation; an observation $x$ is considered an
outlier if:
$$z = \frac{x - \bar{x}}{s} > k \qquad (1)$$
where k is the outlier threshold, generally a value of 3 or even 4
standard deviations above the mean. The justification is that an
outlier will have a relatively large standard score (z), given that
it will be far from the distribution’s mean, around which about
95% of the data lies (within two standard deviations), assuming a
normal distribution. However,
Shiffler [16] shows that a k value of 3 or 4 precludes the
existence of outliers in samples of size n ≤ 10, or n ≤ 17,
respectively. The Z-score mean and standard deviation
estimates also give a good idea of the data shape.
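The sample-size limits attributed to Shiffler [16] can be checked numerically. The sketch below assumes the sample standard deviation (n - 1 denominator) and relies on the standard bound that no standard score in a sample of size n can exceed (n - 1)/√n; the bound itself and the function names are not quoted from the paper.

```python
import math
import statistics


def z_scores(sample):
    """Standard scores using the sample standard deviation (n - 1 denominator)."""
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)
    return [(x - mean) / s for x in sample]


def max_possible_z(n):
    """Largest standard score attainable in a sample of size n."""
    return (n - 1) / math.sqrt(n)


# Hypothetical sample: nine identical values and one very large value.
# However large the last value is made, its z-score cannot exceed the bound.
sample = [1] * 9 + [1000]
z = z_scores(sample)
largest = max(z)
```

With n = 10 the bound is below 3, and with n = 17 it is below 4, so thresholds of k = 3 and k = 4 can never fire at those sample sizes; at n = 11 and n = 18 the bound rises just above 3 and 4 respectively.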
Barnett and Lewis [11] present a comprehensive review of
statistical outlier detection methods with mathematical proofs
and discordancy tests for detecting outliers, where the
underlying distribution of the data is known (called parametric
techniques). In other words, outliers are observations that
deviate from the model assumptions. Unfortunately, these
methods depend on many assumptions, such as the knowledge
of the distribution, the distribution parameters, the number of
expected outliers and the type (univariate or multivariate) of
expected outliers [17].
In real world datasets, these factors or assumptions are often
not realistic. Consequently, a more robust outlier detection
technique is required. Thus, we adapt a well-known graphical
method for outlier detection, called the boxplot, which John
Tukey [18] proposed among many other exploratory data analysis
techniques. The boxplot is a non-parametric
(or distribution-free) technique based on the five-number
summary: lower extreme, lower quartile, median, upper
quartile and upper extreme. Figure 1 illustrates an example of a
boxplot. The middle line across the box marks the median,
dividing the data into two halves (see Figure 1). The rectangular
box around the median depicts the inter-quartile range (IQR),
that is, the distance between the 25th percentile (or lower
quartile, $Q_1$) and the 75th percentile (or upper quartile,
$Q_3$), so that $IQR = Q_3 - Q_1$. From the upper and lower
quartiles, dashed lines called whiskers extend in either
direction, up to k times the interquartile range, and stop at the
data point closest to this limit. Points beyond this limit are
tagged as outliers. k takes the values 1.5 and 3 for mild and
extreme outliers, respectively. The limits for a given k are the
lower and upper fences, computed as $Q_1 - k \cdot IQR$ and
$Q_3 + k \cdot IQR$, respectively. The fences for k = 1.5 and
k = 3 are also known as the inner and outer fences, and these
values were selected based on a normal distribution. However,
the authors of [19] argue that these k values have been shown to
tag outliers successfully in several datasets (and the technique
is therefore regarded as non-parametric).
The Euclidean distance, $d(i, j)$, satisfies the following
mathematical distance conditions [20]:
• $d(i, j) \ge 0$: distance is a non-negative number;
• $d(i, i) = 0$: the distance of an object to itself is zero;
• $d(i, j) = d(j, i)$: distance is a symmetric function; and
• $d(i, j) \le d(i, h) + d(h, j)$: going directly from object i
to object j in space is no more than making a detour
over any other object h (called the triangle
inequality).
The above conditions also hold when detecting anomalies
within the transaction frequencies of each transaction type within
a transaction profile.
Fig. 1. Example of a boxplot (adapted from [18]).
On visual inspection, the boxplot shows several data
features [19]: location - shown by the median; spread - shown by
the length of the box or the quartiles; skewness - for
instance, a median much closer to the lower quartile than
to the upper quartile indicates that the data is positively
skewed; tail length - shown by the points at which the whiskers
stop, determined primarily by the most extreme data values
that are within the outlier cutoffs or fences; and outliers of the
data - depicted by a plus (+) sign beyond the fences set by k.
In the next section, we present a detailed review of several
related multivariate statistical and distance-measuring
techniques, along with the motivation for using Euclidean
distance.
B. Multivariate Outlier Detection Techniques
Multivariate techniques are categorized into (a) statistical
methods that are typically parametric (depending on the data
distribution, for example: Mahalanobis distance) and (b)
data-mining based methods, which are often non-parametric (that is,
they do not rely on the data distribution parameters, for
example: Euclidean distance and k-means clustering
techniques) [17]. We choose non-parametric measures for
detecting multivariate outliers because they do not require the
dataset to have a normal distribution. This section provides a
brief overview of related statistical and data-mining
multivariate outlier detection methods, and presents the
motivation for choosing the selected method.
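In order to determine whether an observation is an outlier
or not, we incorporate in our proposed anomaly detection
approach a prevalent distance measure, which does not rely on
the distribution mean and the variance-covariance, called the
Euclidean distance. The Euclidean distance, $d(i, j)$, between
two p-dimensional instances $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$
and $x_j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ can be calculated as:
$$d(i, j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ip} - x_{jp})^2}.$$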
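As a small illustration of the distance calculation, the sketch below computes the Euclidean distance between frequency vectors and evaluates the four distance conditions from [20] on three sample points. The function name `euclidean` and the frequency values are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length frequency vectors."""
    if len(a) != len(b):
        raise ValueError("instances must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical users described by the frequencies of two transaction types.
u = (1, 1)
v = (4, 5)
w = (30, 1)

d_uv = euclidean(u, v)  # sqrt(3^2 + 4^2)
non_negative = d_uv >= 0
identity = euclidean(u, u) == 0
symmetric = euclidean(u, v) == euclidean(v, u)
triangle = euclidean(u, w) <= euclidean(u, v) + euclidean(v, w)
```

Each dimension corresponds to one transaction type, so a profile with three transaction types would use three-element vectors in the same function.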
Though the main objective of clustering techniques is to
segment data into related clusters, many clustering algorithms
such as k-means and k-medoids are also used for detecting
multivariate outliers [21]. Multivariate outliers are denoted as
data points that lie far apart from any other clusters and/or
represent a small cluster size. Clustering algorithms are
generally non-parametric and therefore do not assume an
underlying data distribution model. They determine clusters
based on a distance measuring function such as the Euclidean
distance. We have selected the k-means method for our
experimental analysis, as it is the best-known and most
commonly used clustering algorithm. The clustering algorithm
takes two parameters as input: k, the number of clusters or
partitions to be made (which must be set in advance), and X, the
dataset or matrix. The algorithm aims at minimizing the sum of the
object-to-centroid distances, thereby making the resulting k
clusters as compact and as separate as possible [20]. In the next
section, we demonstrate the validity and effectiveness of
Boxplot, Euclidean distance and the k-means clustering
technique for detecting univariate and multivariate outliers,
within transaction profiles.
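The k-means procedure described above can be sketched in plain Python. This is an illustrative version of the standard Lloyd iteration, not Matlab's built-in function used in the experiments; the `init` parameter (explicit starting centroids, chosen here to keep the run deterministic) and the `max_size` cutoff for "small" clusters are assumptions made for the example.

```python
import math
from collections import Counter


def kmeans(points, k, init, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update steps.

    points: list of equal-length tuples (the dataset X)
    k:      number of clusters
    init:   list of k starting centroids (assumption: supplied by the caller)
    """
    centroids = [tuple(c) for c in init]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


def small_clusters(labels, max_size=1):
    """Indices of points whose cluster has at most max_size members."""
    sizes = Counter(labels)
    return [i for i, lab in enumerate(labels) if sizes[lab] <= max_size]


# Hypothetical frequencies of two transaction types for five users.
points = [(1, 1), (2, 1), (1, 2), (2, 2), (1, 30)]
labels, centroids = kmeans(points, k=2, init=[points[0], points[4]])
suspects = small_clusters(labels)
```

In this toy data the fifth user ends up alone in its own cluster, which is the "small cluster size" pattern the text treats as a candidate multivariate outlier.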
III. ANOMALY DETECTION APPROACH
In order to detect univariate anomalies, we flag users whose
frequency of a particular transaction type is much higher
than that of other users who have performed the same
transaction type within that particular transaction profile. We
identify anomalous transaction frequencies by constructing a
boxplot for each transaction type in a profile, where the
threshold for anomalous user transaction frequencies, tf, is
set with k.
The objective is to flag the most suspicious or highly
unusual usage of transaction types, hence we set the value of k
to 3 (which is recommended for the detection of extreme
outliers that lie outside the outer fence). This implies that no
outliers are detected in transaction profiles with a small number
of users. Shiffler [16] suggests that a boxplot may wrongly
identify some observations as outliers in datasets which have a
small sample size ($n \le 10$). Thus, we decided to set the
minimum number of users in a transaction profile to be greater
than ten. Note that this anomaly type considers the
upper quartile and upper fence only, not the lower quartile.
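The univariate rule described above (upper fence only, k = 3, profiles with more than ten users) can be sketched as follows. The quartile definition is an assumption: this sketch uses Python's `statistics.quantiles` with the inclusive method, which may differ slightly from the quartiles drawn by Matlab's boxplot. The function name and user ids are hypothetical; the frequency of 56 echoes the outlier the paper later reports for tp 1296, while the other values are invented.

```python
import statistics


def upper_fence_outliers(freq_by_user, k=3.0, min_users=11):
    """Flag users whose frequency exceeds Q3 + k * IQR for one transaction type.

    freq_by_user: dict mapping user id -> frequency of a single transaction
    type within one transaction profile. Profiles with fewer than min_users
    users are skipped (the text requires more than ten users).
    """
    if len(freq_by_user) < min_users:
        return {}
    values = sorted(freq_by_user.values())
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    fence = q3 + k * (q3 - q1)
    return {user: f for user, f in freq_by_user.items() if f > fence}


# Hypothetical profile of twelve users and one transaction type.
counts = [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 56]
frequencies = {f"u{i:02d}": c for i, c in enumerate(counts, start=1)}
flagged = upper_fence_outliers(frequencies)
```

With these values the quartiles are 1 and 2, so the outer fence sits at 5 and only the user with 56 transactions is flagged; lowering k to 1.5 would also flag nothing new here, since the fence would be 3.5 and the next value is 3.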
TABLE I.
We detect multivariate anomalies in frequency usage of the
set of transaction type(s) present within a transaction profile.
The Euclidean distance between the transaction type frequencies
of each pair of users within a
transaction profile is calculated (where the multiple variables
are the different transaction types). A Euclidean distance above a
certain threshold value is used as a criterion
to flag users based on all the transaction types performed and
their frequencies. These users may or may not have univariate
outliers in each feature (that is, each transaction type) within a
transaction profile, but the whole observation (set of
transaction type frequencies) may result in a multivariate
outlier. We automatically set the threshold value based on the
mean of the highest distances between users within all
transaction profiles in the dataset. The technique flags pairs of
users within transaction profiles. To find out if one or both
users within a pair are anomalous within a transaction
profile, we flag for further investigation the user(s) that occur
most often amongst the user pairs which are above
the Euclidean distance threshold (implying that their transaction
usage is different from all others within that profile).
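The pairwise procedure described above can be sketched as follows, under stated assumptions: each profile is given as a mapping from user id to a tuple of transaction type frequencies, no normalization is applied, and ties for the most frequent user are all flagged. The function names, profile labels and frequency values are hypothetical.

```python
import math
import statistics
from collections import Counter
from itertools import combinations


def pairwise_distances(profile):
    """Euclidean distance for every pair of users in one transaction profile."""
    return {
        (a, b): math.dist(profile[a], profile[b])
        for a, b in combinations(sorted(profile), 2)
    }


def flag_users(profiles):
    """Return (threshold, flagged users per profile).

    The threshold is the mean of the highest pairwise distance found in each
    profile; within a profile, the users appearing most often in pairs above
    the threshold are flagged.
    """
    distances = {tp: pairwise_distances(users) for tp, users in profiles.items()}
    threshold = statistics.mean([max(d.values()) for d in distances.values()])
    flagged = {}
    for tp, d in distances.items():
        counts = Counter(
            user for pair, dist in d.items() if dist > threshold for user in pair
        )
        if counts:
            top = max(counts.values())
            flagged[tp] = sorted(u for u, c in counts.items() if c == top)
    return threshold, flagged


# Two hypothetical profiles, each with two transaction types.
profiles = {
    "A": {"a1": (1, 1), "a2": (2, 1), "a3": (1, 2), "a4": (30, 1)},
    "B": {"b1": (1, 1), "b2": (2, 2), "b3": (1, 2)},
}
threshold, flagged = flag_users(profiles)
```

User `a4` appears in all three pairs that exceed the threshold, while every other user in profile A appears in only one, so `a4` alone is flagged; profile B has no pair above the threshold and produces no alert.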
For the multivariate k-means analysis, we detect data points
that lie far apart from any other clusters and/or represent a
small cluster size. Our algorithm groups similar objects
together, based on the principle that objects within a cluster
have higher similarity (based on the Euclidean distance)
amongst themselves in comparison to objects in another
cluster. We visually represent the clusters using Matlab’s
silhouette plots (detailed in Section IV D) to identify the
anomalous users.
For detecting multivariate outliers, we consider transaction
profiles which are associated with at least two transaction
types. The users in such a profile typically belong to one role, as
they have performed the same transaction type set over a period
of time.
IV. EXPERIMENTAL RESULTS AND DISCUSSION
The approaches were implemented in Matlab, using the
boxplot and Euclidean distance functions and parameters. All
results were generated and analyzed using Matlab.
A. Dataset
To assess the effectiveness of the proposed method, we
performed experiments on a real dataset collected for a period
of about eight months between the 17th of June 2008 and 16th of
February 2009. The dataset contains 81,047 records and 9,383
users, who have performed 17 different transaction types. All
usernames in the dataset were anonymized. To improve the
overall detection mechanism, transaction profiles with a user
group of more than ten users were selected. Amongst the 68
transaction profiles, 10 profiles were identified with a user
group of ten or more users. These transaction profiles represent
one to three distinct transaction types. The dataset is extracted
from an operational environment, and consists of real users and
activities. We identify anomalous activities alerted by use of
the techniques described above in order to demonstrate the
effectiveness of our approach. A careful examination of the
anomalies is then carried out for verification.
UNIVARIATE OUTLIERS
B. Discussion of Anomalies Detected With Boxplots
Univariate outliers were identified with boxplots for each
transaction type in a profile. Table I summarizes the results of
the experiments, showing the transaction profile id, the
number of users in each transaction profile, the total
number of transaction types, the transaction names of the
transactions in the profile, the total number of records
identified as outliers with the boxplot, the most extreme
outliers identified from the visual impression of the boxplots
and the transaction names of the most extreme outliers.
Boxplots were constructed for all transaction profiles with
more than ten users (listed in Table I). The median value (as
portrayed on the boxplots) for all transaction types is around
one, in almost all cases, as the majority of users have
performed the transaction type only once, during the period for
which the data has been extracted (in other words, the boxplots
show no lower quartile because the median and the lowest
frequencies are the same or very close to equal). Table I
covers six unique transaction types (XK01, XK02,
invoice_approved, requisition, goods_received and
conf_approval_1) out of the 17 transaction types in the dataset.
Transaction profiles containing one transaction type are tp
1299, 1296, 1314, 1291 and 1316 (see Table I). From the 9,300
or so records in the dataset, the boxplot has tagged 651
univariate outliers for the six transaction types (as shown in
Table I). These 651 outliers include a count of all the users
for a particular frequency value. For example, in a tp, an outlier
value of frequency 2 may be denoted by a single plus (+) sign
on the boxplot for a particular transaction type, but it may
consist of, say, 86 users. An investigator will need to review
each anomaly to understand whether it indicates fraud or not.
Since a transaction type performed twice by 86 users does not
seem like an anomaly, the investigator can either adjust the
threshold parameters or exclude these from further
investigation. In general, it may be observed that the
transaction profiles with the largest user groups had the highest
number of outlying values. The threshold value could be
further adjusted for these transaction profiles to obtain the
optimal number of outliers. The overall percentage of records
identified as outliers with the boxplot is
reasonably small, at about 7% (651/9300) of the dataset.
However, the most extreme outliers identified from the visual
impression of the boxplots amount to a much smaller number
(7), constituting about 0.07% (7/9300) of the dataset. These
most extreme outliers can be readily identified at a quick
glance at the boxplots. From the total number of records
identified as outliers with the boxplot, the most extreme
outliers were selected from the visual impression of the
boxplots based on the following criteria:
• the distance (as observed from the visual impression) or
frequency displayed on the y-axis, from the most
extreme outlier value to its second or next data point in
the boxplot; and
• whether the transaction frequency value of the outlier is
particularly high.
For an auditor, it is interesting to investigate the most
extreme outlying values from the visual impression of the
boxplots. The most prominent outlying value is observed in
transaction profile 1296. The outlier consists of a user who has
performed only one transaction type (that is, requisition), 56
times, which is significantly higher compared to the remaining
281 users in the profile. Similarly the most extreme outlier in
transaction profile 1314, represents a user who has just
performed configuration approvals, 453 times. The profile
consists of 568 users, and no user except for this particular user
has performed configuration approvals more than 259 times.
Transaction profile 1291, has two outlying values, where both
users have performed invoice approvals, 115 and 95 times,
respectively. Although these frequency values may not
necessarily be regarded as high transaction usage, they appear
distant – and hence anomalous, compared to the other group of
users depicted in the boxplot.
In contrast, in transaction profiles 1316 and 1320, the
outlying values are close to each other, and thus may not
represent potentially fraudulent activities; alternatively, they
may both be fraudulent. In transaction profile 1297, the most
extreme outlier represents a user who has executed 26 invoice
approvals. Though the boxplot has marked the transaction as an
outlier, the overall low transaction usage of the transaction
types in this profile suggests that it may not be an outlier.
The high transaction usage of a particular transaction type
may imply that the transaction is one of the main
responsibilities defined by the user’s job function or role in the
organization. This can be verified by examining the SAP R/3
system’s user-role and role-transaction type tables. Users
who have performed only one transaction type for the entire
period are interesting to investigate: they might be valid
users who have changed their job function or been
promoted, meaning that they have been assigned a new role for
accessing the system and the outlying values are transactions
performed under their previous role. Alternatively, they might be
synthetic user ids created by valid users to perform fraudulent
activities. Either way, they are anomalous, and these
anomalous values require an in-depth analysis.
In the next section, we discuss the multivariate anomalies
detected using Euclidean distance.
C. Discussion of Anomalies Detected With Euclidean
Distance
For the multivariate approach, the dataset is stored in
MySQL for analysis. For manual analysis and investigation of
the flagged users, multiple SQL queries and reports are
generated. Prior to running the distance measuring techniques
we normalize the dataset using the z-score method (discussed
in Section II A).
To improve the overall detection mechanism, we select
transaction profiles that have at least two transaction types and
ten users for our analysis. It may be observed from Table I that
among the ten transaction profiles, only five fulfill these
criteria. Transaction profiles containing one
transaction type (tp ids 1299, 1296, 1314, 1291 and
1316) are excluded from the current multivariate analysis (these
profiles were included in the univariate analysis). The
remaining five transaction profiles represent at least two
distinct transaction types (see Table II). Table II presents a
summary of the experimental results, showing the transaction
profile id, the number of users in each transaction profile,
the total number of transaction types, the transaction
names of the transactions in the profile, the maximum
Euclidean distance between any two users (or a pair of users)
within the profile, the mean of all Euclidean distances for each
pair of users, the total number of records identified as
multivariate outliers based on the Euclidean distance
threshold and, for comparison, the most
extreme outliers identified from the visual impression of the
boxplots, taken from Table I.
TABLE II. MULTIVARIATE OUTLIERS
The Euclidean distance was calculated for pairs of users
within the five transaction profiles, to detect multivariate
outliers. Based on the mean of the highest Euclidean distances in
each of the profiles (shown in Column 5 of Table II), we set the
Euclidean distance threshold value to 5.3. As the
threshold value is increased, the total number of records
identified as multivariate outliers decreases, leaving only
the most anomalous users. However, with different datasets,
different numbers of users, transaction types and profiles, it may
be useful for a fraud examiner or an auditor to deduce an
appropriate threshold value from the highest Euclidean
distances in each transaction profile. In our dataset, the
minimum Euclidean distance between a pair of users, in all five
transaction profiles is zero. This may occur, for example when
users in a transaction profile perform the same two or three
transaction types with the same frequency (in our case a
frequency value of one or two). From the user pairs that have a
Euclidean distance greater than 5.3, we identify and flag users that
occur the highest number of times within each transaction
profile. A total of five multivariate outliers are detected using
the Euclidean distance measuring technique. No outliers are
flagged in tp 1302 as the highest ED value is below 5.3. In
tp 1297, two users are flagged as they appear the highest
number of times amongst all user pairs which have a Euclidean
distance above 5.3 in this profile. Interestingly, all the other 23
users in the profile have done both the transaction types less
than or equal to 13 times, however the two flagged users have
performed one of the transactions types only once and the
other, 30 and 26 times. These users are suspicious as one of the
transaction types has the highest frequency values in the
profile, whilst the other has only been executed once.
For tp 1326, amongst the 66 user pairs that were above the
Euclidean distance threshold, user ’agKRcVoNk’ appeared 19 times,
indicating that this particular user’s activities are very different
and potentially fraudulent when compared to all other users in
the dataset. On manual analysis of the transaction profile,
compared with all other 579 users in the transaction profile, the
flagged user (anonymized user name: ’agKRcVoNk’), appears
most anomalous (as shown in Table III). Table III, shows the
anonymized username and the transaction frequencies of the
goods received and requisition transaction types. We pick a
sample of three users for demonstration purposes; other users
amongst the 579 in the transaction profile exhibit similar
behaviour. It may be observed that while three of the users
(’cpRfDYZ0X’, ’U13GQSjxJ’ and ’arGVUHzWg’) have
performed the goods received transaction most during the
period for which the dataset has been extracted, user:
’agKRcVoNk’ is the user who has executed the requisition
transaction the most. One assumption may be that this user’s
main responsibility is to perform the requisition transaction as
part of their job function, however, the frequency usage
appears anomalous and merits further investigation.
The technique flags one user in tp 1327 as anomalous,
where the ED is greater than 5.3. This particular user appears
around 200 times in the user pairs which have a Euclidean
distance greater than 5.3. On manual investigation of user
‘w1ElHuBUAp’ we found that the transaction usage pattern
differed considerably compared to other users within the
transaction profile. One particular transaction type has been
performed a lot more times than the other two transaction types
(as depicted in Table IV). Table IV presents the anonymized
username and the transaction frequencies of the goods
received, requisition and invoice approved transactions,
performed by the user. For an auditor or fraud examiner, this
user is perhaps the most interesting or potentially suspicious
due to the extent of their involvement in the total number of
generated user pairs.
TABLE III. MULTIVARIATE OUTLIERS IN TP ID 1326
TABLE IV. MULTIVARIATE OUTLIER IN TP ID 1327
TABLE V. MULTIVARIATE OUTLIERS IN TP ID 1320
In tp 1320, user ’SBjThyxGU’ is flagged. On manual
analysis of the transaction frequency values of this user, we
found some very unusual behaviour: the goods_received
transaction is performed only six times, while the
invoice_approved transaction has been performed 724 times,
the highest frequency in this transaction profile
(see Table V). Table V presents the anonymized username and
the transaction frequencies of the goods received and invoice
approved transaction types performed by the user. This user
needs to be investigated by auditors to confirm whether the
behaviour is fraudulent.
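The pair-based Euclidean distance (ED) flagging discussed above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the profile names, user names and frequency vectors are hypothetical, the function names are ours, and any scaling the authors apply to frequencies before computing distances is not reproduced. The threshold follows the description given in this paper (the mean of the highest pairwise distance in each transaction profile), and users are ranked by how many above-threshold pairs they appear in.

```python
from itertools import combinations
import math


def euclidean(a, b):
    """Euclidean distance between two frequency vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def max_pair_distance(profile):
    """Largest ED between any two users of one transaction profile."""
    return max(euclidean(profile[u], profile[v])
               for u, v in combinations(profile, 2))


def auto_threshold(profiles):
    """Mean of the per-profile maximum distances (assumed reading of the text)."""
    maxima = [max_pair_distance(p) for p in profiles.values()]
    return sum(maxima) / len(maxima)


def pair_counts_above(profile, threshold):
    """Count, per user, the user pairs whose ED exceeds the threshold."""
    counts = {}
    for u, v in combinations(profile, 2):
        if euclidean(profile[u], profile[v]) > threshold:
            counts[u] = counts.get(u, 0) + 1
            counts[v] = counts.get(v, 0) + 1
    return counts


# Hypothetical data: user -> (frequency of type 1, frequency of type 2).
profiles = {
    "tp_a": {"u1": (1, 1), "u2": (2, 1), "u3": (1, 2), "u4": (9, 1)},
    "tp_b": {"v1": (1, 1), "v2": (2, 2), "v3": (1, 2)},
}
threshold = auto_threshold(profiles)
counts = pair_counts_above(profiles["tp_a"], threshold)
most_involved = max(counts, key=counts.get)
print(most_involved, counts[most_involved])
```

In this toy profile, the user with the outlying frequency vector takes part in every above-threshold pair, which mirrors the observation that the flagged user in tp 1327 appears in around 200 such pairs.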
In the next section, we evaluate and discuss the
effectiveness of employing the k-means clustering algorithm
for the detection of multivariate outliers within transaction
profiles.
D. Discussion of Anomalies Detected With K-means
As mentioned earlier, the experiments are conducted using
Matlab’s built-in k-means function, which takes two
parameters as input – k, the number of clusters and X, the
dataset. We select five transaction profiles from the dataset for
our multivariate outlier analysis: tps 1302, 1297, 1326, 1327
and 1320, each of which has at least two transaction types and
ten users. For each of these profiles, we perform an array of
experiments with a different number of pre-selected clusters
(k values). We use Matlab’s silhouette plots to measure
the quality and strength of each particular grouping of records
or cluster, for the k parameter values of 2-5. The silhouette
value for each point in the dataset is a measure of how similar
that point is to points in its own cluster, compared to points in
other clusters [24]. The silhouette values range from:
+1, representing points that are very distant from
neighboring clusters, through
0, indicating points that are not distinctly in one cluster
or another, to
-1, signifying points that are probably assigned to the
incorrect cluster.
In other words, +1 indicates that the point is well classified
within a cluster - implying that the point is much closer, on
average, to the other members in its cluster than to the
members of the neighbouring clusters, 0 indicates that the
object lies between clusters and -1 indicates that the point is
poorly classified within a cluster - that is, on average, the
members of the neighbouring cluster are closer to the point
than the members of its own cluster [15].
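The interpretation above corresponds to the usual silhouette definition s(i) = (b - a) / max(a, b), where a is the mean distance from point i to the other members of its own cluster and b is the lowest mean distance from i to the members of any other cluster. The following Python sketch computes it by hand on hypothetical points. Plain Euclidean distance and a value of 0 for single-member clusters are assumptions of this sketch; Matlab's silhouette function may use a different default metric and convention.

```python
import math


def silhouette_values(points, labels):
    """Silhouette value of every point; assumes at least two clusters."""
    def dist(p, q):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

    # Group point indices by cluster label.
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)

    values = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            values.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: lowest mean distance to the members of any other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        values.append((b - a) / max(a, b))
    return values


# Hypothetical two-dimensional frequency vectors.
points = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11)]
good = silhouette_values(points, [0, 0, 0, 1, 1])  # well-separated grouping
bad = silhouette_values(points, [0, 0, 1, 1, 1])   # third point misassigned
print([round(v, 2) for v in good])
print(round(bad[2], 2))
```

With the well-separated grouping all values are close to +1, while the deliberately misassigned point in the second grouping receives a value close to -1.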
In order to create clusters that are as compact as possible,
we repeat the clustering algorithm 500 times using the
’replicates’ parameter in Matlab. Each replicate starts from a
new set of initial centroids, and the solution with the lowest
total within-cluster distance is returned, which reduces the
risk of reporting a poor local minimum for each transaction
profile.
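The effect of repeating the clustering can be sketched in Python as follows. This is an illustrative re-implementation of Lloyd's k-means algorithm with random restarts, not the Matlab function used in the experiments; the function names, the seed, the number of restarts and the data points are all hypothetical. The run with the lowest within-cluster sum of squared distances is kept.

```python
import random


def _sqdist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def lloyd(points, k, rng, max_iter=100):
    """One k-means run from randomly chosen initial centroids."""
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: _sqdist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the run has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    cost = sum(_sqdist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, cost


def kmeans_replicates(points, k, replicates=500, seed=0):
    """Repeat k-means and keep the run with the lowest total cost."""
    rng = random.Random(seed)
    best = None
    for _ in range(replicates):
        result = lloyd(points, k, rng)
        if best is None or result[2] < best[2]:
            best = result
    return best


# Hypothetical frequency vectors: two groups and one isolated user.
points = [(1, 1), (1, 2), (2, 1), (2, 2),
          (10, 10), (10, 11), (11, 10), (30, 1)]
labels, centroids, cost = kmeans_replicates(points, k=3, replicates=50)
sizes = sorted(labels.count(c) for c in set(labels))
print(sizes, round(cost, 3))
```

A single run can stop in a local minimum that merges the isolated user with one of the groups; repeating the run from different starting centroids makes it far more likely that the compact solution, with the isolated user in a cluster of its own, is the one reported.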
1) Detecting Anomalies in Transaction Profile 1302
Tp 1302 consists of 11 users who have all performed two
transaction types: XK01 and XK02. Figures 2 (a), (b), (c) and
(d) show the silhouette plots depicting different grouping
structures for k partitions of 2 - 5. The y-axis shows the cluster
number and the x-axis displays the silhouette values. Each
grouping or cluster indicates the number of users in that cluster
(represented by the vertical bars) and their silhouette value
(suggesting how similar a point is to points in its own cluster
compared to points in other clusters). The users that belong to
each cluster in the silhouette plots (cluster structures shown in
Figure 2) are summarized in Table VI (a), (b), (c) and (d),
respectively. A quick glance at the silhouette plots in Figure
2 shows that three clusters (k = 3) suit transaction profile
1302 well, as: (i) there are no negative silhouette values,
which would signify that points are wrongly assigned to a
cluster, (ii) most points have a silhouette value of around 1,
signifying that the clusters are well separated from each
other, and (iii) the grouping structure isolates a cluster with
only one observation, which defines an outlier.
Fig. 2. Silhouette plots depicting k partitions of 2 - 5 in tp 1302.
TABLE VI. TABLES SHOW USERS IN EACH CLUSTER FOR 4 DIFFERENT GROUPINGS IN TP 1302
2) Detecting Anomalies in Transaction Profile 1297
Next, we examine transaction profile 1297, which
comprises 25 users who have all performed two transaction
types: requisition and invoice approvals. The silhouette plots
constructed for k values of 2 to 5 are shown in Figure 3 and the
number of users within each cluster is summarized in Table
VII. Looking at this set of plots, it may be observed that most
users (in Figures 3 (a), (b), (c) and (d)) exhibit similar
behaviour and belong to a single cluster. A high number of
users in an individual cluster in all plots may suggest normal
or legitimate behaviour. Plots (b) and (c), that is, where the
number of clusters is 3 and 4 respectively, show a few members
that are allocated to unrelated clusters. On closer
investigation of the results, we found that the outlying
observations in C1 (in plot (a)), C1 (in plot (b)), C3 (in plot
(c)), and C2 and C3 (in plot (d)) are exactly the same 3 users.
It may be observed that five clusters (k = 5) are well suited
to transaction profile 1297, where the outlying observations
are in C1 and C3. On further investigation of these two
clusters, we verified that they indeed indicate abnormal or
irregular behaviour, since the transaction types have been
performed with the highest frequency values within the
transaction profile.
Fig. 3. Silhouette plots depicting k partitions of 2 - 5 in tp 1297.
TABLE VII. TABLES SHOW USERS IN EACH CLUSTER FOR 4 DIFFERENT GROUPINGS IN TP 1297
3) Detecting Anomalies in Transaction Profile 1326
Transaction profile 1326 consists of 583 users, who have
performed two transaction types: goods received and
requisition. Figure 4 shows the discovered clusters for 2-5
partitions and Table VIII presents the number of users in each
of the four different groupings. Since the number of users
within this transaction profile is higher, we created silhouette
plots for 2 to 10 clusters. However, when the number of
clusters was set to ≥5, more and more observations had a
negative silhouette value (therefore, Figure 4 depicts 2 to 5
clusters). From the plots, it can be observed that for this
particular transaction profile, four clusters (k = 4) are
optimal. From the visual patterns revealed in the silhouette
plots, a similar trend to transaction profile 1297 may be
observed, where most users belong to a single cluster. Looking
at the large number of users in these single clusters (more
than 500 users in each as shown in Table VIII), it may be
assumed that these user activities represent normal behaviour.
As users in the transaction profile have performed the same
transaction types, they only differ in terms of the usage or
frequency of those transaction types.
Fig. 4. Silhouette plots depicting k partitions of 2 - 5 in tp 1326.
TABLE VIII. TABLES SHOW USERS IN EACH CLUSTER FOR 4 DIFFERENT GROUPINGS IN TP 1326
Fig. 6. Silhouette plots depicting k partitions of 2 - 5 in tp 1320.
From an auditor’s viewpoint, these small clusters
encompassing about 10 overlapping users across all plots (in
Figure 4), portray potentially suspicious behaviour and merit
closer investigation. On a manual investigation of these
observations in the dataset, we found that these users have
performed both transaction types with exceptionally high
frequencies.
4) Detecting Anomalies in Transaction Profile 1327
This transaction profile is associated with 663 users and
three distinct transaction types, namely goods_received,
invoice_approved and requisition. As in the previous profiles,
most users in transaction profile 1327 belong to an individual,
large cluster (C2 as shown in Figure 5 (a)). However, it is
interesting to note that when the value of k is set to 3,
another group of around 130 users is fragmented from the larger
cluster, implying that there may be at least two main
categories of users in this profile that differ in their
frequency of transaction types. When the number of partitions
is set to 4 (see plot (c)), another small group of users
emerges as a separate cluster.
Fig. 5. Silhouette plots depicting k partitions of 2 - 5 in tp 1327.
TABLE IX. TABLES SHOW USERS IN EACH CLUSTER FOR 4 DIFFERENT GROUPINGS IN TP 1327
TABLE X. TABLES SHOW USERS IN EACH CLUSTER FOR 4 DIFFERENT GROUPINGS IN TP 1320
The results are most interesting when the number of
clusters is set to 5 (in plot (d)), whereby C1, C3 and C4
represent three small groups of users, each with the
characteristics of an outlier. Based on the empirical analysis of
the silhouette plots, the results were identical for k = 6 - 10.
On a manual verification of the results, we found that amongst
the 3 transaction types, the 4 users in the smallest cluster, C4
(where k = 5), have performed only one transaction type
(invoice approved) with unusually high frequency values (of
around 100), while the other two transactions have been
performed about 5 times. Likewise, the 5 users in C1 and the
12 users in C3 have relatively large frequency values for only
the requisition transaction. Also of note are the 150 users in C2
who have on average performed all transaction types 30 - 60
times, whilst the 492 users in C5 have predominantly executed
all transaction types with a frequency of 20 or less.
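The reading of small clusters as outliers can be made explicit with a short Python sketch. The cluster sizes are those reported above for tp 1327 with k = 5; the user identifiers are placeholders, and the 5% size cut-off is a hypothetical parameter of this sketch, not a value taken from the experiments.

```python
from collections import Counter


def small_cluster_members(users, labels, max_fraction=0.05):
    """Return the members of every cluster holding at most max_fraction of users."""
    sizes = Counter(labels)
    total = len(labels)
    small = {c for c, size in sizes.items() if size / total <= max_fraction}
    return {c: [u for u, lab in zip(users, labels) if lab == c]
            for c in sorted(small)}


# Cluster sizes reported for tp 1327 (k = 5); user names are placeholders.
labels = (["C1"] * 5 + ["C2"] * 150 + ["C3"] * 12
          + ["C4"] * 4 + ["C5"] * 492)
users = [f"user_{i}" for i in range(len(labels))]

flagged = small_cluster_members(users, labels)
print({cluster: len(members) for cluster, members in flagged.items()})
```

Under this cut-off the three small clusters C1, C3 and C4 are returned for review, while the two large clusters C2 and C5 are treated as normal usage. An auditor could raise or lower the cut-off to tune the number of alerts.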
5) Detecting Anomalies in Transaction Profile 1320
Transaction profile 1320 contains the maximum number of
users (1,892) in the dataset, who have performed the goods
received and invoice approval transactions. As in the
previous four cases (that is, tp 1302, 1297, 1326 and 1327),
most members in this profile belong to a single cluster (see
Figure 6). As mentioned earlier, we believe that a single large
cluster may indicate the normal usage or frequency of
transaction types for the particular organizational role
represented by this transaction profile. The set of generated
clusters (see Table X) is well-defined and the groupings
remain unaffected as the data is further divided (where k = 2 −
4). Two outlying clusters with 9 and 18 users, both with
strikingly high frequency values, are detected in the four
different grouping structures. Nevertheless, an additional
cluster of 93 users occurs when k is set to 5; these users are
grouped into a separate cluster as they have performed
transactions with a frequency of 25 - 30. The users in the
largest cluster, C5, have executed all transactions with a
frequency of about 10.
In summary, we observed that the multivariate analysis
using the Euclidean distance and the k-means clustering
algorithm yielded some similar results. For example, consider
transaction profile 1297, where user ’SoaJjPi1l’ has
performed two transaction types: requisition and
invoice_approved. This particular user has been flagged for its
frequency usage (26) of the requisition transaction by both
methods: the multivariate Euclidean distance (detected in tp
1297 – listed in Table II, Row 2) and the k-means analysis
(depicted by C1 in Figure 3 (c) and Table VII, Row 1). User
’862BhD247’, who has performed the invoice_approved
transaction type 30 times within the same transaction profile
1297, has also been detected by both multivariate techniques:
the Euclidean distance (detected in tp 1297 – listed in Table II,
Row 2) and the k-means analysis (depicted by C1 in Figure 3
(c) and Table VII, Row 1).
Both multivariate approaches have their pros and cons;
however, we believe that for our dataset, the ED multivariate
method is more suited, primarily for two reasons. (1) With the
ED multivariate analysis, the threshold is automatically set
based on the mean of the highest distances between users
within all transaction profiles in the dataset. For the
clustering approach, by contrast, the number of clusters needs
to be predetermined and manually set. (2) With the k-means
analysis, the number of predetermined clusters affects the
number and sensitivity of alerts generated. On the other hand,
the ED approach detected a small number of more sensitive
alerts.
V. CONCLUSION AND FUTURE WORK
In this paper, we have presented two main contributions: (a)
the detection of univariate anomalies within transaction profiles
using boxplots and (b) the detection of multivariate anomalies
using Euclidean distance. The experimental results suggested
that the techniques have successfully tagged anomalous,
potentially fraudulent behaviour in the dataset. The flagged
outliers were manually investigated and verified from the
dataset.
An inherent limitation of the approach is that boxplots are
an informal method, that is, they are not statistically verified
for the detection of outliers. Nevertheless, the constraints and
assumptions set by statistical approaches (as mentioned
previously) have made the use of boxplots suitable for many
large practical applications [22].
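For reference, the boxplot rule can be sketched in a few lines of Python. The sketch uses Tukey's conventional fences, placed 1.5 interquartile ranges beyond the first and third quartiles; the multiplier k stands in for the adjustable outlier threshold, and its value here, the quartile interpolation method, the function name and the frequencies are all assumptions of this sketch rather than settings taken from the experiments.

```python
import statistics


def boxplot_outliers(frequencies, k=1.5):
    """Return users whose frequency lies outside the boxplot fences."""
    values = sorted(frequencies.values())
    # Quartiles of the sample; the 'inclusive' method is an assumed choice.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return {user: freq for user, freq in frequencies.items()
            if freq < lower or freq > upper}


# Hypothetical frequencies of one transaction type within one profile.
frequencies = {"u1": 3, "u2": 4, "u3": 5, "u4": 5,
               "u5": 6, "u6": 7, "u7": 8, "u8": 120}
outliers = boxplot_outliers(frequencies)
print(outliers)
```

Because the fences depend only on the quartiles, a single extreme frequency does not shift them, which is one reason the rule is usable without distributional assumptions. Raising k produces fewer alerts and lowering it produces more.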
The multivariate analysis using the ED and the k-means
algorithm also yielded many significant results. An auditor or
fraud examiner would use all three techniques together to
detect different types of anomalies (as each technique has its
own benefits). While each alert detected is significant in its
own right, more interesting users may be flagged in two or
more techniques.
In addition, for the k-means clustering technique we
observed a smooth trade-off, where if the number of clusters is
reduced, users quickly collapse into unrelated groups, while if
the number of preselected clusters is increased, users tend to
fall into their own single clusters. It can also be noted
that most users in the dataset collapsed into a single large
cluster, suggesting that users in the dataset have performed
related activities (a similar phenomenon occurs with all our
clustering experiments). The smaller clusters, with a few
observations, represent data points that are far apart from the
vast majority of observations. For an auditor, the users in these
outlying clusters may be interesting to investigate, as they have
performed transaction types with markedly high frequencies,
thus portraying unusual and potentially suspicious behaviour.
Our future work will focus on incorporating time analysis
into the anomaly detection. At the moment, our transaction
profiles are based on transaction types and frequencies only,
without regard for the period during which the transaction
types are performed. Incorporating time will naturally affect
both the nature of transaction profiles and the processing
involved. It will provide the benefit of being able to detect
much more subtle differences - possibly anomalies - amongst
users.
REFERENCES
[1] A. G. Little and P. J. Best, “A framework for separation of duties in an
SAP R/3 environment,” Managerial Auditing Journal, vol. 18, no. 5, pp.
419–430, 2003.
[2] P. Bingi, M. K. Sharma, and J. K. Godla, “Critical issues affecting an
ERP implementation,” Information Systems Management, vol. 16, no. 3,
pp. 7–14, 1999.
[3] Y. F. Musaji, Integrated Auditing of ERP Systems. New York: John
Wiley and Sons, 2002.
[4] J. D. O’Gara, Corporate Fraud: Case Studies in Detection and
Prevention. New Jersey: John Wiley and Sons, 2004.
[5] R. Bolton and D. Hand, “Statistical fraud detection: A review,”
Statistical Science, vol. 17, no. 3, pp. 235–249, 2002.
[6] W. S. Albrecht, C. Albrecht, and C. C. Albrecht, “Current trends in
fraud and its detection,” Information Security Journal: A Global
Perspective, vol. 17, no. 2, pp. 2–12, 2008.
[7] P. J. Best, “Computer assisted auditing techniques,” Queensland
University of Technology, Brisbane, QLD, 2007.
[8] A. Clark, G. Mohay, and P. Best, “Integrated financial fraud detection in
enterprise applications,” Information Security Institute, Queensland
University of Technology, Brisbane, QLD, 2005.
[9] J. T. Wells, Fraud Casebook: Lessons from the Bad Side of Business.
New Jersey: John Wiley and Sons, 2007.
[10] R. Khan, M. Corney, A. Clark, and G. Mohay, “Transaction mining for
fraud detection in ERP Systems,” Industrial Engineering and
Management Systems, vol. 9, no. 2, pp. 141–156, 2010.
[11] V. Barnett and T. Lewis, Outliers in Statistical Data, 3rd ed. England:
Wiley and Sons, 1994.
[12] E. W. T. Ngai, Y. Hu, Y. H. Wong, Y. Chen, and X. Sun, “The application
of data mining techniques in financial fraud detection: A classification
framework and an academic review of literature,” Decision Support
Systems, vol. 50, no. 3, pp. 559–569, 2011.
[13] R. J. Bolton and D. J. Hand, “Unsupervised profiling methods for fraud
detection,” in Proc. Credit Scoring and Credit Control VII, London,
2001, pp. 5–7.
[14] T. Fawcett and F. Provost, “Adaptive fraud detection,” Data Mining and
Knowledge Discovery, vol. 1, no. 3, pp. 291–316, 1997.
[15] P. Flach, Machine Learning: The Art and Science of Algorithms that
Make Sense of Data. Cambridge: University Press, 2012.
[16] R. E. Shiffler, “Maximum z scores and outliers,” The American
Statistician, vol. 42, no. 1, pp. 79–80, 1988.
[17] E. Acuna and C. Rodriguez, “A meta analysis study of outlier detection
methods in classification,” University of Puerto Rico, Mayaguez, 2004.
[18] J. W. Tukey, Exploratory Data Analysis. Addison-Wesley, 1977.
[19] D. C. Hoaglin, F. Mosteller, and J. W. Tukey, Understanding Robust
and Exploratory Data Analysis. New York: John Wiley and Sons, 2000.
[20] J. Han and M. Kamber, Data Mining Concepts and Techniques. Morgan
Kaufmann, 3rd ed. MA: Elsevier, 2006.
[21] O. Maimon and L. Rokach, Data Mining and Knowledge Discovery
Handbook, 2nd ed. New York: Springer, 2005.
[22] M. Juhola, J. Laurikkala, and E. Kentala, “Informal identification of
outliers in medical data,” in 5th Int. Workshop on Intelligent Data
Analysis Medicine and Pharmacology, 2000.
[23] KPMG, “KPMG 2006 fraud survey,” Australia, 2006.
[24] MATLAB, “K-means clustering,” 2010.
[25] S. H. Oh and W. S. Lee, “An anomaly intrusion detection method by
clustering normal user behavior,” Computers and Security, vol. 22, no.
7, pp.596–612, 2003.