Fraud Detection in Python, Chapter 3

This document discusses using unsupervised learning techniques like clustering algorithms to perform fraud detection when labeled data is unavailable. It describes how to use k-means clustering and identify normal versus abnormal behavior by segmenting customers into groups and flagging transactions far from the cluster centroids as potentially fraudulent. The document also introduces other clustering methods like DBSCAN and discusses validating potential fraud cases with domain experts.


DataCamp Fraud Detection in Python

FRAUD DETECTION IN PYTHON

Normal versus abnormal behaviour

Charlotte Werger
Data Scientist

Fraud detection without labels


Using unsupervised learning to distinguish normal from abnormal behaviour
Abnormal behaviour is, by definition, not always fraudulent
Challenging because the results are difficult to validate
But realistic, because very often you don't have reliable labels

What is normal behaviour?


Thoroughly describe your data: plot histograms, check for outliers, investigate correlations and talk to the fraud analyst
Are there any known historic cases of fraud? What typifies those cases?
Normal behaviour of one type of client may not be normal for another
Check patterns within subgroups of data: is your data homogeneous?
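
The subgroup check above can be sketched as follows. This is a minimal illustration only: the DataFrame, its `segment` and `amount` columns, and the 3-standard-deviation cut-off are assumptions made for the example, not part of the course data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two client segments with very different spending levels
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "segment": ["retail"] * 200 + ["business"] * 200,
    "amount": np.concatenate([rng.normal(50, 10, 200),
                              rng.normal(5000, 800, 200)]),
})

# Describe the data per segment rather than as one homogeneous block
summary = df.groupby("segment")["amount"].describe()
print(summary)

# Standardise each amount against its own segment's mean and spread
z = df.groupby("segment")["amount"].transform(lambda s: (s - s.mean()) / s.std())

# Flag values far from what is normal for that segment (assumed cut-off: 3)
df["outlier_in_segment"] = z.abs() > 3
```

An amount of 4,000 would look extreme against the whole data set but is ordinary within the assumed business segment, which is why the comparison is made per segment.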

Customer segmentation: normal behaviour within segments

Let's practice!

Refresher on clustering methods


Clustering: trying to detect patterns in data



K-means clustering: using the distance to cluster centroids



K-means clustering in Python


# Import the packages
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

# Transform and scale your data (df is the DataFrame holding the features)
X = np.array(df).astype(float)
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Define the k-means model and fit to the data
kmeans = KMeans(n_clusters=6, random_state=42).fit(X_scaled)

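The right number of clusters

Checking the number of clusters:

Silhouette method
Elbow curve

import matplotlib.pyplot as plt

# Fit k-means for 1 to 9 clusters
clust = range(1, 10)
kmeans = [KMeans(n_clusters=i) for i in clust]

# score() returns the negative within-cluster sum of squares for each model
score = [kmeans[i].fit(X_scaled).score(X_scaled) for i in range(len(kmeans))]

# Plot the score against the number of clusters
plt.plot(clust, score)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Curve')
plt.show()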
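
The silhouette method is listed above but not shown in code. A minimal sketch, assuming synthetic blob data stands in for `X_scaled` and that the candidate range of 2 to 6 clusters is an arbitrary choice made for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the scaled features: three well-separated groups
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 0], [3, 5]],
                  cluster_std=0.5, random_state=42)
X_scaled = MinMaxScaler().fit_transform(X)

# The silhouette score needs at least 2 clusters, so start at k=2
sil_scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    sil_scores[k] = silhouette_score(X_scaled, labels)

# Pick the k with the highest average silhouette
best_k = max(sil_scores, key=sil_scores.get)
print(sil_scores, best_k)
```

Unlike the elbow curve, which is read by eye, the silhouette score gives a single number per candidate, so the highest value can be picked programmatically. On real fraud data the maximum is usually less clear-cut than on these synthetic blobs.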

The Elbow Curve



Let's practice!

Assigning fraud versus non-fraud cases


Starting with clustered data

Assign the cluster centroids

Define distances from the cluster centroid

Flag fraud for those furthest away from cluster centroid

Flagging fraud based on distance to centroid


# Run the kmeans model on scaled data
kmeans = KMeans(n_clusters=6, random_state=42).fit(X_scaled)

# Get the cluster number for each datapoint
X_clusters = kmeans.predict(X_scaled)

# Save the cluster centroids
X_clusters_centers = kmeans.cluster_centers_

# Calculate the distance to the cluster centroid for each point
dist = np.array([np.linalg.norm(x - y)
                 for x, y in zip(X_scaled, X_clusters_centers[X_clusters])])

# Create predictions based on distance: flag the 7% of points furthest from their centroid
threshold = np.percentile(dist, 93)
km_y_pred = (dist >= threshold).astype(int)
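
The distance calculation can also be written without a Python loop. The sketch below is one possible alternative, not the course's method: it relies on `KMeans.transform`, which returns each point's distance to every centroid, so the row-wise minimum is the distance to the assigned centroid. The blob data is a synthetic stand-in assumed for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the scaled feature matrix (assumption)
X, _ = make_blobs(n_samples=1000, centers=6, random_state=42)
X_scaled = MinMaxScaler().fit_transform(X)

kmeans = KMeans(n_clusters=6, n_init=10, random_state=42).fit(X_scaled)
X_clusters = kmeans.predict(X_scaled)

# Loop version: distance of each point to its own cluster centroid
dist_loop = np.array([np.linalg.norm(x - y)
                      for x, y in zip(X_scaled, kmeans.cluster_centers_[X_clusters])])

# Vectorised version: distances to all centroids, keep the smallest per row
dist_vec = kmeans.transform(X_scaled).min(axis=1)

# Flag the points at or above the 93rd percentile of distance
km_y_pred = (dist_vec >= np.percentile(dist_vec, 93)).astype(int)
print(km_y_pred.sum(), "points flagged out of", len(km_y_pred))
```

The 93rd percentile is the same cut-off used above. In practice it is a tuning choice that sets how many cases are passed on for review.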

Validating your model results


Check with the fraud analyst
Investigate and describe cases that are flagged in more detail
Compare to past known cases of fraud
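
Where a handful of historic fraud cases are known, the comparison can be made concrete with a confusion matrix. A minimal sketch, assuming two small hand-made arrays: `km_y_pred` stands in for the model's flags and `y_known` for hypothetical historic labels.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical model flags (1 = flagged as suspicious)
km_y_pred = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])
# Hypothetical known outcomes for the same cases (1 = confirmed fraud)
y_known = np.array([0, 0, 1, 0, 0, 0, 0, 1, 1, 0])

# Rows are the known labels, columns are the model's flags
tn, fp, fn, tp = confusion_matrix(y_known, km_y_pred).ravel()

precision = tp / (tp + fp)  # share of flagged cases that were real fraud
recall = tp / (tp + fn)     # share of real fraud cases that were flagged
print(f"TP={tp} FP={fp} FN={fn} TN={tn} "
      f"precision={precision:.2f} recall={recall:.2f}")
```

Because known cases are usually few and incomplete, these figures are a sanity check rather than a full evaluation. The flagged cases still need review by the fraud analyst.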

Let's practice!

Other clustering fraud detection methods


There are many different clustering methods

And different ways of flagging fraud: using smallest clusters

In reality it looks more like this

DBSCAN versus k-means


No need to predefine the number of clusters
Adjust the maximum distance between points within clusters
Assign the minimum number of samples in clusters
Better performance on irregularly shaped data
But higher computational costs
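
The point about irregularly shaped data can be illustrated on a toy example. The sketch below assumes scikit-learn's two-moons data set and hand-picked DBSCAN settings (`eps=0.2`, `min_samples=5`). It is an illustration of the difference, not a benchmark.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: clusters that are not round blobs
X, y_true = make_moons(n_samples=400, noise=0.05, random_state=0)

# k-means needs the number of clusters up front
km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# DBSCAN only needs the neighbourhood radius and the minimum samples
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Number of clusters DBSCAN found (label -1 marks noise, not a cluster)
n_clusters_db = len(set(db_labels)) - (1 if -1 in db_labels else 0)

# Agreement with the true moon membership (1.0 = perfect match)
ari_km = adjusted_rand_score(y_true, km_labels)
ari_db = adjusted_rand_score(y_true, db_labels)
print(n_clusters_db, round(ari_km, 2), round(ari_db, 2))
```

k-means splits the plane by distance to two centroids and so cuts across both moons. DBSCAN follows the dense chains of points instead, which is why it copes better with shapes like these.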

Implementing DBSCAN

from sklearn.cluster import DBSCAN
db = DBSCAN(eps=0.5, min_samples=10, n_jobs=-1).fit(X_scaled)

# Get the cluster labels (aka numbers)
pred_labels = db.labels_

# Count the total number of clusters (label -1 marks noise, not a cluster)
n_clusters_ = len(set(pred_labels)) - (1 if -1 in pred_labels else 0)

# Print model results
print('Estimated number of clusters: %d' % n_clusters_)

Estimated number of clusters: 31


Checking the size of the clusters


import numpy as np
from sklearn import metrics

# Print model results
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X_scaled, pred_labels))

Silhouette Coefficient: 0.359

# Get sample counts in each cluster (noise points, labelled -1, are excluded)
counts = np.bincount(pred_labels[pred_labels >= 0])
print(counts)

[ 763  496  840  355 1086  676   63  306  560  134   28   18  262  128  332
   22   22   13   31   38   36   28   14   12   30   10   11   10   21   10
    5]
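
Once the cluster sizes are known, the smallest clusters can be flagged as suspicious. A minimal sketch, assuming a small hand-made `pred_labels` array and an arbitrary choice of flagging the two smallest clusters; how many clusters to flag is a judgement call that the source does not specify.

```python
import numpy as np

# Hypothetical DBSCAN labels: four clusters plus five noise points (-1)
pred_labels = np.array([0] * 50 + [1] * 30 + [2] * 3 + [3] * 2 + [-1] * 5)

# Sample counts per cluster, ignoring noise
counts = np.bincount(pred_labels[pred_labels >= 0])

# Cluster numbers sorted by size; keep the two smallest (assumed cut-off)
smallest_clusters = np.argsort(counts)[:2]

# Flag every point that belongs to one of the smallest clusters
flagged = np.isin(pred_labels, smallest_clusters).astype(int)
print(counts, smallest_clusters, flagged.sum())
```

Noise points are left unflagged in this sketch. Whether to treat them as suspicious as well is a separate decision to check with the fraud analyst.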

Let's practice!
