Fraud Detection in Python Chapter4

This document discusses techniques for fraud detection using text data in Python. It covers word search, sentiment analysis, word frequencies, topic modeling, and using topic modeling results for fraud detection. Specific techniques covered include word search flags, word counts using Pandas, text preprocessing including tokenization and removing stopwords, topic modeling using LDA, and flagging fraud based on topic similarities between fraudulent and non-fraudulent cases.

DataCamp Fraud Detection in Python

FRAUD DETECTION IN PYTHON

Using text data to detect fraud

Charlotte Werger
Data Scientist

You will often encounter text data during fraud detection

Types of useful text data:

1. Emails from employees and/or clients

2. Transaction descriptions
3. Employee notes
4. Insurance claim form description box
5. Recorded telephone conversations
6. ...

Text mining techniques for fraud detection

1. Word search
2. Sentiment analysis
3. Word frequencies and topic analysis
4. Style

Word search for fraud detection

Flagging suspicious words:

1. Simple, straightforward and easy to explain
2. Match results can be used as a filter on top of a machine learning model
3. Match results can be used as a feature in a machine learning model

Word counts to flag fraud with pandas

import numpy as np

# Using a string operator to find words
df['email_body'].str.contains('money laundering')

# Select data that matches (na=False treats missing email bodies as non-matches)
df.loc[df['email_body'].str.contains('money laundering', na=False)]

# Create a list of words to search for
list_of_words = ['police', 'money laundering']
df.loc[df['email_body'].str.contains('|'.join(list_of_words), na=False)]

# Create a fraud flag: 1 if any word in the list occurs, else 0
df['flag'] = np.where(df['email_body'].str.contains('|'.join(list_of_words), na=False), 1, 0)
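
The word-search flag above can be tried on a small table. The sketch below is one possible variant, not the course's own code: the toy data is hypothetical, and escaping the terms, adding word boundaries and ignoring case are extra assumptions that make the match a little stricter than a plain `'|'.join(...)` pattern.

```python
import re
import pandas as pd

# Hypothetical toy data; the column name mirrors the slides.
df = pd.DataFrame({
    "email_body": [
        "Please call the POLICE about this.",
        "Lunch at noon?",
        None,
        "This looks like money laundering to me.",
    ]
})

list_of_words = ["police", "money laundering"]

# Escape each term so regex metacharacters are matched literally, and add
# word boundaries so that 'police' does not match inside 'policeman'.
pattern = r"\b(?:" + "|".join(re.escape(w) for w in list_of_words) + r")\b"

# case=False ignores capitalisation; na=False treats missing bodies as no match.
df["flag"] = (
    df["email_body"]
    .str.contains(pattern, case=False, na=False, regex=True)
    .astype(int)
)

print(df["flag"].tolist())
```

The resulting `flag` column can then be used either as a filter on top of model output or as an input feature, as listed on the word search slide.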
Text mining techniques for fraud detection


Cleaning your text data

Must-dos when working with textual data:

1. Tokenization

2. Remove all stopwords

3. Lemmatize your words

4. Stem your words


Go from this... to this...

[Figures: example text before and after cleaning]

Data preprocessing part 1

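# 1. Tokenization
import re
from nltk import word_tokenize

# Tokenize every email in the dataframe
tokens = df.apply(lambda row: word_tokenize(row["email_body"]), axis=1)

# Clean one raw email string (taking the first row here is an assumption)
text = df["email_body"].iloc[0]
text = text.rstrip()
text = re.sub(r'[^a-zA-Z]', ' ', text)

# 2. Remove all stopwords and punctuation
from nltk.corpus import stopwords
import string

exclude = set(string.punctuation)
stop = set(stopwords.words('english'))
# Lowercase and split into words so each word is compared with the stopword list
stop_free = " ".join([word for word in text.lower().split()
                      if (word not in stop) and (not word.isdigit())])
# Drop any remaining punctuation characters
punc_free = ''.join(ch for ch in stop_free if ch not in exclude)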

Data preprocessing part 2

# Lemmatize words
from nltk.stem.wordnet import WordNetLemmatizer
lemma = WordNetLemmatizer()
normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())

# Stem words
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
cleaned_text = " ".join(porter.stem(token) for token in normalized.split())

print(cleaned_text)

Example of cleaned tokens for one email (words such as 'going' and 'joined' are unstemmed, so this list appears to show the text before the stemming step):

['philip','going','street','curious','hear','perspective','may','wish',
'offer','trading','floor','enron','stock','lower','joined','company',
'business','school','imagine','quite','happy','people','day','relate',
'somewhat','stock','around','fact','broke','day','ago','knowing',
'imagine','letting','event','get','much','taken','similar',
'problem','hope','everything','else','going','well','family','knee',
'surgery','yet','give','call','chance','later']
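
The cleaning steps can also be wrapped in one helper so that every email goes through the same pipeline. The sketch below uses only the standard library and covers lowercasing, tokenization, and stopword and digit removal. The tiny stopword set and the `clean` function name are assumptions for illustration; the slides use NLTK's English stopword list, and lemmatization and stemming are left out here because they need NLTK resources.

```python
import re

# Tiny illustrative stopword set (an assumption; the slides use NLTK's list).
STOP = {"the", "a", "an", "and", "to", "of", "is", "in", "on", "for", "at", "i", "you"}

def clean(text, stop=STOP):
    """Lowercase the text, keep alphabetic tokens only, and drop stopwords."""
    text = text.rstrip().lower()
    # Runs of letters become tokens; digits and punctuation are discarded.
    tokens = re.findall(r"[a-z]+", text)
    return [tok for tok in tokens if tok not in stop]

cleaned = clean("The invoice for 2 shipments is overdue, call me at 5!  ")
print(cleaned)
```

Returning a list of tokens per email, rather than one joined string, matches the input that the bag-of-words step expects.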
Topic modelling


Topic modelling: discover hidden patterns in text data

1. Discovering topics in text data
2. "What is the text about?"
3. Conceptually similar to clustering data
4. Compare topics of fraud cases to non-fraud cases and use as a feature or flag
5. Or: is there a particular topic in the data that seems to point to fraud?

Latent Dirichlet Allocation (LDA)

With LDA you obtain:

1. "topics per text item" model (i.e. probabilities)

2. "words per topic" model

Creating your own topic model:

1. Clean your data

2. Create a bag of words with dictionary and corpus
3. Feed dictionary and corpus into the LDA model

Bag of words: dictionary and corpus

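from gensim import corpora

# Create a dictionary that maps each unique word to an integer id
dictionary = corpora.Dictionary(cleaned_emails)

# Filter out rare and very frequent words: drop words that appear in fewer
# than 5 documents, then keep at most the 50,000 most frequent ones
# (gensim's default no_above=0.5 also drops words found in over half the documents)
dictionary.filter_extremes(no_below=5, keep_n=50000)

# Create corpus: each email becomes a list of (word id, count) pairs
corpus = [dictionary.doc2bow(text) for text in cleaned_emails]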
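
To make the dictionary and corpus structure concrete, here is a plain-Python sketch of a bag of words. It only mimics the `(word id, count)` pairs that a bag-of-words corpus holds and is not gensim's implementation; the toy `cleaned_emails` data and the `to_bow` helper are hypothetical.

```python
from collections import Counter

# Hypothetical cleaned emails: one list of tokens per email.
cleaned_emails = [
    ["invoice", "send", "invoice", "email"],
    ["price", "sell", "price"],
    ["email", "send"],
]

# "Dictionary": give each distinct token an integer id, in order of first appearance.
token2id = {}
for doc in cleaned_emails:
    for tok in doc:
        token2id.setdefault(tok, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[tok] for tok in doc)
    return sorted(counts.items())

# "Corpus": one bag of words per email.
corpus = [to_bow(doc) for doc in cleaned_emails]

print(token2id)
print(corpus)
```

Word order is discarded in this representation; only which words occur and how often is passed on to the topic model.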

Latent Dirichlet Allocation (LDA) with gensim

import gensim

# Define the LDA model
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=3,
                                           id2word=dictionary, passes=15)

# Print the three topics from the model with top words
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)

(0, '0.029*"email" + 0.016*"send" + 0.016*"results" + 0.016*"invoice"')
(1, '0.026*"price" + 0.026*"work" + 0.026*"management" + 0.026*"sell"')
(2, '0.029*"distribute" + 0.029*"contact" + 0.016*"supply" + 0.016*"fast"')
Flagging fraud based on topics


Using your LDA model results for fraud detection

1. Are there any suspicious topics? (no labels)
2. Are the topics in fraud and non-fraud cases similar? (with labels)
3. Are fraud cases associated more with certain topics? (with labels)
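
When labels are available, the third question can be checked by comparing the fraud rate per dominant topic. The sketch below is one possible approach under stated assumptions: the toy data, the `fraud` column and the 0.5 threshold are all hypothetical, and in practice the dominant topics would come from the LDA model.

```python
import pandas as pd

# Hypothetical dominant topic per document, with a known fraud label.
topic_details = pd.DataFrame({
    "Dominant_Topic": [0, 0, 1, 2, 2, 2, 1, 0],
    "fraud":          [0, 0, 0, 1, 1, 0, 0, 0],
})

# Share of fraud cases within each dominant topic.
fraud_rate = topic_details.groupby("Dominant_Topic")["fraud"].mean()
print(fraud_rate)

# Treat topics with a fraud rate above a chosen threshold as suspicious.
threshold = 0.5  # assumption: tune this on your own data
suspicious_topics = fraud_rate[fraud_rate > threshold].index

# Flag every document whose dominant topic is suspicious.
topic_details["flag"] = (
    topic_details["Dominant_Topic"].isin(suspicious_topics).astype(int)
)
print(topic_details)
```

Such a flag is best read as a lead for further investigation: a document can have a suspicious dominant topic without being fraudulent, as the unlabelled-as-fraud row in topic 2 shows.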

To understand topics, you need to visualize

# Note: pyLDAvis 3.3 and later rename this module to pyLDAvis.gensim_models
import pyLDAvis.gensim

lda_display = pyLDAvis.gensim.prepare(ldamodel, corpus,
                                      dictionary, sort_topics=False)

pyLDAvis.display(lda_display)

Inspecting how topics differ

Assign topics to your original data

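import pandas as pd

def get_topic_details(ldamodel, corpus):
    # Collect rows in a list (DataFrame.append was removed in pandas 2.0)
    rows = []
    for i, row in enumerate(ldamodel[corpus]):
        # Sort this document's topics by probability, highest first
        row = sorted(row, key=lambda x: (x[1]), reverse=True)
        for j, (topic_num, prop_topic) in enumerate(row):
            if j == 0:  # => dominant topic
                wp = ldamodel.show_topic(topic_num)  # top words of the topic (not used further)
                # The slide truncates this line after "pd.Series([topic";
                # storing topic_num and prop_topic is inferred from the column names below
                rows.append([topic_num, prop_topic])
    topic_details_df = pd.DataFrame(rows, columns=['Dominant_Topic', '% Score'])
    return topic_details_df

contents = pd.DataFrame({'Original text': text_clean})

topic_details = pd.concat([get_topic_details(ldamodel, corpus), contents], axis=1)
topic_details.head()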

   Dominant_Topic   % Score   Original text
0  0.0              0.989108  [investools, advisory, free, ...
1  0.0              0.993513  [forwarded, richard, b, ...
2  1.0              0.964858  [hey, wearing, target, purple, ...
3  0.0              0.989241  [leslie, milosevich, santa, clara, ...
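
The dominant-topic step boils down to taking, for each document, the `(topic id, probability)` pair with the highest probability. A minimal standard-library sketch of that selection is shown below; the `doc_topics` values are hypothetical stand-ins for what a topic model would return per document.

```python
# Hypothetical per-document topic distributions as (topic_id, probability) pairs.
doc_topics = [
    [(0, 0.989), (1, 0.006), (2, 0.005)],
    [(0, 0.020), (1, 0.965), (2, 0.015)],
]

# For each document, keep the pair with the highest probability.
dominant = [max(row, key=lambda pair: pair[1]) for row in doc_topics]

for doc_id, (topic_id, score) in enumerate(dominant):
    print(doc_id, topic_id, score)
```

Using `max` avoids sorting the whole row when only the top topic is needed.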
Fraud detection in Python Recap


Working with imbalanced data

Worked with highly imbalanced fraud data
Learned how to resample your data
Learned about different resampling methods

Fraud detection with labeled data

Refreshed supervised learning techniques to detect fraud
Learned how to get reliable performance metrics and worked with the precision-recall trade-off
Explored how to optimise your model parameters to handle fraud data
Applied ensemble methods to fraud detection

Fraud detection without labels

Learned about the importance of segmentation
Refreshed your knowledge on clustering methods
Learned how to detect fraud using outliers and small clusters with K-means clustering
Applied a DBSCAN clustering model for fraud detection

Text mining for fraud detection

Learned how to augment fraud detection analysis with text mining techniques
Applied word searches to flag use of certain words, and learned how to apply topic modelling for fraud detection
Learned how to effectively clean messy text data

Further learning for fraud detection

Network analysis to detect fraud
Different supervised and unsupervised learning techniques (e.g. neural networks)
Working with very large data
End of this course
