Anomaly/Novelty Detection
with scikit-learn
Alexandre Gramfort
Telecom ParisTech - CNRS LTCI
alexandre.gramfort@telecom-paristech.fr
GitHub : @agramfort Twitter : @agramfort
Alexandre Gramfort Anomaly detection with scikit-learn
What’s the problem?
Objective: Spot the red apple
Outlier/Anomaly
“An outlier is an observation in a data set which appears to be inconsistent with the remainder of that set of data.” (Johnson 1992)
“An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.” (Hawkins 1980)
Types of AD
• Supervised AD
    • Labels available for both normal data and anomalies
    • Similar to rare class mining / imbalanced classification
• Semi-supervised AD (Novelty Detection)
    • Only normal data available to train
    • The algorithm learns on normal data only
• Unsupervised AD (Outlier Detection)
    • No labels, training set = normal + abnormal data
    • Assumption: anomalies are very rare
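To make the difference between the semi-supervised and unsupervised settings concrete, here is a minimal sketch. The data is synthetic and hypothetical, and the choice of `LocalOutlierFactor` for both settings is an assumption made for illustration, not something the slides prescribe.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
# Hypothetical data: a "normal" cloud around the origin and a few far-away anomalies.
X_normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X_abnormal = rng.uniform(low=6.0, high=8.0, size=(10, 2))

# Semi-supervised AD (novelty detection): train on normal data only,
# then label observations that were not seen during training.
novelty = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_normal)
pred_new = novelty.predict(X_abnormal)   # +1 = inlier, -1 = novelty

# Unsupervised AD (outlier detection): training set = normal + abnormal, no labels.
# The estimator labels the training samples themselves.
X_mixed = np.vstack([X_normal, X_abnormal])
outlier = LocalOutlierFactor(n_neighbors=20)
pred_train = outlier.fit_predict(X_mixed)  # +1 = inlier, -1 = outlier
```

In scikit-learn, `LocalOutlierFactor` only exposes `predict` on new data when it is created with `novelty=True`; in the default outlier-detection mode, `fit_predict` is the way to get labels.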
ML Taxonomy
Machine Learning
• Supervised (“Prediction”)
    • Regression
    • Classification
• Unsupervised
    • Clustering
    • Dimensionality reduction
    • Anomaly / Novelty Detection
Applications
• Fraud detection
• Network intrusion
• Finance
• Insurance
• Maintenance
• Medicine (unusual symptoms)
• Measurement errors (from sensors)
Any application where looking at unusual observations is relevant
Big picture
• Look for samples that are in low density regions, isolated
• Look for a region of the space that is small in volume but contains most of the samples
• Density-based approach (KDE, Gaussian Ellipse, GMM)
• Kernel methods
• Nearest neighbors
• Trees / Partitioning
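The "Gaussian Ellipse" entry has no code on the slides. One way to sketch it, assuming it corresponds to scikit-learn's `EllipticEnvelope` (a robust covariance fit), is shown below. The data and the `contamination` value are hypothetical.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(42)
# Hypothetical training data: one correlated 2-D Gaussian cloud.
X_train = rng.multivariate_normal(
    mean=[0.0, 0.0], cov=[[1.0, 0.6], [0.6, 1.0]], size=300
)

# Fit a robust Gaussian ellipse; contamination is the assumed share of outliers.
env = EllipticEnvelope(contamination=0.05, random_state=0).fit(X_train)

X_test = np.array([[0.0, 0.0], [6.0, -6.0]])
labels = env.predict(X_test)     # +1 = inside the ellipse, -1 = outside
dist = env.mahalanobis(X_test)   # squared Mahalanobis distance to the fitted center
```

The second test point lies against the direction of correlation, so its Mahalanobis distance is large even though each coordinate taken alone is only a few standard deviations from the mean.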
Novelty detection via density estimates
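>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from sklearn.mixture import GaussianMixture
>>> # Fit a single Gaussian to the training data X, then evaluate the log-density on the grid X_plot
>>> gmm = GaussianMixture(n_components=1).fit(X)
>>> log_dens = gmm.score_samples(X_plot)
>>> plt.fill(X_plot[:, 0], np.exp(log_dens), fc='#ffaf00', alpha=0.7)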
Low density regions
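A density estimate becomes a detector once a threshold is placed on it. The sketch below flags new points whose log-density falls below a low quantile of the training log-densities. The 5% quantile and the synthetic 1-D data are assumptions made for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# Hypothetical 1-D training data drawn from a standard normal.
X = rng.normal(loc=0.0, scale=1.0, size=(500, 1))

gmm = GaussianMixture(n_components=1, random_state=0).fit(X)

# Assumed rule: anything less dense than the lowest 5% of training points is a novelty.
threshold = np.percentile(gmm.score_samples(X), 5)

X_new = np.array([[0.1], [5.0]])
is_novelty = gmm.score_samples(X_new) < threshold
```

The quantile plays the role of an expected false-alarm rate on normal data, so it should be chosen with the application in mind.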
Novelty detection via density estimates
>>> from sklearn.mixture import GaussianMixture
>>> # Same data, now modelled with a mixture of two Gaussians
>>> gmm = GaussianMixture(n_components=2).fit(X)
>>> log_dens = gmm.score_samples(X_plot)
>>> plt.fill(X_plot[:, 0], np.exp(log_dens), fc='#ffaf00', alpha=0.7)
Novelty detection via density estimates
>>> from sklearn.neighbors import KernelDensity
>>> # Non-parametric alternative: Gaussian kernel density estimate with a fixed bandwidth
>>> kde = KernelDensity(kernel='gaussian', bandwidth=0.75).fit(X)
>>> log_dens = kde.score_samples(X_plot)
>>> plt.fill(X_plot[:, 0], np.exp(log_dens), fc='#ffaf00', alpha=0.7)
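The bandwidth of 0.75 above is fixed by hand. One possible way to choose it, sketched here on hypothetical bimodal data, is to cross-validate the held-out log-likelihood with `GridSearchCV`, which uses `KernelDensity.score` by default. The bandwidth grid and the 5% threshold are assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

rng = np.random.RandomState(0)
# Hypothetical 1-D data with two modes.
X = np.concatenate([rng.normal(-2.0, 0.5, 150), rng.normal(3.0, 1.0, 150)])[:, np.newaxis]
rng.shuffle(X)  # shuffle rows so that each CV fold sees both modes

# Pick the bandwidth that maximizes the held-out log-likelihood.
grid = GridSearchCV(
    KernelDensity(kernel="gaussian"),
    {"bandwidth": np.linspace(0.1, 1.5, 15)},
    cv=5,
)
grid.fit(X)
kde = grid.best_estimator_

# Flag new points that are less dense than the lowest 5% of training points.
threshold = np.percentile(kde.score_samples(X), 5)
X_new = np.array([[-2.0], [10.0]])
is_novelty = kde.score_samples(X_new) < threshold
```

This only tunes the density fit. It does not tell you whether the resulting detector finds the anomalies you care about, which still requires some labeled examples.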
Our toy dataset
Kernel Approach
http://scikit-learn.org/stable/auto_examples/svm/plot_oneclass.html
>>> from sklearn.svm import OneClassSVM
>>> est = OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)
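A minimal sketch of how this estimator might be used end to end. The two-cluster training set is hypothetical and loosely modelled on the linked scikit-learn example. The hyperparameters are the ones from the slide.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
# Hypothetical "normal" data: two tight clusters around (2, 2) and (-2, -2).
X_base = 0.3 * rng.randn(100, 2)
X_train = np.r_[X_base + 2, X_base - 2]

est = OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1).fit(X_train)

# nu roughly bounds the fraction of training points left outside the frontier.
frac_flagged = np.mean(est.predict(X_train) == -1)

X_new = np.array([[2.0, 2.0], [-6.0, 6.0]])
new_pred = est.predict(X_new)          # +1 = inside the frontier, -1 = novelty
scores = est.decision_function(X_new)  # signed distance to the frontier, negative = novelty
```

`gamma` sets the width of the RBF kernel: larger values give a tighter, more wiggly frontier around the training points.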
Nearest Neighbors (NN) Approach
https://github.com/scikit-learn/scikit-learn/pull/5279
>>> from sklearn.neighbors import LocalOutlierFactor
>>> est = LocalOutlierFactor(n_neighbors=5)
Local Outlier Factor (LOF)
https://en.wikipedia.org/wiki/Local_outlier_factor
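LOF compares the local density around a sample with the local density around its neighbors; a ratio well above 1 means the sample is much more isolated than they are. A small sketch on hypothetical data, with one isolated point appended to a dense cluster:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
# Hypothetical data: 100 points in a tight cluster plus one isolated point at (4, 4).
X = np.vstack([0.3 * rng.randn(100, 2), [[4.0, 4.0]]])

est = LocalOutlierFactor(n_neighbors=5)
labels = est.fit_predict(X)           # +1 = inlier, -1 = outlier
lof = -est.negative_outlier_factor_   # LOF values: around 1 for inliers, larger for outliers
```

scikit-learn stores the opposite of the LOF in `negative_outlier_factor_` so that, as with its other detectors, larger values mean "more normal".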
Partitioning / Tree Approach
http://scikit-learn.org/dev/modules/generated/sklearn.ensemble.IsolationForest.html
>>> from sklearn.ensemble import IsolationForest
>>> est = IsolationForest(n_estimators=100)
Isolation Forest
An anomaly can be isolated with a very shallow random tree
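The idea can be checked on a toy case: a point far from the bulk of the data is separated after very few random splits, so its average path length is short and its score is low. The data below is hypothetical.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
# Hypothetical data: a tight cluster of 256 points plus one isolated point at (4, 4).
X = np.vstack([0.3 * rng.randn(256, 2), [[4.0, 4.0]]])

est = IsolationForest(n_estimators=100, random_state=0).fit(X)
scores = est.score_samples(X)  # lower = shorter average path = more abnormal
labels = est.predict(X)        # +1 = inlier, -1 = outlier
```

Because each tree is built on a random subsample with random splits, results vary between runs unless `random_state` is fixed.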
http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
Network intrusions
https://github.com/scikit-learn/scikit-learn/blob/master/benchmarks/bench_isolation_forest.py
Some caveats
• How to set model hyperparameters?
• How to evaluate performance in the unsupervised setup?
• Every AD method relies on a notion of metric/similarity between samples, e.g. Euclidean distance. It is unclear how to define it (think of mixed continuous and categorical features, etc.)
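On the evaluation question, one common workaround is to hold out a small labeled set and score the detector with a ranking metric such as ROC AUC. The sketch below assumes such labels exist. The data is synthetic, and the labels are used for evaluation only, never for fitting.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
# Hypothetical data: 500 normal points and 25 anomalies spread over a wider area.
X_normal = rng.normal(0.0, 1.0, size=(500, 2))
X_abnormal = rng.uniform(-6.0, 6.0, size=(25, 2))
X = np.vstack([X_normal, X_abnormal])
y = np.r_[np.zeros(500), np.ones(25)]  # 1 = anomaly; evaluation only

# Unsupervised fit: the labels y are not passed to the estimator.
est = IsolationForest(n_estimators=100, random_state=0).fit(X)

# score_samples is high for normal points, so negate it to get an anomaly score.
anomaly_score = -est.score_samples(X)
auc = roc_auc_score(y, anomaly_score)
```

ROC AUC only measures how well the scores rank anomalies above normal points, so it avoids having to fix a decision threshold. The same labeled set can be used to compare hyperparameter settings.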
http://scikit-learn.org/stable/modules/outlier_detection.html
Alexandre Gramfort
Contact: alexandre.gramfort@telecom-paristech.fr
GitHub : @agramfort Twitter : @agramfort
Questions?
1 position to work on scikit-learn and the SciPy stack available!
Thanks @ngoix & @albertthomas88 for the work
