Outlier analysis for Temporal Datasets

Location:
QuantUniversity Meetup
July 11th 2016
Boston MA
Outlier Analysis for Temporal Datasets
2016 Copyright QuantUniversity LLC.
Presented By:
Sri Krishnamurthy, CFA, CAP
www.QuantUniversity.com
sri@quantuniversity.com

Slides and Code available at:
http://www.analyticscertificate.com/Anomaly/

• 6.30-7.15 – Anomaly Detection part II
• 7.15-8.00 - Azure ML Example
Agenda

- Analytics Advisory services
- Custom training programs
- Architecture assessments, advice and audits

• Founder of QuantUniversity LLC. and
www.analyticscertificate.com
• Advisory and Consultancy for Financial Analytics
• Prior Experience at MathWorks, Citigroup and
Endeca and 25+ financial services and energy
customers (Shell, Firstfuel Software etc.)
• Regular Columnist for the Wilmott Magazine
• Author of forthcoming book
“Financial Modeling: A case study approach”
published by Wiley
• Chartered Financial Analyst and Certified Analytics
Professional
• Teaches Analytics in the Babson College MBA
program and at Northeastern University, Boston
Sri Krishnamurthy
Founder and CEO

Quantitative Analytics and Big Data Analytics Onboarding
• Trained more than 500 students in
Quantitative methods, Data Science
and Big Data Technologies using
MATLAB, Python and R
• Launching the Analytics Certificate
Program later in Fall

(MATLAB version also available)

• July
▫ 11th : QuantUniversity’s 2nd meetup
- Topic : Quantitative methods topic : TBD
• August
▫ 1st and 2nd : 2-day workshop on Anomaly Detection
- Registration and pricing details at www.analyticscertificate.com/Anomaly
▫ 8th : QuantUniversity meetup
▫ 14-20th : ARPM in New York www.arpm.co
- QuantUniversity presenting on Model Risk on August 14th
▫ 18-21st : Big-data Bootcamp http://globalbigdataconference.com/68/boston/big-data-bootcamp/event.html
▫ Use promotional code SPEAKERREF to receive $200 discount on or before July
22nd
Events of Interest

• July
▫ Anomaly Detection Part II
• August
▫ Anomaly Detection Workshop
▫ Model Evaluation : Metrics, Scaling and Best Practices
• September
▫ What’s missing ? Best practices in missing data analysis
QuantUniversity’s Summer workshop series

What is anomaly detection?
• Anomalies or outliers are data points that appear to deviate
markedly from expected outputs.
• It is the process of finding patterns in data that don’t
conform to a prior expected behavior.

• Fraud Detection
• Stock market
• E-commerce
Examples

Part 1: Summary
We have covered Anomaly detection
Introduction
• Definition of anomaly detection and its importance in energy systems
• Different types of anomaly detection methods: statistical, graphical and machine
learning methods
Graphical approach
• Graphical methods use the boxplot, scatterplot, adjusted quantile plot and symbol
plot to show outliers visually
• The main assumption for applying graphical approaches is multivariate normality
• The Mahalanobis distance is mainly used to measure the distance of a point
from the center of a multivariate distribution
Statistical approach
• Statistical hypothesis tests include the Chi-square test and Grubbs' test
• Statistical methods may use either scores or p-values as the threshold to detect outliers
Machine learning approach
• Both supervised and unsupervised learning methods can be used for outlier detection
• Piecewise (segmented) regression can be used to identify outliers based on the
residuals for each segment
• In K-means clustering, outliers are defined as points that do not belong
to any cluster, are far away from the cluster centroids, or form sparse clusters

Anomaly Detection Part II : Dealing with Temporal Data
• In time series datasets, the assumption of temporal continuity plays
an important role in defining and detecting outliers.
• When analyzing a single time series, a lack of temporal continuity
with immediate neighbors signals an outlier. For example:
▫ A significant increase/decrease in value when compared with
immediate neighboring values. Example: Stock charts
• When analyzing multidimensional time series streams, temporal
continuity is much weaker. For example:
▫ Novel outliers that differ from aggregate trends. Example : Novel client
traffic from a new location in Google analytics

Point anomalies
• Points that lie outside the range of “normal” points

Contextual anomalies
• Time is a contextual attribute that
determines the position of an instance
on the entire sequence.
• A 145-point drop is not rare in itself,
but it is an anomaly if the drop happens
within a period of 3 minutes
Ref: http://www.bloomberg.com/news/articles/2013-04-23/fake-report-erasing-136-billion-shows-market-s-fragility?cmpid=yhoo
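
The contextual idea above (the same drop is normal over a long period but anomalous over a short one) can be sketched as follows. This is only an illustration: the function name `rapid_drops`, the window and threshold parameters, and the per-minute price levels are all assumptions made up for the example, not data from the referenced event.

```python
def rapid_drops(prices, window, threshold):
    """Return (peak_index, end_index, drop) triples where the value fell by
    at least `threshold` points within at most `window` time steps."""
    events = []
    for end in range(1, len(prices)):
        start = max(0, end - window)
        # Highest value seen in the look-back window before `end`.
        peak = max(range(start, end), key=lambda i: prices[i])
        drop = prices[peak] - prices[end]
        if drop >= threshold:
            events.append((peak, end, drop))
    return events

# Synthetic per-minute levels (hypothetical values, for illustration only).
slow = [14700 - 5 * i for i in range(30)]           # 145 points over 29 minutes
fast = [14700, 14700, 14650, 14600, 14555, 14690]   # 145 points within 3 minutes

slow_events = rapid_drops(slow, window=3, threshold=145)
fast_events = rapid_drops(fast, window=3, threshold=145)
print(slow_events, fast_events)
```

Both series lose 145 points, but only the fast one is reported, because time is used as the context in which the size of the drop is judged.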

Nuances in Time series analysis
• Time Series Analysis
▫ Numbers across time
▫ Example: Stock data
• Discrete sequences
▫ Labels across time
▫ Example: Log of client interactions
▫ http-web, buffer-overflow, http-web, http-web, smtp-mail, ftp, http-web,
ssh, smtp-mail, http-web, ssh, buffer-overflow, ftp, http-web, ftp,
smtp-mail, http-web

Collective anomalies
• Here, a collection of related data instances is anomalous with
respect to the entire data set
Ref: http://krebsonsecurity.com/2010/10/pill-gang-used-microsofts-network-to-attack-krebsonsecurity-com/

Challenges
• Defining what is normal and what isn’t

Challenges
• The notion of normal behavior keeps evolving

Challenges
• The magnitude of the anomaly may be different

Challenges
• Labels may not be available

Challenges
• Noise may manifest as anomalies and it may be difficult to identify
and remove.

Methods for Anomaly Detection
Univariate data
• Point outlier scenario:
▫ Statistical methods (ARIMA, Seasonal Hybrid ESD test method, E-Divisive with medians, LOESS regression)
▫ Data mining methods (Multilayer perceptron)
• Outlier subsequences scenario:
▫ Window-based method
▫ Distance-based methods (PAA, SAX and HOT SAX)
Multivariate data
• Statistical methods:
▫ Cook’s distance
▫ Bonferroni’s test
• Distance-based methods:
▫ Local Outlier Factor (LOF)
• Data mining methods:
▫ Clustering algorithms (Hierarchical and K-Means)

Methods for Anomaly Detection
Database time series (univariate and multivariate data)
• Density approach for principal components
• Graphical methods:
▫ Bivariate and functional bag plots
▫ Bivariate and functional HDR box plots
• Clustering methods:
▫ Euclidean, correlation, autocorrelation and Wavelet transform metrics
Censored survival data
• Statistical methods:
▫ Residual-based algorithm
▫ Scoring algorithm

• Point Outliers
▫ Prediction models
▫ Profile Similarity-based approaches and Deviants
• Subsequence Outliers
▫ Discord discovery
Single Time Series – Sample approaches

• Input: A time series t
• Output: Outlier points in t
Prediction Models: Compute outlier scores as deviation from
predicted value
• Median:
▫ Choose a window size k
▫ For each point i, compute the median of the values in the window from i-k to i+k
• Mean:
▫ Choose a window size k
▫ For each point i, compute the mean of the values in the window from i-k to i+k
Point outliers
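
The windowed median/mean approach above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `window_scores` is hypothetical, the score is taken to be the absolute deviation from the window statistic, and the point itself is excluded from its own window (the slide does not say whether it should be).

```python
from statistics import mean, median

def window_scores(series, k, center=median):
    """Outlier score per point: absolute deviation from the median (or mean)
    of its neighbors in the window [i-k, i+k], excluding the point itself."""
    scores = []
    for i, value in enumerate(series):
        lo, hi = max(0, i - k), min(len(series), i + k + 1)
        neighbors = series[lo:i] + series[i + 1:hi]
        scores.append(abs(value - center(neighbors)))
    return scores

# Synthetic series (made up for illustration) with one spike at index 4.
series = [10, 11, 10, 12, 50, 11, 10, 12, 11]
median_scores = window_scores(series, k=2)
mean_scores = window_scores(series, k=2, center=mean)
print(median_scores.index(max(median_scores)))
```

The median is less affected by the spike than the mean: with the mean, the neighbors of the spike also receive inflated scores because the spike sits inside their windows.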

• ARIMA framework
Point outliers : Prediction Models
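
As a rough illustration of prediction-based scoring in this family of models, the sketch below fits only an AR(1) model, which is the simplest special case of ARIMA (no differencing, no moving-average terms), and flags points whose one-step-ahead residual is large. It is a simplified stand-in, not a full ARIMA implementation; the function names, the 3-sigma rule and the synthetic series are assumptions.

```python
from statistics import mean, pstdev

def fit_ar1(series):
    """Least-squares fit of x[t] = c + phi * x[t-1]; returns (c, phi)."""
    prev, nxt = series[:-1], series[1:]
    m_prev, m_next = mean(prev), mean(nxt)
    cov = sum((a - m_prev) * (b - m_next) for a, b in zip(prev, nxt))
    var = sum((a - m_prev) ** 2 for a in prev)
    phi = cov / var
    return m_next - phi * m_prev, phi

def ar1_outliers(series, n_sigmas=3.0):
    """Indices whose one-step-ahead residual exceeds n_sigmas residual std devs."""
    c, phi = fit_ar1(series)
    residuals = [series[t] - (c + phi * series[t - 1])
                 for t in range(1, len(series))]
    sigma = pstdev(residuals)
    # residuals[0] belongs to series index 1, hence the +1 shift.
    return [t + 1 for t, r in enumerate(residuals) if abs(r) > n_sigmas * sigma]

# Synthetic trending series (made up for illustration) with a spike at index 20.
series = [100 + 0.5 * t for t in range(40)]
series[20] += 15
outliers = ar1_outliers(series)
print(outliers)
```

The point right after a spike may be flagged as well, because the return to the normal level is also poorly predicted from the spiked value. In practice a library ARIMA implementation fitted on clean history would replace the hand-rolled AR(1) fit.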

• Neural Networks
▫ MLP predictor
Point outliers : Prediction Models
[Figure: original data, fitted data and boundaries. Any data points that are beyond the boundaries are considered outliers.]

• Create a normal profile (for example, with an MLP or AR model) together with a
notion of variance
• Estimate the next point
• Compare realized value with the estimated point.
▫ If within band, normal
▫ Else, Outlier
Point outliers : Profile Similarity-Based Approach
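
The steps above can be sketched with a deliberately simple profile. Here the "normal profile" is just the running mean of the points accepted so far, and the band is a multiple of their running standard deviation; an MLP or AR predictor would take the place of the running mean in a fuller version. The function name, the warm-up length and the 3-sigma band are assumptions.

```python
from math import sqrt

def profile_outliers(series, warmup=5, n_sigmas=3.0):
    """Flag points outside mean +/- n_sigmas * std of the points accepted so far.
    `warmup` must be at least 2 so that the sample variance is defined."""
    n, avg, m2 = 0, 0.0, 0.0           # Welford running statistics
    flagged = []
    for t, value in enumerate(series):
        if n >= warmup:
            band = n_sigmas * sqrt(m2 / (n - 1))
            if abs(value - avg) > band:
                flagged.append(t)
                continue                # keep outliers out of the normal profile
        n += 1
        delta = value - avg
        avg += delta / n
        m2 += delta * (value - avg)
    return flagged

# Synthetic readings (made up for illustration) with one jump at index 7.
readings = [10.0, 10.2, 9.9, 10.1, 9.8, 10.0, 10.3, 14.0, 10.1, 9.9]
flagged = profile_outliers(readings)
print(flagged)
```

Skipping the update for flagged points is a design choice: it keeps a single extreme value from widening the band and masking later outliers, at the cost of adapting more slowly when the normal level genuinely shifts.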

• Find points in a given time series whose removal from the time
series results in a more succinct representation of the data
Point outliers : Deviant Approach
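
A heavily simplified sketch of the deviant idea follows. It assumes the "representation" is a single number (the mean of the series) and that succinctness is measured by the sum of squared errors of that representation; published deviant methods use richer representations such as histograms. The function names and the greedy selection are assumptions, not the method from the talk.

```python
def sse(values):
    """Error of representing `values` by their mean (sum of squared deviations)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def greedy_deviants(series, k):
    """Greedily pick k indices whose removal most reduces the representation error."""
    remaining = list(range(len(series)))
    deviants = []
    for _ in range(k):
        # Try removing each remaining point and keep the removal with lowest error.
        best = min(
            remaining,
            key=lambda i: sse([series[j] for j in remaining if j != i]),
        )
        deviants.append(best)
        remaining.remove(best)
    return deviants

# Synthetic series (made up for illustration) with two extreme points.
series = [5, 6, 5, 40, 6, 5, 4, -20, 5]
deviants = greedy_deviants(series, k=2)
print(deviants)
```

The greedy loop is quadratic per pick and is meant only to make the definition concrete on small inputs.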

• Input: A time series t
• Output: Outlier subsequences in t
• Problem: Given t and a subsequence length n, find the outlier subsequence D
(the discord) that has the largest distance to its nearest non-overlapping match
• In particular, given two subsequences of length n denoted by A = (a1
. . . an) and B = (b1 . . . bn), the Euclidean distance between them
can be computed as follows:
• $\mathrm{Dist}(A, B) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$
Subsequence outliers:
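
The problem statement above translates directly into a brute-force search. The sketch below is a minimal illustration: the function names are hypothetical, "non-overlapping" is taken to mean that the start positions differ by at least n, and the periodic test series is synthetic.

```python
from math import sqrt

def euclidean(a, b):
    """Euclidean distance between two equal-length subsequences."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_discord(series, n):
    """Return (start, distance) of the length-n subsequence whose nearest
    non-overlapping match is farthest away."""
    starts = range(len(series) - n + 1)
    best_start, best_dist = None, -1.0
    for p in starts:
        nearest = min(
            (euclidean(series[p:p + n], series[q:q + n])
             for q in starts if abs(p - q) >= n),
            default=None,
        )
        if nearest is not None and nearest > best_dist:
            best_start, best_dist = p, nearest
    return best_start, best_dist

# Synthetic periodic series (made up for illustration) with one distorted value.
series = [0, 1, 0, -1] * 6
series[13] = 3
start, dist = top_discord(series, n=4)
print(start, dist)
```

Every subsequence covering the distorted value has no exact match elsewhere, so the reported start falls among those windows, while undistorted windows have a nearest-match distance of zero. The double loop over all pairs of start positions is quadratic in the series length.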

• The standard way of discretizing the time series: Symbolic
Aggregate approXimation (SAX)
• The brute-force solution is to consider all possible subsequences and
compute the distance of each such subsequence to every other
non-overlapping subsequence.
• Several optimizations
▫ HOT SAX (Keogh, E., Lin, J., Fu, A., "HOT SAX: Efficiently finding the most
unusual time series subsequence", Proceedings of the Fifth IEEE
International Conference on Data Mining, ICDM '05)
SAX
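
A minimal sketch of the SAX discretization step is shown below: z-normalize the series, average it over equal-length segments (PAA), then map each segment mean to a letter using breakpoints that split the standard normal distribution into equally likely regions. The function name `sax`, the four-letter alphabet and the assumption that the series length divides evenly into segments are choices made for this example.

```python
from statistics import NormalDist, mean, pstdev

def sax(series, n_segments, alphabet="abcd"):
    """Convert a numeric series into a SAX word of length n_segments.
    Assumes a non-constant series whose length is a multiple of n_segments."""
    mu, sigma = mean(series), pstdev(series)
    z = [(v - mu) / sigma for v in series]                 # z-normalize
    seg = len(z) // n_segments
    paa = [mean(z[i * seg:(i + 1) * seg]) for i in range(n_segments)]
    # Breakpoints that cut N(0, 1) into len(alphabet) equiprobable regions.
    a = len(alphabet)
    cuts = [NormalDist().inv_cdf(j / a) for j in range(1, a)]
    # The letter index is the number of breakpoints the segment mean exceeds.
    return "".join(alphabet[sum(p > c for c in cuts)] for p in paa)

# Synthetic series (made up for illustration).
word = sax([1, 1, 4, 4, 6, 6, 9, 9], n_segments=4)
print(word)
```

Working on short words like this instead of raw values is what lets methods such as HOT SAX order the search so that most distance computations can be skipped.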

• Plotting the discords
Outlier subsequences (Distance based)
The top discord, which has the largest distance, is at the 411th time series point.

Summary
We have covered Anomaly detection
Univariate data
• Statistical methods (ARIMA, Seasonal Hybrid ESD test method, E-Divisive with medians (EDM) and LOESS
regression)
• Data mining methods (Multilayer perceptron)
• Outlier subsequences (Window-based and distance-based methods)
Multivariate data
• Cook’s distance
• Bonferroni’s test
• Local outlier factor (LOF)
• Hierarchical and K-means clustering outlier detection methods
Database time series
• Database time series definition
• Density approach for the first two principal component scores
• Bivariate and functional bag plots
• Bivariate and functional HDR box plots
• Clustering time series
Censored survival data
• Censored survival data definition
• Residual-based algorithm
• Scoring algorithm


Register here:
https://www.eventbrite.com/e/anomaly-detection-workshop-tickets-25910035614?ref=ebtnebtckt
Affiliate discount pricing for QuantUniversity Meetup members and Academics!
When: August 1st and 2nd
Where: 1 Roger St, Cambridge MA
(IBM’s offices)
Time: 9.00am-5.00pm

Q&A
Slides, code and details about the Anomaly detection workshop
at: http://www.analyticscertificate.com/Anomaly/

Thank you!
Members & Sponsors!
Sri Krishnamurthy, CFA, CAP
Founder and CEO
QuantUniversity LLC.
srikrishnamurthy
www.QuantUniversity.com
Contact
Information, data and drawings embodied in this presentation are strictly a property of QuantUniversity LLC. and shall not be
distributed or used in any other publication without the prior written consent of QuantUniversity LLC.
