Building a Real-Time Security Application Using Log Data and Machine Learning – Karthik Aaravabhoomi, Capital One
2. • More than 65 million customer accounts
• More than 44,000 associates
• Largest US direct bank
• 3rd largest independent auto loan originator
• 4th largest credit card issuer in the US
Capital One at a glance
3. • Overview of Cyber-Technology Data and Analytics Frameworks: motivation, vision, and roadmap
• Architecture overview
• Machine Learning use case
• Governance and Progression
• Key Benefits
The Focus of Today’s Discussion
Leveraging big data, we can create a single pane of glass and automate and enrich alerts to ease the burden on our analysts
Bad Actors Attack Capital One, and Our Tools Monitor and Generate Many Alerts in Disparate Tools for Our Analysts to Analyze
5. Security Analytics
“What threats are occurring in our environment and where do we need to take action to address bad actors?”
Sample Use Cases
• Malware using brute-force attempts to log in
• Accelerated malware detection following a watering hole attack
• Traffic to/from high-risk geo-locations
• Full assessment of a security breach, pulling together all relevant security and non-security events involved
• Evaluation of privileged user behavior to identify outliers from normal patterns
Primary Focus: Security

Technology Analytics
“What is the health of the Capital One environment and where do we see degradation in performance?”
Sample Use Cases
• Predict performance and workload profiles for complex multi-tenant environments
• Unified dashboard that displays real-time backup status of servers and databases
• Recommend device locations and failure impact based on resiliency requirements
• Provide capacity answers to the business in real time
Primary Focus: Technology

Common Requirements
• Data aggregation • Event correlation • Data visualization & reporting • Data enrichment • Predictive modeling
The Cyber-Tech Data Lake provides the data processing capabilities to meet the analytical needs of Security and Technology Operations
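The first security use case above, detecting malware that brute-forces logins, can be sketched as a windowed count of failed authentication attempts per source IP. This is an illustrative sketch only: the event schema (`src_ip`, `outcome`, `ts`) and the threshold are assumptions, not Capital One's actual log format or detection logic.

```python
# Hedged sketch: flag source IPs whose failed-login count inside any
# sliding time window exceeds a threshold. Field names are illustrative.
from collections import defaultdict
from datetime import datetime, timedelta

def brute_force_sources(events, window=timedelta(minutes=5), threshold=10):
    """Return source IPs with >= threshold failed logins inside any window."""
    failures = defaultdict(list)
    for ev in events:
        if ev["outcome"] == "FAIL":
            failures[ev["src_ip"]].append(ev["ts"])
    flagged = set()
    for ip, times in failures.items():
        times.sort()
        start = 0
        # slide a window over the sorted failure timestamps
        for end in range(len(times)):
            while times[end] - times[start] > window:
                start += 1
            if end - start + 1 >= threshold:
                flagged.add(ip)
                break
    return flagged
```

In a real deployment this count would be one enriched alert among many feeding the single pane of glass, rather than a standalone detector.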
6. The Cyber Data Lake will provide new capabilities:
• Predict Insider Threats
• Identify Cyber Criminals
• Predict Sophisticated Attacks
• Automate Incident Management
• Alert on Phishing Attacks
• Centralize Storage
Pipeline: Log Data Sources → Enrichment → Visualization → Machine Learning
Log data sources:
• Web Proxy
• Syslog
• Email
• Firewall
The Cyber Data Lake will be a Differentiator for Our Cybersecurity Program
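The Enrichment stage above can be sketched as a join of raw log events against reference data before they reach visualization or machine learning. The geo-IP table and field names below are made-up illustrations; a real pipeline would call a geo-IP or threat-intelligence service.

```python
# Illustrative sketch of log enrichment: attach context (here a hypothetical
# geo-IP/risk lookup) to each raw event without mutating the original.
GEO_TABLE = {
    "203.0.113.7": {"country": "RU", "risk": "high"},
    "198.51.100.2": {"country": "US", "risk": "low"},
}

def enrich(event, geo=GEO_TABLE):
    """Return a copy of the event with a geo/risk annotation added."""
    enriched = dict(event)
    enriched["geo"] = geo.get(event["src_ip"],
                              {"country": "??", "risk": "unknown"})
    return enriched
```

Enriched events like this are what make the "traffic to/from high-risk geo-locations" use case a simple filter rather than a separate lookup at query time.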
7. Create value through fast prototyping.
Bridge the gap between prototype and production.
Show how open collaboration produces network effects.
Accelerate our partners’ transformation.
The Frameworks and Platform Team’s Mission Centers on Facilitating
Innovation and Transformation within the Organization
8. Unsupervised Learning
Supervised Learning
Supervised and unsupervised learning are two highly complementary techniques for understanding data and building smart decisioning
Feature Engineering
Machine Learning Enables Algorithms to Learn Iteratively, Allowing Us to Find Hidden Insight without Direct Programming
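One common way the two techniques complement each other is to engineer an unsupervised anomaly score as a feature for a supervised decision. The sketch below uses pure-Python stand-ins (a z-score and a learned cutoff) where the talk would use H2O models; it illustrates the pattern, not the talk's actual implementation.

```python
# Minimal sketch: unsupervised scoring feeds supervised decisioning.
from statistics import mean, stdev

def anomaly_scores(values):
    """Unsupervised: score each point by its distance from the population mean."""
    m, s = mean(values), stdev(values)
    return [abs(v - m) / s for v in values]

def fit_threshold(scores, labels):
    """Supervised: pick the score cutoff that best separates labeled examples."""
    best_t, best_acc = 0.0, -1.0
    for t in sorted(scores):
        acc = sum((s >= t) == bool(y) for s, y in zip(scores, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t
```

The unsupervised half needs no labels and surfaces outliers; the supervised half uses whatever labels analysts have produced to decide which outliers actually warrant an alert.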
9. Many models can be combined and applied to multiple use cases to detect
broad, complex threat patterns.
10. Model build process
Data collection → Data exploration → Variable reduction → Variable cleaning → Model selection → Validation → Deployment → Documentation
Model builds are a highly iterative process comprising several universal steps
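Because the steps are universal and the process is iterative, it helps to express the build as a chain of plain functions that can be rerun end to end on every iteration. The stage bodies below are toy placeholders for illustration, not the talk's actual logic.

```python
# Sketch of the model build chain: each stage is a function, and an
# iteration of the build reruns the same chain over refreshed data.
def collect(raw):        # data collection: drop unreadable records
    return [r for r in raw if r is not None]

def explore(rows):       # data exploration: sanity-check before modeling
    assert rows, "no data collected"
    return rows

def reduce_vars(rows):   # variable reduction: keep the candidate features
    return [{"bytes": r["bytes"], "fails": r["fails"]} for r in rows]

def clean(rows):         # variable cleaning: impute missing values
    return [{k: (v if v is not None else 0) for k, v in r.items()}
            for r in rows]

PIPELINE = [collect, explore, reduce_vars, clean]

def run(raw):
    data = raw
    for stage in PIPELINE:
        data = stage(data)
    return data
```

Model selection, validation, deployment, and documentation would extend the same chain; keeping every step in code is what makes each iteration reproducible.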
11. Easy to use
• Users must be able to add features easily
Highly efficient
• Product must have high performance and minimize waste due to rework and errors
Scalable
• We should have the ability to scale this to multiple applications and entities
Platform agnostic
• The attributes must be able to work on any platform: Hadoop, AWS, and potentially others
Well-governed
• Attributes must protect our IP
Based on 5 Core Principles
12. Leveraging H2O
Mission
Augment human judgment by harnessing machine learning
Objectives
• Best Practices: Develop implementations of established modeling best practices for Data
Scientists using general purpose programming languages (e.g., Python, Java, Scala).
• Automation: Enable end-to-end automation of a model build, including generation of risk
management and regulatory artifacts, to reduce iteration times and enable more thorough
analysis.
• Portability: Abstract over tool choice so analytics can be scaled from laptops to next
generation Big Data tools with minimal rework.
A supervised/unsupervised learning and model risk management framework
13. How?
A supervised/unsupervised learning and model risk management framework
Objectives
• Best Practices: Work closely with Model Risk office, Decision Sciences, and
Engineering teams to identify and prioritize best practices for implementation.
• Automation: Build, on top of H2O, a framework for automating complex data processing workflows involving multiple frameworks.
• Portability: Develop a high level API focused on modeling tasks, with a variety of
implementations enabling tool substitution “under the hood”.
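The Portability objective above, a high-level modeling API with swappable implementations "under the hood", can be sketched as a small abstract interface. The backend below is a hypothetical toy stand-in; in the talk's setting an H2O-backed implementation would plug in behind the same interface.

```python
# Sketch of tool substitution behind a stable modeling API.
from abc import ABC, abstractmethod

class Model(ABC):
    @abstractmethod
    def fit(self, X, y): ...
    @abstractmethod
    def predict(self, X): ...

class MeanThresholdModel(Model):
    """Toy local backend: predict 1 when a value exceeds the training mean."""
    def fit(self, X, y):
        self.cut = sum(X) / len(X)
        return self
    def predict(self, X):
        return [int(x > self.cut) for x in X]

def build(backend="local"):
    # tool substitution point: an H2O-backed Model would be returned
    # here for a different backend, with no change to calling code
    if backend == "local":
        return MeanThresholdModel()
    raise NotImplementedError(backend)
```

Because callers only see `fit` and `predict`, the same analysis code can scale from a laptop stand-in to a big-data backend with minimal rework, which is the objective the slide names.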
14. Data Pipeline
• Extract Load Transform
• Adaptors/Connectors
• Format Conversion
Data Munging
• Data prep: group, sort, select, impute, etc.
• Create tabular output for feature selection
Feature Pipeline
• Feature selection and feature imputation
• Create feature extraction routines
• Algorithms to check and validate selected features
Model Pipeline
• Model development, model management, and model comparison
• Model metrics and selection
Deployment
• Scoring services
• Continuous integration: build integration and pipeline integration
Development and Deployment Pipeline using H2O
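The Feature Pipeline boxes above, an extraction routine plus a check that validates selected features, can be sketched as below. The feature names are illustrative assumptions, not the talk's actual feature set.

```python
# Sketch of the feature pipeline: extract a tabular row from a munged
# event, impute gaps, and validate before handing off to the model pipeline.
FEATURES = ["bytes_out", "fail_count"]  # hypothetical selected features

def extract(event):
    """Feature extraction routine: event dict -> tabular row."""
    return {"bytes_out": event.get("bytes_out"),
            "fail_count": event.get("fail_count")}

def impute(row, default=0):
    """Feature imputation: fill missing values with a default."""
    return {k: (default if row[k] is None else row[k]) for k in row}

def validate(row):
    """Check that every selected feature is present and numeric."""
    return all(isinstance(row.get(f), (int, float)) for f in FEATURES)
```

Keeping validation as its own step mirrors the slide's "algorithms to check and validate selected features": rows that fail never reach model development.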
15. Component Architecture – Model Building
Log sources (machine logs, firewall logs, device logs) feed log aggregation (raw events) into Amazon S3.
• Data pipeline and munging: incremental load into an in-memory data store
• Streaming data integration feeds feature extraction and feature imputation
• The feature pipeline and model pipeline run at three granularities: row-incremental, batch, and large batch
• A user interface surfaces alerts through a batch processing API
16. H2O Model Execution Pipeline – Batch & Real Time
Real-time path: real-time events enter Spark Streaming as a DStream (raw data over a time window) and flow through Spark RDDs into H2O Frames; feature data is built with Feat-Ext.py, and Storm bolts running Feat-Ext.py score events against the exported H2O POJO, with results surfaced in the Sparkling Water UI.
Batch path: S3 events are processed in Sparkling Water with Feat-Ext.py, executed as row, incremental-batch, or large-batch jobs.
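The real-time path's core idea, grouping raw events into time windows, extracting features per window, and scoring with a frozen model, can be sketched without the Spark/Storm machinery. All three functions below are toy stand-ins: `windows` for the DStream time window, `features` for Feat-Ext.py, and `score` for the exported H2O POJO.

```python
# Hedged sketch of windowed streaming scoring (stand-ins, not Spark/Storm).
def windows(events, width=60):
    """Group (timestamp, value) events into fixed-width time windows."""
    batch, edge = [], None
    for ts, val in events:
        if edge is None:
            edge = ts - ts % width
        while ts >= edge + width:
            yield batch           # emit the completed window
            batch, edge = [], edge + width
        batch.append(val)
    if batch:
        yield batch

def features(batch):              # stand-in for Feat-Ext.py
    return {"count": len(batch), "total": sum(batch)}

def score(feats, limit=1000):     # stand-in for the exported model artifact
    return int(feats["total"] > limit)
```

The key design point the slide makes is that the same feature code (Feat-Ext.py) runs in both the batch and real-time paths, so scores are consistent across them.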
18. AUTOMATE RELENTLESSLY
Automated processes are testable, less error prone, and clear away drudgery to make space for creativity.
STRIVE FOR REPRODUCIBILITY
It enables results to be validated and built upon. Our data products touch the financial lives of millions.
BE OPEN
Build for openness, insist that your work be of value to others, and enjoy the network effects.
EXHIBIT TECHNICAL LEADERSHIP
Team leaders are hands-on and write great code. Performers see themselves as architects generating building blocks of enduring value.
Our Methodology Reflects a Commitment to Usability and Collaboration
19. • Frees up our risk officers and data scientists to solve business problems, not shepherd individual tasks
• Encodes the accepted best practices of the risk and modeling communities
• Gives building blocks a unified API, allowing developers to adopt the newest technologies and letting users explore their business value
• Keeps analysis in code, hence reproducible, loggable, testable, and under version control
Automation has many benefits