Using Approximate Data for Small, Insightful Analytics (Ben Kornmeier, ProtectWise) | Cassandra Summit 2016

Using approximate data structures
for small, insightful analytics.
Ben Kornmeier, Engineer

©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential.
About ProtectWise
● Cloud security platform that aims to make threats
actionable and obvious.
● Aims to cut down on the “noise” that a
network can create and show only the most important
details.
● Puts a big emphasis on real-time data.
● Ingests and processes terabytes of data a day.

Goals Of Count Sumula
● Quick report generation.
● Support high cardinality data.
● Compute averages, min, and max.
● Easy to add additional aggregations.

Challenge: Daily Data Ingestion
● 2 billion netflow updates.
● Ingests 20TB of raw network traffic.
● Generates 150 million observations.

Challenge: Costs of Processing Data.
● Traditional batch processing is accurate, but slow.
○ We want results in seconds not hours or days.
● Compute resources are very expensive at our scale.

Challenge: Making a Great User Experience
● A user should expect:
○ Hardly any waiting for reports to generate.
○ Up to date reports.
○ Meaningful reports that are actionable and concise.
○ Reports that are persisted forever and can be
recombined after the fact to gain additional insights.

Some Use Cases
● Show me a count of all the hosts that had a threat on
them in the past year.
● Show me the hosts with the most threats encountered
over the course of a year.

Use Cases Examined
● Show me a count of all the hosts that had a threat on
them in the past year.
○ IP addresses have very high cardinality: about 340 undecillion for IPv6 (2^128).
■ Or: 340,282,366,920,938,463,463,374,607,431,768,211,456 (WOW!)
○ Storage costs could be high.

Use Cases Examined Continued
● Show me the hosts with the most threats encountered
over the course of a year.
○ Once again, high cardinality.
○ Same storage costs as the previous example, but now we also have to sort,
which is going to be tough: O(n log n).
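
For comparison, here is a minimal sketch of the exact (non-approximate) way to answer both use cases. The event data and function names are hypothetical, made up for illustration. The point is that the set and the counter each hold one entry per distinct host, so memory grows with cardinality.

```python
from collections import Counter

# Hypothetical threat events as (host IP, threat id) pairs; illustrative data only.
events = [
    ("10.0.0.1", "t1"),
    ("10.0.0.2", "t2"),
    ("10.0.0.1", "t3"),
    ("2001:db8::1", "t4"),
    ("10.0.0.1", "t5"),
    ("10.0.0.2", "t6"),
]


def exact_distinct_hosts(events):
    """Use case 1: exact count of hosts with a threat. Stores every host once."""
    return len({host for host, _ in events})


def exact_top_hosts(events, k):
    """Use case 2: exact top-k hosts by threat count. Keeps one counter per host."""
    counts = Counter(host for host, _ in events)
    return counts.most_common(k)


print(exact_distinct_hosts(events))  # 3
print(exact_top_hosts(events, 2))    # [('10.0.0.1', 3), ('10.0.0.2', 2)]
```

With a year of data and IPv6-scale cardinality, both structures can become very large, which is the storage cost the slides refer to.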

Considerations For Our Solution
● Be real time.
● Must not grow without bounds.
● Data must be around for decades or more.
● Be able to return queries for large time ranges.
● Be actionable and concise.

The Realization
● In general users can live with an approximate result!
○ Approximate results use less space.
○ Can be computed in memory.
○ The size of an approximate result can be bounded by trading accuracy for space.
○ Approximate results are fast enough to compute in real time.
○ Meets two of our goals.

Some Approximations We Used
● HyperLogLog
● Count Min Sketch
● Stream Summary
● Bloom Filter
● Layered Bloom Filter
● Compound Approximations

HyperLogLog
● Only counts the number of consecutive 0 bits in each
element's hash; the elements themselves are not stored.
● Uses the count of consecutive 0 bits and the probability
of it occurring to determine an estimate of the unique
elements seen.
● Assumes a good hashing function (Murmur 3).
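
A minimal sketch of the idea, not the implementation used in the talk. It follows the textbook HyperLogLog layout: the first `p` bits of a 64-bit hash pick a register, and each register keeps the longest run of leading 0 bits (plus one) seen in the remaining bits. Assumptions: SHA-1 from the standard library stands in for Murmur 3, and the class name, `p` parameter, and `host-N` keys are illustrative.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p=12):
        if p < 7:
            raise ValueError("this sketch uses a bias constant valid for p >= 7")
        self.p = p
        self.m = 1 << p              # number of registers
        self.registers = [0] * self.m

    @staticmethod
    def _hash64(item):
        # Assumption: SHA-1 truncated to 64 bits stands in for Murmur 3.
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, item):
        x = self._hash64(item)
        idx = x >> (64 - self.p)                   # first p bits choose a register
        rest = x & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # rank = number of leading 0 bits in `rest`, plus one
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)           # bias correction for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)         # linear counting for small sets
        return raw

    def merge(self, other):
        # Two sketches with the same p combine by taking the per-register maximum.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]


hll = HyperLogLog(p=12)
for i in range(50000):
    hll.add(f"host-{i}")
print(round(hll.count()))  # an estimate close to 50000
```

With `p=12` the sketch holds 4,096 small registers no matter how many elements are added, and the expected relative error is roughly 1.04 / sqrt(4096), about 1.6%. Raising `p` spends more space for more accuracy. Because `merge` is a per-register maximum, sketches for separate time ranges can be recombined later.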

Example: HyperLogLog
Assuming our hashing function only returns 4 bits (16
combinations).
Longest run of 0 bits   Bit pattern(s)                              Chance of occurrence
4                       0000                                        1 / 16
3                       1000, 0001                                  2 / 16 or 1 / 8
2                       0011, 1001, 1100, 0100, 0010                5 / 16
1                       0111, 1011, 1101, 1110, 1010, 0110, 0101    7 / 16
0                       1111                                        1 / 16
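
The fractions in the table can be checked by brute force. The sketch below assumes the rows group patterns by their longest run of consecutive 0 bits (my reading of the slide). It enumerates all 16 four-bit patterns, including the all-ones pattern, which has no 0 bits at all. The function names are illustrative.

```python
from collections import Counter
from fractions import Fraction


def longest_zero_run(bits):
    """Length of the longest run of consecutive '0' characters in a bit string."""
    return max(len(run) for run in bits.split("1"))


def zero_run_distribution(width=4):
    """Map each longest-zero-run length to the fraction of patterns that have it."""
    patterns = [format(i, f"0{width}b") for i in range(2 ** width)]
    counts = Counter(longest_zero_run(p) for p in patterns)
    return {run: Fraction(n, len(patterns)) for run, n in counts.items()}


dist = zero_run_distribution(4)
for run in sorted(dist, reverse=True):
    print(run, dist[run])
# 4 1/16
# 3 1/8
# 2 5/16
# 1 7/16
# 0 1/16
```

Longer runs are rarer, so seeing one suggests that many distinct hashes have been observed.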

CountMinSketch
● Essentially a matrix.
● Inserts are duplicated across rows.
● Inserts are hashed differently per row.
● Elements can only be added, never removed.
● Used for frequency estimation.
● Can be used for averages, min, max as well.

Example: CountMinSketch
Inserting an element
“Ben”
“Eric”
1 null null null null
null null 1 null null
1 null 1 null null

Example: CountMinSketch Continued
Retrieving the count for “Ben”
“Ben” 1 null 1 null null
Compare the values returned and take the minimum; in this case, 1.
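
The insert and lookup steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the talk: salting a SHA-1 hash with the row index stands in for "hashed differently per row", counters start at 0 rather than null, and the class and method names are hypothetical.

```python
import hashlib


class CountMinSketch:
    def __init__(self, depth=3, width=5):
        self.depth = depth   # number of rows, one hash function per row
        self.width = width   # number of counters per row
        self.table = [[0] * width for _ in range(depth)]

    def _column(self, row, item):
        # Assumption: salting with the row index gives a different hash per row.
        digest = hashlib.sha1(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        # An insert touches exactly one counter in every row.
        for row in range(self.depth):
            self.table[row][self._column(row, item)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum is the best guess.
        return min(
            self.table[row][self._column(row, item)] for row in range(self.depth)
        )

    def merge(self, other):
        # Sketches with the same shape and hashes combine by adding counters.
        for row in range(self.depth):
            for col in range(self.width):
                self.table[row][col] += other.table[row][col]


cms = CountMinSketch(depth=3, width=5)
cms.add("Ben")
cms.add("Eric")
print(cms.estimate("Ben"))  # 1 unless "Eric" collides with "Ben" in every row, then 2
```

An estimate never undercounts; it can only overcount when items collide in every row. A wider table makes collisions rarer and more rows make a full collision less likely, which is how accuracy is traded for space.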

How Did We Store The Approximations?
● We generate enough approximations that we create
about 1 GB of data each month.
○ Much better than the amount stored for full fidelity data.
● First approach: just use Redis.
● Second approach: Redis and Cassandra.

First Approach: Redis Only
Advantages
● Easy
● Fast
Disadvantages
● Ticking time bomb, since Redis keeps all of its data in memory.

Second Approach: C* and Redis
Advantages
● C* scales horizontally as the data grows.
● Redis can be used when speed is important.
● Not a ticking time bomb.
Disadvantages
● Not as easy as previous solution.

How We Use Redis With Cassandra
● Elements are placed in Redis and keyed on bucket
name and time.
● Once an element from the next time interval is
encountered, data is moved from Redis to Cassandra.

Incoming Updates
{"bucket": "observation", "time": 1, "value": 1}
[Animated diagram across five slides: incoming updates flow into Redis, with Cassandra behind it.]
● Elements are summed in Redis.
● The element from time 1 is determined to be expired and written to Cassandra.
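
The flow above can be sketched with plain Python containers standing in for the two stores. Everything here is an assumption for illustration: the class name, the `hot` dict playing the role of Redis, the `cold` list playing the role of Cassandra, and integer sums in place of serialized approximations. It is not the talk's actual code and uses no Redis or Cassandra client.

```python
class BufferedBuckets:
    """In-memory stand-in: `hot` plays the role of Redis, `cold` of Cassandra."""

    def __init__(self):
        self.hot = {}    # (bucket, time) -> running sum for the open interval
        self.cold = []   # finalized (bucket, time, value) rows, written once

    def update(self, event):
        bucket, time, value = event["bucket"], event["time"], event["value"]
        # An update from a later interval expires earlier entries of this bucket:
        # they move out of the hot store and are written in finalized form.
        expired = sorted(k for k in self.hot if k[0] == bucket and k[1] < time)
        for key in expired:
            self.cold.append((key[0], key[1], self.hot.pop(key)))
        # Updates within the same interval are summed in the hot store.
        key = (bucket, time)
        self.hot[key] = self.hot.get(key, 0) + value


store = BufferedBuckets()
store.update({"bucket": "observation", "time": 1, "value": 1})
store.update({"bucket": "observation", "time": 1, "value": 1})
store.update({"bucket": "observation", "time": 2, "value": 1})
print(store.cold)  # [('observation', 1, 2)]
print(store.hot)   # {('observation', 2): 1}
```

Only the open interval stays in the hot store, so its memory use is bounded by the number of active buckets rather than by history. The sketch does not handle updates that arrive late for an interval that has already been flushed.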

Cassandra Schema
CREATE TABLE buckets (
    name text,              // bucket name
    time_bucket timestamp,  // time floored on the next interval up
    time_unit int,          // {1: "minute", 2: "hour", 3: "day"}
    algorithm text,         // HyperLogLog, CountMinSketch, etc.
    time timestamp,         // the actual time
    d blob,                 // serialized data
    PRIMARY KEY ((name, time_bucket, time_unit, algorithm), time)
);
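
One possible reading of "time floored on next interval up" is that minute-level rows share a partition per hour, hour-level rows per day, and day-level rows per month. The sketch below shows that reading only; the mapping, constant names, and function name are assumptions rather than anything stated in the talk.

```python
from datetime import datetime, timezone

# time_unit codes from the schema comment
MINUTE, HOUR, DAY = 1, 2, 3


def time_bucket(ts, time_unit):
    """Floor `ts` to the start of the next larger interval (assumed mapping)."""
    if time_unit == MINUTE:   # minute rows grouped per hour
        return ts.replace(minute=0, second=0, microsecond=0)
    if time_unit == HOUR:     # hour rows grouped per day
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if time_unit == DAY:      # day rows grouped per month
        return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"unknown time_unit: {time_unit}")


ts = datetime(2016, 9, 8, 14, 37, 12, tzinfo=timezone.utc)
print(time_bucket(ts, MINUTE))  # 2016-09-08 14:00:00+00:00
print(time_bucket(ts, HOUR))    # 2016-09-08 00:00:00+00:00
print(time_bucket(ts, DAY))     # 2016-09-01 00:00:00+00:00
```

Under this reading, a partition holds at most 60 minute rows, 24 hour rows, or 31 day rows, so partitions stay small and a time-range query touches a predictable number of them.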

Advantages of using Cassandra and Redis
● Elements are written in their finalized form to Cassandra.
○ Compactor friendly.
● Updates can happen very quickly, since Redis is fast.
● Redis memory use is no longer unbounded.

Caveats
● Approximations are just that: approximate.
● Takes time to understand how they work.
● Tuning needs up front knowledge of usage.

We’re Hiring!
Especially if you’re in Denver!
https://www.protectwise.com/careers.html
