Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
SlideShare a Scribd company logo
Using approximate data structures
for small, insightful analytics.
Ben Kornmeier, Engineer
©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential.
About Protectwise
● Cloud security platform, that aims to make threats
actionable and obvious.
● Aims to cut down on the amount of “noise” that a
network can create, and only show the most important
details.
● Has a big emphasis on real time data.
● Ingests and processes terabytes of data a day.
©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential.
Goals Of Count Sumula
● Quick report generation.
● Support high cardinality data.
● Compute averages, min, and max.
● Easy to add additional aggregations.
©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential.
Challenge: Daily Data Ingestion
● 2 billion netflow updates.
● Ingests 20TB of raw network traffic.
● Generates 150 million observations.
©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential.
Challenge: Costs of Processing Data.
● Traditional batch processing is accurate, but slow.
○ We want results in seconds not hours or days.
● Compute resources are very expensive at our scale.
©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential.
Challenge: Making a Great User Experience
● A user should expect:
○ Hardly any waiting for report generate.
○ Up to date reports.
○ Meaningful reports that are actionable and concise.
○ Reports that are persisted forever and can be
recombined after the fact to gain additional insights.
©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential.
Some Use Cases
● Show me a count all of the hosts that had a threat on
them in the past year.
● Show me the hosts with the most threats encountered
over the course of a year.
©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential.
Use Cases Examined
● Show me a count all of the hosts that had a threat on
them in the past year.
○ IP address has a very high cardinality 340 undecillion (ipv6)
■Or: 340,282,366,920,938,463,463,374,607,431,768,211,456 (WOW!)
○ Storage costs could be high.
©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential.
Use Cases Examined Continued
● Show me the hosts with the most threats encountered
over the course of a year.
○ Once again, high cardinality.
○ Same storage costs as the example before, but now we have to sort,
which is going to be tough. O(n log n).
©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential.
Considerations For Our Solution
● Be real time.
● Could not grow without bounds.
● Data must be around for decades or more.
● Be able to return queries for large time ranges.
● Be actionable and concise.
©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential.
The Realization
● In general users can live with an approximate result!
○ Approximate results use less space.
○ Can be computed in memory.
○ Approximate results can be bounded by trading accuracy for space
○ Approximate results are fast enough to compute in real time.
○ Meets two of our goals.
©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential.
Some Approximations We Used
● HyperLogLog
● Count Min Sketch
● Stream Summary
● Bloom Filter
● Layered Bloom Filter
● Compound Approximations
©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential.
HyperLogLog
● Only counts the amount of consecutive 0 bits.
● Uses the count of consecutive 0 bits and the probability
of it occurring to determine an estimate of unique
elements seen.
● Assumes a good hashing function (Murmur 3).
©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential.
Example: HyperLogLog
Assuming our hashing function only returns 4 bits (16
combinations).
Bit pattern(s) Chance of occurrence
0000 1 / 16
1000, 0001 2 / 16 or 1 / 8
0011,1001,1100,0100,0010 5 / 16
0111,1011,1101,1110,1010,0110 7 / 16
©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential.
CountMinSketch
● Essentially a matrix.
● Inserts are duplicated across rows.
● Inserts are hashed differently per row.
● Elements can only add.
● Used for frequency estimation.
● Can be used for averages, min, max as well.
©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential.
Example: CountMinSketch
Inserting an element
“Ben”
“Eric”
1 null null null null
null null 1 null null
1 null 1 null null
null null 2 null null
©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential.
Example: CountMinSketch Continued
Retrieving the count for “Ben”
“Ben” 1 null 1 null null
null null 2 null null
Compare the values return, and take the min, in this case 1.
©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential.
How Did We Store The Approximations?
● We generate enough approximations that we create
about 1 GB of data each month.
○ Much better than the amount stored for full fidelity data.
● First approach just use Redis.
● Second approach Redis and Cassandra.
©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential.
First Approach Redis Only
Advantages
● Easy
● Fast
Disadvantages
● Ticking time bomb since Redis is memory only.
©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential.
Second Approach C* And Redis
Advantages
● C* scales infinitely.
● Redis can be used when speed is important.
● Not a ticking time bomb.
Disadvantages
● Not as easy as previous solution.
©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential.
How We Use Redis With Cassandra
● Elements are placed in Redis and keyed on bucket
name and time.
● Once a element from the next time interval is
encountered, data is moved from Redis to Cassandra.
©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential.
Incoming Updates
{“bucket”: “observation”,”time”:1, “value”: 1}
{“bucket”: “observation”,”time”:1, “value”: 2}
{“bucket”: “observation”,”time”:2, “value”: 10}
Cassandra
Redis
©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential.
Incoming Updates
{“bucket”: “observation”,”time”:1, “value”: 2}
{“bucket”: “observation”,”time”:2, “value”: 10}
Cassandra
Redis
{“bucket”: “observation”,”time”:1, “value”: 1}
©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential.
Incoming Updates
{“bucket”: “observation”,”time”:2, “value”: 10}
Cassandra
Redis
{“bucket”: “observation”,”time”:1, “value”: 1}
{“bucket”: “observation”,”time”:1, “value”: 2}
©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential.
Incoming Updates
{“bucket”: “observation”,”time”:2, “value”: 10}
Cassandra
Redis
{“bucket”: “observation”,”time”:1, “value”: 1}
{“bucket”: “observation”,”time”:1, “value”: 2}
Elements are
summed
©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential.
Incoming Updates
{“bucket”: “observation”,”time”:2, “value”: 10}
Cassandra
Redis
{“bucket”: “observation”,”time”:1, “value”: 3}
©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential.
Incoming Updates
Cassandra
Redis
{“bucket”: “observation”,”time”:2, “value”: 10}
{“bucket”: “observation”,”time”:1, “value”: 3}
©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential.
Incoming Updates
Cassandra
{“bucket”: “observation”,”time”:1, “value”: 3}
Redis
{“bucket”: “observation”,”time”:2, “value”: 10}
Element from time 1 is determined to be expired and written to Cassandra
Cassandra Schema
CREATE TABLE buckets (
name text, // bucket name
time_bucket timestamp, // Time floored on next interval up.
time_unit int, // {1: “minute”, 2: “hour”, 3: “day” }
algorithm text, // [HyperLogLog, CountMinSketch, etc]
time timestamp, // the actual time
d blob, //Serialized data
PRIMARY KEY ((name, time_bucket, time_unit, algorithm), time)
Cassandra Schema
CREATE TABLE buckets (
name text, // bucket name
time_bucket timestamp, // Time floored on next interval up.
time_unit int, // {1: “minute”, 2: “hour”, 3: “day” }
algorithm text, // [HyperLogLog, CountMinSketch, etc]
time timestamp, // the actual time
d blob, //Serialized data
PRIMARY KEY ((name, time_bucket, time_unit, algorithm), time)
Cassandra Schema
CREATE TABLE buckets (
name text, // bucket name
time_bucket timestamp, // Time floored on next interval up.
time_unit int, // {1: “minute”, 2: “hour”, 3: “day” }
algorithm text, // [HyperLogLog, CountMinSketch, etc]
time timestamp, // the actual time
d blob, //Serialized data
PRIMARY KEY ((name, time_bucket, time_unit, algorithm), time)
©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential.
Advantages of using Cassandra and Redis
● Elements are written in their finalized form to Cassandra.
○ Compactor friendly.
● Updates can happen very fast since Redis is Fast.
● Redis no longer consumes memory unbounded.
Caveats
● Using approximations are just that, approximate.
● Takes time to understand how they work.
● Tuning needs up front knowledge of usage.
https://www.protectwise.com/careers.html
Especially if you’re in Denver!
We’re Hiring!

More Related Content

Using Approximate Data for Small, Insightful Analytics (Ben Kornmeier, ProtectWise) | Cassandra Summit 2016

  • 1. Using approximate data structures for small, insightful analytics. Ben Kornmeier, Engineer
  • 2. ©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential. About Protectwise ● Cloud security platform, that aims to make threats actionable and obvious. ● Aims to cut down on the amount of “noise” that a network can create, and only show the most important details. ● Has a big emphasis on real time data. ● Ingests and processes terabytes of data a day.
  • 3. ©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential. Goals Of Count Sumula ● Quick report generation. ● Support high cardinality data. ● Compute averages, min, and max. ● Easy to add additional aggregations.
  • 4. ©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential. Challenge: Daily Data Ingestion ● 2 billion netflow updates. ● Ingests 20TB of raw network traffic. ● Generates 150 million observations.
  • 5. ©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential. Challenge: Costs of Processing Data. ● Traditional batch processing is accurate, but slow. ○ We want results in seconds not hours or days. ● Compute resources are very expensive at our scale.
  • 6. ©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential. Challenge: Making a Great User Experience ● A user should expect: ○ Hardly any waiting for report generate. ○ Up to date reports. ○ Meaningful reports that are actionable and concise. ○ Reports that are persisted forever and can be recombined after the fact to gain additional insights.
  • 7. ©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential. Some Use Cases ● Show me a count all of the hosts that had a threat on them in the past year. ● Show me the hosts with the most threats encountered over the course of a year.
  • 8. ©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential. Use Cases Examined ● Show me a count all of the hosts that had a threat on them in the past year. ○ IP address has a very high cardinality 340 undecillion (ipv6) ■Or: 340,282,366,920,938,463,463,374,607,431,768,211,456 (WOW!) ○ Storage costs could be high.
  • 9. ©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential. Use Cases Examined Continued ● Show me the hosts with the most threats encountered over the course of a year. ○ Once again, high cardinality. ○ Same storage costs as the example before, but now we have to sort, which is going to be tough. O(n log n).
  • 10. ©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential. Considerations For Our Solution ● Be real time. ● Could not grow without bounds. ● Data must be around for decades or more. ● Be able to return queries for large time ranges. ● Be actionable and concise.
  • 11. ©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential. The Realization ● In general users can live with an approximate result! ○ Approximate results use less space. ○ Can be computed in memory. ○ Approximate results can be bounded by trading accuracy for space ○ Approximate results are fast enough to compute in real time. ○ Meets two of our goals.
  • 12. ©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential. Some Approximations We Used ● HyperLogLog ● Count Min Sketch ● Stream Summary ● Bloom Filter ● Layered Bloom Filter ● Compound Approximations
  • 13. ©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential. HyperLogLog ● Only counts the amount of consecutive 0 bits. ● Uses the count of consecutive 0 bits and the probability of it occurring to determine an estimate of unique elements seen. ● Assumes a good hashing function (Murmur 3).
  • 14. ©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential. Example: HyperLogLog Assuming our hashing function only returns 4 bits (16 combinations). Bit pattern(s) Chance of occurrence 0000 1 / 16 1000, 0001 2 / 16 or 1 / 8 0011,1001,1100,0100,0010 5 / 16 0111,1011,1101,1110,1010,0110 7 / 16
  • 15. ©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential. CountMinSketch ● Essentially a matrix. ● Inserts are duplicated across rows. ● Inserts are hashed differently per row. ● Elements can only add. ● Used for frequency estimation. ● Can be used for averages, min, max as well.
  • 16. ©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential. Example: CountMinSketch Inserting an element “Ben” “Eric” 1 null null null null null null 1 null null 1 null 1 null null null null 2 null null
  • 17. ©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential. Example: CountMinSketch Continued Retrieving the count for “Ben” “Ben” 1 null 1 null null null null 2 null null Compare the values return, and take the min, in this case 1.
  • 18. ©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential. How Did We Store The Approximations? ● We generate enough approximations that we create about 1 GB of data each month. ○ Much better than the amount stored for full fidelity data. ● First approach just use Redis. ● Second approach Redis and Cassandra.
  • 19. ©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential. First Approach Redis Only Advantages ● Easy ● Fast Disadvantages ● Ticking time bomb since Redis is memory only.
  • 20. ©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential. Second Approach C* And Redis Advantages ● C* scales infinitely. ● Redis can be used when speed is important. ● Not a ticking time bomb. Disadvantages ● Not as easy as previous solution.
  • 21. ©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential. How We Use Redis With Cassandra ● Elements are placed in Redis and keyed on bucket name and time. ● Once a element from the next time interval is encountered, data is moved from Redis to Cassandra.
  • 22. ©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential. Incoming Updates {“bucket”: “observation”,”time”:1, “value”: 1} {“bucket”: “observation”,”time”:1, “value”: 2} {“bucket”: “observation”,”time”:2, “value”: 10} Cassandra Redis
  • 23. ©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential. Incoming Updates {“bucket”: “observation”,”time”:1, “value”: 2} {“bucket”: “observation”,”time”:2, “value”: 10} Cassandra Redis {“bucket”: “observation”,”time”:1, “value”: 1}
  • 24. ©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential. Incoming Updates {“bucket”: “observation”,”time”:2, “value”: 10} Cassandra Redis {“bucket”: “observation”,”time”:1, “value”: 1} {“bucket”: “observation”,”time”:1, “value”: 2}
  • 25. ©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential. Incoming Updates {“bucket”: “observation”,”time”:2, “value”: 10} Cassandra Redis {“bucket”: “observation”,”time”:1, “value”: 1} {“bucket”: “observation”,”time”:1, “value”: 2} Elements are summed
  • 26. ©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential. Incoming Updates {“bucket”: “observation”,”time”:2, “value”: 10} Cassandra Redis {“bucket”: “observation”,”time”:1, “value”: 3}
  • 27. ©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential. Incoming Updates Cassandra Redis {“bucket”: “observation”,”time”:2, “value”: 10} {“bucket”: “observation”,”time”:1, “value”: 3}
  • 28. ©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential. Incoming Updates Cassandra {“bucket”: “observation”,”time”:1, “value”: 3} Redis {“bucket”: “observation”,”time”:2, “value”: 10} Element from time 1 is determined to be expired and written to Cassandra
  • 29. Cassandra Schema CREATE TABLE buckets ( name text, // bucket name time_bucket timestamp, // Time floored on next interval up. time_unit int, // {1: “minute”, 2: “hour”, 3: “day” } algorithm text, // [HyperLogLog, CountMinSketch, etc] time timestamp, // the actual time d blob, //Serialized data PRIMARY KEY ((name, time_bucket, time_unit, algorithm), time)
  • 30. Cassandra Schema CREATE TABLE buckets ( name text, // bucket name time_bucket timestamp, // Time floored on next interval up. time_unit int, // {1: “minute”, 2: “hour”, 3: “day” } algorithm text, // [HyperLogLog, CountMinSketch, etc] time timestamp, // the actual time d blob, //Serialized data PRIMARY KEY ((name, time_bucket, time_unit, algorithm), time)
  • 31. Cassandra Schema CREATE TABLE buckets ( name text, // bucket name time_bucket timestamp, // Time floored on next interval up. time_unit int, // {1: “minute”, 2: “hour”, 3: “day” } algorithm text, // [HyperLogLog, CountMinSketch, etc] time timestamp, // the actual time d blob, //Serialized data PRIMARY KEY ((name, time_bucket, time_unit, algorithm), time)
  • 32. ©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential. Advantages of using Cassandra and Redis ● Elements are written in their finalized form to Cassandra. ○ Compactor friendly. ● Updates can happen very fast since Redis is Fast. ● Redis no longer consumes memory unbounded.
  • 33. Caveats ● Using approximations are just that, approximate. ● Takes time to understand how they work. ● Tuning needs up front knowledge of usage.