Running a Cassandra cluster in AWS that stores petabytes of data can be costly. This talk details an approach that uses approximate data structures to keep costs low while still returning insightful, up-to-date query results. It explores a number of real-world examples from our environment to demonstrate the power of approximate data: determining how many IP addresses are on a network, ranking IPs by traffic, and determining approximate min, max, and average values. The talk also covers how this data is laid out in Cassandra so that a query always returns up-to-date data without burdening the compactor.
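To make "ranking IPs by traffic" with an approximate structure concrete, here is a minimal Count-Min Sketch in Python. It is an illustrative sketch only, not the implementation used in the talk: the class name, the `width` and `depth` defaults, and the use of BLAKE2b as the hash family are all assumptions made for this example.

```python
import hashlib


class CountMinSketch:
    """Fixed-size frequency table; estimates never undercount (illustrative only)."""

    def __init__(self, width: int = 2048, depth: int = 4):
        # width and depth are hypothetical defaults, not values from the talk.
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _cells(self, key: str):
        # One independent-looking hash per row, derived by salting BLAKE2b
        # with the row number.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                key.encode("utf-8"),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, key: str, count: int = 1) -> None:
        for row, col in self._cells(key):
            self.rows[row][col] += count

    def estimate(self, key: str) -> int:
        # Collisions can only inflate a cell, so the minimum is the best bound.
        return min(self.rows[row][col] for row, col in self._cells(key))

    def merge(self, other: "CountMinSketch") -> None:
        # Sketches of equal shape combine by cell-wise addition, which is
        # what makes per-interval sketches easy to roll up.
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketch dimensions must match")
        for mine, theirs in zip(self.rows, other.rows):
            for col, value in enumerate(theirs):
                mine[col] += value


# Made-up traffic counts for three example addresses.
traffic = {"10.0.0.1": 500, "10.0.0.2": 120, "10.0.0.3": 7}
sketch = CountMinSketch()
for ip, nbytes in traffic.items():
    sketch.add(ip, nbytes)

# Rank the candidate IPs by their estimated traffic.
ranked = sorted(traffic, key=sketch.estimate, reverse=True)
```

The sketch's memory use is fixed by `width * depth` regardless of how many distinct IPs are seen, which is the cost trade-off the abstract describes. Note that a Count-Min Sketch does not remember keys, so a real ranking needs the candidate IPs tracked separately.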
About the Speaker
Ben Kornmeier, Engineer, ProtectWise
Ben is a Staff Engineer at ProtectWise. When he is not building realtime processing pipelines, he enjoys hiking, biking, and keeping his dog out of trouble.
Using Approximate Data for Small, Insightful Analytics (Ben Kornmeier, ProtectWise) | Cassandra Summit 2016
1. Using approximate data structures for small, insightful analytics.
Ben Kornmeier, Engineer
29. Cassandra Schema
CREATE TABLE buckets (
    name text,              // bucket name
    time_bucket timestamp,  // time floored to the next interval up
    time_unit int,          // {1: "minute", 2: "hour", 3: "day"}
    algorithm text,         // e.g. HyperLogLog, CountMinSketch
    time timestamp,         // the actual time
    d blob,                 // serialized data
    PRIMARY KEY ((name, time_bucket, time_unit, algorithm), time)
);
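The schema's time_bucket comment ("floored on next interval up") can be read as: a row for a given time_unit is stored in the partition for the enclosing larger interval, so minute rows share an hourly partition and hour rows share a daily one. The Python below sketches that reading. It is an interpretation, not something the slides confirm; the function names are hypothetical, and using the month as the interval above "day" is purely an assumption.

```python
from datetime import datetime, timezone

# Codes from the schema comment: {1: "minute", 2: "hour", 3: "day"}
MINUTE, HOUR, DAY = 1, 2, 3


def time_bucket(ts: datetime, time_unit: int) -> datetime:
    """Floor ts to the start of the interval one level above time_unit.

    Assumed reading of the schema comment, not confirmed by the slides.
    """
    if time_unit == MINUTE:
        # Minute rows land in the partition for their hour.
        return ts.replace(minute=0, second=0, microsecond=0)
    if time_unit == HOUR:
        # Hour rows land in the partition for their day.
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if time_unit == DAY:
        # Assumption: the interval above "day" is the calendar month.
        return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"unknown time_unit: {time_unit}")


def partition_key(name: str, ts: datetime, time_unit: int, algorithm: str):
    """Build the (name, time_bucket, time_unit, algorithm) partition key tuple."""
    return (name, time_bucket(ts, time_unit), time_unit, algorithm)


# Hypothetical bucket name and timestamp, for illustration only.
ts = datetime(2016, 9, 8, 14, 37, 12, tzinfo=timezone.utc)
key = partition_key("ips_on_network", ts, MINUTE, "HyperLogLog")
```

Under this reading, a query for one hour of minute-level data touches a single partition and scans its rows in `time` order, since `time` is the clustering column.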