Building robust data pipelines in
Scala: the Snowplow experience
Introducing myself
• Alex Dean
• Co-founder and technical lead at Snowplow,
the open-source event analytics platform
based here in London [1]
• Weekend writer of Unified Log Processing,
available on the Manning Early Access Program
[2]
[1] https://github.com/snowplow/snowplow
[2] http://manning.com/dean
Snowplow – what is it?
Snowplow is an open source event analytics platform
[Architecture diagram: 1a. Trackers / 1b. Webhooks → 2. Collectors → 3. Enrich → 4. Storage → 5. Analytics, linked by standardised data protocols (A–D)]
• Your granular, event-level and customer-level
data, in your own data warehouse
• Connect any analytics tool to your data
• Join your event data with any other data set
Today almost all users/customers are running a batch-based
Snowplow configuration
[Pipeline diagram: Snowplow event tracking SDK → HTTP-based event collector → Amazon S3 → Hadoop-based enrichment → Amazon Redshift]
• Batch-based
• Normally run overnight; sometimes every 4-6 hours
We also have a real-time pipeline for Snowplow in beta, built on
Amazon Kinesis (Apache Kafka support coming next year)
[Pipeline diagram: Snowplow Trackers → scala-stream-collector → raw event stream (plus a bad raw event stream) → scala-kinesis-enrich → enriched event stream, feeding Kinesis apps: S3 sink → S3, Redshift sink → Redshift, kinesis-elasticsearch-sink → Elasticsearch, Event aggregator → DynamoDB; some components not yet released]
• Analytics on Read for agile exploration of events, machine learning, auditing, re-processing…
• Analytics on Write for operational reporting, real-time dashboards, audience segmentation, personalization…
Snowplow and Scala
Today, Snowplow is primarily developed in Scala
Data modelling scripts
Ruby
• Used for Snowplow orchestration
• No event-level processing occurs in Ruby
Scala
• Used for event validation, enrichment and other processing
• Increasingly used for event storage
• Starting to be used for event collection too
Our initial skunkworks version of Snowplow had no Scala 
[Snowplow data pipeline v1: Website / webapp → JavaScript event tracker → CloudFront-based pixel collector → Amazon S3 → HiveQL + Java UDF “ETL”]
But our schema-first, loosely coupled approach made it possible
to start swapping out existing components…
[Snowplow data pipeline v2: Website / webapp → JavaScript event tracker → CloudFront-based event collector or Clojure-based event collector → Amazon S3 → Scalding-based enrichment (or the original HiveQL + Java UDF “ETL”) → Amazon Redshift / PostgreSQL]
What is Scalding?
• Scalding is a Scala API over Cascading, the Java framework for building
data processing pipelines on Hadoop:
[Stack diagram: Hadoop DFS → Hadoop MapReduce → Java-level frameworks: Cascading, Hive, Pig → DSLs/APIs on top of Cascading: Scalding (Scala), Cascalog (Clojure), PyCascading (Python), cascading.jruby (Ruby)]
We chose Cascading because we liked its “plumbing” abstraction over vanilla MapReduce
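For flavour, here is a minimal Scalding job in the classic word-count style (not Snowplow code), showing the “plumbing” feel of the fields-based API:

  import com.twitter.scalding._

  // Minimal Scalding job: read lines, split into words,
  // count occurrences per word, write the counts out as TSV
  class WordCountJob(args: Args) extends Job(args) {
    TextLine(args("input"))
      .flatMap('line -> 'word) { line: String => line.toLowerCase.split("\\s+") }
      .groupBy('word) { _.size }
      .write(Tsv(args("output")))
  }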
Why did we choose Scalding instead of one of the other
Cascading DSLs/APIs?
• Lots of internal experience with Scala – could hit the
ground running (only very basic awareness of Clojure
when we started the project)
• Scalding created and supported by Twitter, who use it
throughout their organization – so we knew it was a
safe long-term bet
• More controversial opinion (although maybe not at a
Scala conference): we believe that data pipelines
should be as strongly typed as possible – all the other
DSLs/APIs on top of Cascading encourage dynamic
typing
Robust data pipelines
Robust data pipelines means strongly typed data pipelines –
why?
• Catch errors as soon as possible – and report them in a strongly typed way too
• Define the inputs and outputs of each of your data processing steps in an
unambiguous way
• Forces you to formally address the data types flowing through your system
• Lets you write code like this:
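The original slide showed a code screenshot; as a stand-in, here is a hedged sketch (the types and field names are illustrative, not Snowplow’s actual event classes) of what strongly typed pipeline steps look like:

  // Illustrative types only – not Snowplow’s real event classes
  case class RawEvent(tstamp: Long, ipAddress: String, querystring: Map[String, String])
  case class EnrichedEvent(tstamp: Long, userId: Option[String], pageUrl: Option[String])

  object Enrichment {
    // Each step declares exactly what it consumes and produces; the compiler
    // rejects any stage wired up with the wrong shape of data
    def enrich(raw: RawEvent): EnrichedEvent =
      EnrichedEvent(
        tstamp  = raw.tstamp,
        userId  = raw.querystring.get("uid"),
        pageUrl = raw.querystring.get("url")
      )
  }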
Robust data processing is a state of mind: failures will happen,
don’t panic, but don’t sweep them under the carpet either
• Our basic processing model for Snowplow looks like this:
• Look familiar? stdin, stdout, stderr
[Diagram: Raw events → Snowplow enrichment process → “Good” enriched events, plus “Bad” raw events with the reasons why they are bad]
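A minimal sketch of that split, in plain Scala with Either (purely illustrative – the real pipeline uses Scalaz Validation, shown below):

  // Illustrative only: each raw line is either rejected with a reason, or “enriched”
  object ThreeWaySplit {
    def process(line: String): Either[(String, String), String] = {
      val fields = line.split("\t", -1)
      if (fields.length >= 3) Right(fields.mkString("|"))                                   // “good” output
      else Left((line, s"Expected at least 3 tab-separated fields, got ${fields.length}"))  // “bad” output + reason
    }

    val raw         = List("a\tb\tc", "not enough fields")
    val (bad, good) = raw.map(process).partition(_.isLeft)
  }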
This pattern is extremely composable, especially with Kinesis or
Kafka streams/topics as the core building block
Validation, the “gateway
drug” to Scalaz
Inside and across our components, we use the Validation
applicative functor from the Scalaz project extensively
• Scalaz Validation lets us perform a variety of different event validations and
enrichments, and then compose (i.e. collate) the failures
• This is really powerful!
• The Scalaz codebase calls |@| a “DSL for constructing
Applicative expressions” – I think of it as “the Scream operator”
• Individual components of the enrichment process can themselves collate their
own internal failures
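A hedged sketch of the idea (the validator names and messages are invented for illustration): two independent validations composed with |@|, so both failures are collected rather than only the first:

  import scalaz._
  import Scalaz._

  object EventValidation {
    // Two illustrative, independent validations
    def validateAppId(id: String): ValidationNel[String, String] =
      if (id.nonEmpty) Success(id)
      else Failure(NonEmptyList("app_id must not be empty"))

    def validateTstamp(raw: String): ValidationNel[String, Long] =
      try Success(raw.toLong)
      catch { case _: NumberFormatException => Failure(NonEmptyList(s"not a valid timestamp: '$raw'")) }

    // |@| runs both validations and collates every failure into one NonEmptyList,
    // rather than stopping at the first one
    def validateEvent(appId: String, tstamp: String): ValidationNel[String, (String, Long)] =
      (validateAppId(appId) |@| validateTstamp(tstamp)) { (_, _) }

    // validateEvent("", "abc") is a Failure carrying both error messages
  }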
There is a great F# article by Scott Wlaschin which describes this
approach as “railway-oriented programming” [1]
The Happy Path
• If everything succeeds, then this path outputs an enriched event
• Any individual failure along the path could switch us onto the
failure path
• We never get back onto the happy path once we leave it
The Failure Path
• Any failure can take us onto the failure path
• We can choose whether to switch straight to the
failure path (“fail fast”), or collate failures from
multiple independent tests
[1] http://fsharpforfunandprofit.com/posts/recipe-part2/
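To make the fail-fast vs. collate choice concrete, a sketch reusing the illustrative validators above:

  import scalaz._
  import Scalaz._
  import EventValidation._   // the illustrative validators sketched earlier

  object Composition {
    // Collate: applicative composition runs both checks and accumulates every failure
    def collate(appId: String, tstamp: String): ValidationNel[String, (String, Long)] =
      (validateAppId(appId) |@| validateTstamp(tstamp)) { (_, _) }

    // Fail fast: sequence the checks so the first Failure short-circuits the rest
    def failFast(appId: String, tstamp: String): ValidationNel[String, (String, Long)] =
      validateAppId(appId) match {
        case Success(a) => validateTstamp(tstamp).map(t => (a, t))
        case Failure(e) => Failure(e)
      }
  }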
Putting it all together, the Snowplow enrichment process boils
down to one big type transformation
• Types abstracting over simpler types
• No mutable state
• Railway-oriented programming
• Collate failures inside a processing stage, fail fast between processing stages
• Using Scott Wlaschin’s “fruit as cargo” metaphor:
• Currently Snowplow uses a Non-Empty List of Strings to collect our failures:
• We are working on a ProcessingMessage case class, to capture much richer and
more structured failures than we can using Strings
The only limitation is that the Failure Path restricts us to a single
type
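Roughly, the overall shape looks like this (names are illustrative, and ProcessingMessage’s fields here are hypothetical, not the final design):

  import scalaz._

  object EnrichmentTypes {
    case class EnrichedEvent(eventId: String)   // stub for illustration

    // Today: the failure side of the pipeline is fixed to a NonEmptyList of Strings
    type EnrichmentResult = ValidationNel[String, EnrichedEvent]

    // Planned direction (hypothetical shape): richer, structured failures
    case class ProcessingMessage(level: String, message: String)
    type RicherResult = ValidationNel[ProcessingMessage, EnrichedEvent]
  }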
A brief aside on testing
On the testing side: we love Specs2 data tables…
• They let us test a variety of inputs and expected outputs without making the
mistake of just duplicating the data processing functionality in the test:
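The slide showed a screenshot of such a table; here is a small, self-contained sketch in the same style (the function under test is invented for illustration):

  import org.specs2.mutable.Specification
  import org.specs2.matcher.DataTables

  class PagePathSpec extends Specification with DataTables {

    // Hypothetical function under test
    def extractPath(url: String): String = new java.net.URI(url).getPath

    "extractPath" should {
      "handle a variety of URLs" in {
        "url"                               || "expected path" |>
        "http://example.com/"               !! "/"             |
        "http://example.com/a/b?x=1"        !! "/a/b"          |
        "https://example.com:8080/checkout" !! "/checkout"     | { (url, expected) =>
          extractPath(url) must_== expected
        }
      }
    }
  }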
… and are starting to do more with ScalaCheck
• ScalaCheck is a property-based testing framework, originally inspired by
Haskell’s QuickCheck
• We use it in a few places –
including to generate
unpredictable bad data and
also to validate our new Thrift
schema for raw Snowplow
events:
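A flavour of the first use case – a hedged sketch (the enrich stand-in below is invented) checking that arbitrary, mostly malformed input never blows up the entry point:

  import org.scalacheck.Prop.forAll
  import org.scalacheck.Properties

  object EnrichmentProps extends Properties("enrichment") {

    // Hypothetical stand-in for the real enrichment entry point
    def enrich(line: String): Either[String, Array[String]] = {
      val fields = line.split("\t", -1)
      if (fields.length >= 3) Right(fields)
      else Left(s"Expected at least 3 tab-separated fields, got ${fields.length}")
    }

    // Whatever ScalaCheck throws at it, enrich must return a value, never throw
    property("total on arbitrary input") = forAll { (line: String) =>
      enrich(line).isLeft || enrich(line).isRight
    }
  }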
Robustness in the face of
user-defined types
Snowplow is evolving from a fixed-schema platform to a
platform supporting user-defined JSONs
• Where other analytics tools depend on schema-less JSONs or custom variables,
we use JSON Schema
• Snowplow users send in events as “self-describing JSONs” which have to include
the schema URI which validates the event’s JSON body:
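For example (the vendor com.acme and event button_click are invented; the schema/data envelope and the iglu: URI format are Snowplow’s):

  {
    "schema": "iglu:com.acme/button_click/jsonschema/1-0-0",
    "data": {
      "buttonId": "checkout"
    }
  }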
To support JSON Schema, we have open-sourced Iglu, a new
schema repository system in Scala/Spray/Swagger/Jackson
Our Scala client library for Iglu lets us work with JSONs in a safe
way from within Snowplow
• If a JSON passes its JSON Schema validation, we should be able to deserialize it
and work with it safely in Scala in a strongly-typed way:
• We use json4s with the Jackson bindings, as JSON Schema support in Java/Scala
is Jackson-based
• We still wrap our JSON deserialization in Scalaz Validations in case of any
mismatch between the Scala deserialization code and the JSON schema
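A hedged sketch of that last point (the target case class and field names are invented; this is not the actual Iglu client API):

  import org.json4s._
  import org.json4s.jackson.JsonMethods.parse
  import scalaz._

  object SafeExtraction {
    // Illustrative target type for an already schema-validated JSON instance
    case class ButtonClick(buttonId: String)

    implicit val formats: Formats = DefaultFormats

    // The instance has already passed JSON Schema validation, but we still wrap the
    // extraction in a Validation in case the Scala types and the schema drift apart
    def extractButtonClick(json: String): ValidationNel[String, ButtonClick] =
      try Success(parse(json).extract[ButtonClick])
      catch {
        case e: Exception => Failure(NonEmptyList(s"Could not extract ButtonClick: ${e.getMessage}"))
      }
  }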
Questions?
http://snowplowanalytics.com
https://github.com/snowplow/snowplow
@snowplowdata
To meet up or chat, @alexcrdean on Twitter or
alex@snowplowanalytics.com
Discount code: ulogprugcf (43% off
Unified Log Processing eBook)
