Databricks Spark Reference Applications
1. Introduction
2. Log Analysis with Spark
i. Section 1: Introduction to Apache Spark
i. First Log Analyzer in Spark
ii. Spark SQL
iii. Spark Streaming
i. Windowed Calculations: window()
ii. Cumulative Calculations: updateStateByKey()
iii. Reusing Code from Batching: transform()
ii. Section 2: Importing Data
i. Batch Import
i. Importing from Files
i. S3
ii. HDFS
ii. Importing from Databases
ii. Streaming Import
i. Built In Methods for Streaming Import
ii. Kafka
iii. Section 3: Exporting Data
i. Small Datasets
ii. Large Datasets
i. Save the RDD to Files
ii. Save the RDD to a Database
iv. Section 4: Log Analyzer Application
3. Twitter Streaming Language Classifier
i. Collect a Dataset of Tweets
ii. Examine the Tweets and Train a Model
i. Examine with Spark SQL
ii. Train with Spark MLLib
iii. Run Examine And Train
iii. Apply the Model in Real-time
More to come...
While that's all for now, there's definitely more to come over time.
A map transformation parses each log line into an ApacheAccessLog object. The resulting ApacheAccessLog RDD is cached in memory, since multiple transformations and actions will be called on it.
It's useful to define a sum reducer - this is a function that takes in two integers and returns their sum. This is used all over
our example.
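A minimal sketch of such a reducer (the name SUM_REDUCER matches how it is used later in this walkthrough; Long is used because the counts in the example are longs):

// import org.apache.spark.api.java.function.Function2;
private static final Function2<Long, Long, Long> SUM_REDUCER = (a, b) -> a + b;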
Next, let's calculate the average, minimum, and maximum content size of the response returned. A map transformation extracts the content sizes, and then different actions (reduce, count, min, and max) are called to compute these statistics. Again, call cache on the content size RDD to avoid recalculating those values for each action called on it.
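For reference, the batch computation looks roughly like this (essentially the same snippet reappears in the streaming example later in this chapter):

JavaRDD<Long> contentSizes =
    accessLogs.map(ApacheAccessLog::getContentSize).cache();
System.out.println(String.format("Content Size Avg: %s, Min: %s, Max: %s",
    contentSizes.reduce(SUM_REDUCER) / contentSizes.count(),
    contentSizes.min(Comparator.naturalOrder()),
    contentSizes.max(Comparator.naturalOrder())));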
To compute the response code counts, we have to work with key-value pairs - by using mapToPair and reduceByKey. Notice that we call take(100) instead of collect() to gather the final output of the response code counts. Use extreme caution before calling collect() on an RDD, since all that data will be sent to a single Spark driver and can cause the driver to run out of memory. Even in this case, where there are only a limited number of response codes and it seems safe, if there are malformed lines in the Apache access log or a bug in the parser, there could be enough invalid response codes to cause an out-of-memory error.
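A sketch of that computation (assuming ApacheAccessLog exposes a getResponseCode() getter like the other fields used in this walkthrough):

List<Tuple2<Integer, Long>> responseCodeToCount =
    accessLogs.mapToPair(log -> new Tuple2<>(log.getResponseCode(), 1L))
        .reduceByKey(SUM_REDUCER)
        .take(100);
System.out.println(String.format("Response code counts: %s", responseCodeToCount));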
To compute any IPAddress that has accessed this server more than 10 times, we call the filter transformation on the counts and then map to retrieve only the IPAddress and discard the count. Again we use take(100) rather than collect():
List<String> ipAddresses =
accessLogs.mapToPair(log -> new Tuple2<>(log.getIpAddress(), 1L))
.reduceByKey(SUM_REDUCER)
.filter(tuple -> tuple._2() > 10)
.map(Tuple2::_1)
.take(100);
System.out.println(String.format("IPAddresses > 10 times: %s", ipAddresses));
Last, let's calculate the top endpoints requested in this log file. We define an inner class, ValueComparator, to help with that. This function tells us, given two tuples, which one comes first in ordering. The key of the tuple is ignored, and ordering is based just on the values. Then, we use the ValueComparator with the top action to retrieve the top endpoints accessed on this server.
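A sketch of how those two pieces might fit together (getEndpoint() is assumed to be a getter on ApacheAccessLog; the exact code lives in LogAnalyzer.java):

// Orders tuples by value only, ignoring the key.
private static class ValueComparator<K, V> implements Comparator<Tuple2<K, V>>, Serializable {
    private final Comparator<V> comparator;
    public ValueComparator(Comparator<V> comparator) {
        this.comparator = comparator;
    }
    @Override
    public int compare(Tuple2<K, V> o1, Tuple2<K, V> o2) {
        return comparator.compare(o1._2(), o2._2());
    }
}

List<Tuple2<String, Long>> topEndpoints = accessLogs
    .mapToPair(log -> new Tuple2<>(log.getEndpoint(), 1L))
    .reduceByKey(SUM_REDUCER)
    .top(10, new ValueComparator<>(Comparator.<Long>naturalOrder()));
System.out.println(String.format("Top Endpoints: %s", topEndpoints));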
These code snippets are from LogAnalyzer.java. Now that we've walked through the code, try running that example. See
the README for language specific instructions for building and running.
Spark SQL
You should go through the Spark SQL Guide before beginning this section.
This section requires an additional dependency on Spark SQL:
For those of you who are familiar with SQL, the same statistics we calculated in the previous example can be done using
Spark SQL rather than calling Spark transformations and actions directly. We walk through how to do that here.
First, we need to create a Spark SQL context. Note how we create one SparkContext and then use it to instantiate the different flavors of Spark contexts; you should not initialize multiple Spark contexts from the SparkConf in one process.
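A minimal sketch of that setup (the Spark 1.x Java API is assumed; the exact context class used in the example may differ slightly by Spark version):

SparkConf conf = new SparkConf().setAppName("Log Analyzer SQL");
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlContext = new SQLContext(sc.sc());  // reuse the one SparkContext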
Next, we need a way to register our logs data into a table. In Java, Spark SQL can infer the table schema on a standard
Java POJO - with getters and setters as we've done with ApacheAccessLog.java. (Note: if you are using a different
language besides Java, there is a different way for Spark to infer the table schema. The examples in this directory work
out of the box. Or you can also refer to the Spark SQL Guide on Data Sources for more details.)
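For illustration, registering an RDD of ApacheAccessLog objects as a table might look like the following (this assumes the Spark 1.3+ DataFrame API and a table name of "logs"; older releases used applySchema on a JavaSQLContext instead, and logFile is a placeholder path):

JavaRDD<ApacheAccessLog> accessLogs =
    sc.textFile(logFile).map(ApacheAccessLog::parseFromLogLine);
DataFrame logsTable = sqlContext.createDataFrame(accessLogs, ApacheAccessLog.class);
logsTable.registerTempTable("logs");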
Now, we are ready to start running some SQL queries on our table. Here's the code to compute the identical statistics in
the previous section - it should look very familiar for those of you who know SQL:
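As a sketch, the content size statistics from the previous section could be computed with a single query (assuming the "logs" table registered above and a contentSize column inferred from the POJO):

Row stats = sqlContext.sql(
    "SELECT SUM(contentSize), COUNT(*), MIN(contentSize), MAX(contentSize) FROM logs")
    .collectAsList().get(0);
System.out.println(String.format("Content Size Avg: %s, Min: %s, Max: %s",
    stats.getLong(0) / stats.getLong(1),
    stats.getLong(2),
    stats.getLong(3)));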
Note that the default SQL dialect does not allow using reserved keywords as alias names. In other words, SELECT COUNT(*) AS count will cause errors, but the same query with an alias that is not a reserved keyword runs fine. If you use the HiveQL parser, though, you should be able to use anything as an identifier.
Spark Streaming
Go through the Spark Streaming Programming Guide before beginning this section. In particular, it covers the concept of
DStreams.
This section requires another dependency on the Spark Streaming library:
The earlier examples demonstrate how to compute statistics on an existing log file - but not how to do real-time monitoring of logs. Spark Streaming enables that functionality.
To run the streaming examples, you will tail a log file into netcat to send the data to Spark. This is not the ideal way to get data into Spark in a production system, but it is an easy workaround for a first Spark Streaming example. We will cover best practices for how to import data for Spark Streaming in Chapter 2.
In a terminal window, just run this command on a logfile which you will append to:
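A typical invocation looks like this (the log file name is a placeholder; the port must match the one the streaming example listens on):

% tail -f YOUR_LOG_FILE.log | nc -lk 9999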
If you don't have a live log file that is being updated on the fly, you can add lines manually with the included data file or another log file of your own:
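For example (both file names are placeholders):

% cat YOUR_SAMPLE_DATA_FILE.log >> YOUR_LOG_FILE.log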
When data is streamed into Spark, these common use cases are covered:
1. Windowed Calculations means that you only care about data received in the last N amount of time. When monitoring
your web servers, perhaps you only care about what has happened in the last hour.
Spark Streaming conveniently splits the input data into the desired time windows for easy processing, using the window function of the streaming library. The foreachRDD function then allows you to access the RDDs created for each time interval.
2. Cumulative Calculations means that you want to keep cumulative statistics, while streaming in new data to refresh
those statistics. In that case, you need to maintain the state for those statistics.
The Spark Streaming library has some convenient functions for maintaining state to support this use case - namely, updateStateByKey.
3. Reusing Code from Batching covers how you should organize the business logic code from the batch examples so that code can be reused in Spark Streaming. The Spark Streaming library's transform function supports this, and it composes with window like any other DStream operation.
The first step is to initialize the SparkConf and context objects - in particular, a streaming context. Note how only one SparkContext is created from the conf, and the streaming and SQL contexts are created from it. Next, the main body should be written. Finally, the example calls start() on the streaming context and awaitTermination() to keep the application running and processing new data.
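A minimal sketch of that structure (SLIDE_INTERVAL is a Duration constant assumed to be defined elsewhere in the class):

SparkConf conf = new SparkConf().setAppName("Log Analyzer Streaming");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaStreamingContext jssc = new JavaStreamingContext(sc, SLIDE_INTERVAL);

// ... main body: build the DStreams and register the computations ...

jssc.start();              // start the streaming computation
jssc.awaitTermination();   // block so the application keeps running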
The first step of the main body is to create a DStream from reading the socket.
JavaReceiverInputDStream<String> logDataDStream =
jssc.socketTextStream("localhost", 9999);
Then, call the map transformation to parse each line into an ApacheAccessLog object, caching the resulting DStream since it will be reused:
JavaDStream<ApacheAccessLog> accessLogDStream =
logDataDStream.map(ApacheAccessLog::parseFromLogLine).cache();
Next, call window on the accessLogDStream to create a windowed DStream. The window function nicely packages the input data that is being streamed into RDDs containing a window length of data, and creates a new RDD every SLIDE_INTERVAL of time.
input data that is being streamed into RDDs containing a window length of data, and creates a new RDD every
SLIDE_INTERVAL of time.
JavaDStream<ApacheAccessLog> windowDStream =
accessLogDStream.window(WINDOW_LENGTH, SLIDE_INTERVAL);
Then, call foreachRDD on the windowDStream. The function passed into foreachRDD is called on each new RDD in the windowDStream as the RDD is created - in other words, every SLIDE_INTERVAL. The RDD passed into the function contains all the input for the last WINDOW_LENGTH of time. Now that there is an RDD of ApacheAccessLogs, simply reuse code from either of the two batch examples (regular or SQL). In this example, the code was just copied and pasted, but you could refactor it into one place for reuse in your production code base - you can reuse all your batch processing code for streaming!
windowDStream.foreachRDD(accessLogs -> {
if (accessLogs.count() == 0) {
System.out.println("No access logs in this time interval");
return null;
}
// Insert code verbatim from LogAnalyzer.java or LogAnalyzerSQL.java here.
// Calculate statistics based on the content size.
JavaRDD<Long> contentSizes =
accessLogs.map(ApacheAccessLog::getContentSize).cache();
System.out.println(String.format("Content Size Avg: %s, Min: %s, Max: %s",
contentSizes.reduce(SUM_REDUCER) / contentSizes.count(),
contentSizes.min(Comparator.naturalOrder()),
contentSizes.max(Comparator.naturalOrder())));
//...Won't copy the rest here...
  return null;
});
Now that we've walked through the code, run LogAnalyzerStreaming.java and/or LogAnalyzerStreamingSQL.java. Once you have your program up, use the cat command as explained before to add data to the log file periodically.
updateStateByKey
To use updateStateByKey, checkpointing must be enabled on the streaming context: call checkpoint on the streaming context with a directory in which to write the checkpoint data. Here is part of the main function of a streaming application that maintains state for all of time:
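A sketch of that call (the checkpoint directory is a placeholder, and sc and SLIDE_INTERVAL are as before):

JavaStreamingContext jssc = new JavaStreamingContext(sc, SLIDE_INTERVAL);
jssc.checkpoint("/tmp/log-analyzer-streaming");  // required before calling updateStateByKey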
To compute the content size statistics, simply use static variables to save the current running sum, count, min and max of
the content sizes.
To update those values, first call map on the accessLogDStream to retrieve a contentSizeDStream. Then just update the values for the static variables by calling foreachRDD on the contentSizeDStream, and calling actions on the RDD:
JavaDStream<Long> contentSizeDStream =
accessLogDStream.map(ApacheAccessLog::getContentSize).cache();
contentSizeDStream.foreachRDD(rdd -> {
if (rdd.count() > 0) {
runningSum.getAndAdd(rdd.reduce(SUM_REDUCER));
runningCount.getAndAdd(rdd.count());
runningMin.set(Math.min(runningMin.get(), rdd.min(Comparator.naturalOrder())));
runningMax.set(Math.max(runningMax.get(), rdd.max(Comparator.naturalOrder())));
System.out.print("Content Size Avg: " + runningSum.get() / runningCount.get());
System.out.print(", Min: " + runningMin.get());
System.out.println(", Max: " + runningMax.get());
}
return null;
});
For the other statistics, since they make use of key-value pairs, static variables can't be used anymore; the amount of state that needs to be maintained is potentially too big to fit in memory. So for those stats, we'll make use of updateStateByKey, so Spark Streaming will maintain a value for every key in our dataset.
But before we can call updateStateByKey, we need to create a function to pass into it. updateStateByKey takes in a different kind of reduce function. While our previous sum reducer just took in two values and output their sum, this reduce function takes in a current value and an iterator of values, and outputs one new value.
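A sketch of such an update function (the name COMPUTE_RUNNING_SUM is illustrative; note that the Spark 1.x Java API uses Guava's Optional here):

// Adds the newly arrived values for a key to its running total.
private static final Function2<List<Long>, Optional<Long>, Optional<Long>> COMPUTE_RUNNING_SUM =
    (nums, current) -> {
        long sum = current.or(0L);
        for (long num : nums) {
            sum += num;
        }
        return Optional.of(sum);
    };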
Finally, we can compute the keyed statistics for all of time with this code:
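A sketch of what that might look like for the response code counts (names follow the earlier snippets; this is not verbatim from the example):

JavaPairDStream<Integer, Long> responseCodeCountDStream = accessLogDStream
    .mapToPair(log -> new Tuple2<>(log.getResponseCode(), 1L))
    .reduceByKey(SUM_REDUCER)
    .updateStateByKey(COMPUTE_RUNNING_SUM);
responseCodeCountDStream.foreachRDD(rdd -> {
    System.out.println("Response code counts: " + rdd.take(100));
    return null;
});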
transform
The transform functions allow you to apply any arbitrary RDD-to-RDD function to the RDDs in a DStream. These functions are perfect for reusing any RDD-to-RDD functions that you may have written in batch code and want to port over to streaming. Let's look at some code to illustrate this point.
Let's say we have separated out a function, responseCodeCount, from the batch example that computes the response code counts given an RDD of ApacheAccessLogs. A response code count DStream can then be created by applying the responseCodeCount function to the accessLogDStream with transformToPair. After that, call updateStateByKey to keep a running count for all of time, and use foreachRDD to print the values out, as sketched below.
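A sketch of the refactor (the method placement and names here are illustrative; see LogAnalyzerStreamingTotalRefactored.java for the real version):

// A batch-style function: RDD of logs in, RDD of (responseCode, count) out.
public static JavaPairRDD<Integer, Long> responseCodeCount(JavaRDD<ApacheAccessLog> accessLogs) {
    return accessLogs
        .mapToPair(log -> new Tuple2<>(log.getResponseCode(), 1L))
        .reduceByKey(SUM_REDUCER);
}

// Reuse it on the DStream via transformToPair, then keep a running total.
JavaPairDStream<Integer, Long> responseCodeCountDStream = accessLogDStream
    .transformToPair(LogAnalyzerStreamingTotalRefactored::responseCodeCount)
    .updateStateByKey(COMPUTE_RUNNING_SUM);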
It is possible to combine transform and updateStateByKey as well.
Take a closer look at LogAnalyzerStreamingTotalRefactored.java now to see how that code has been refactored to reuse
code from the batch example.
If your dataset is small enough to fit on one machine, you could copy the files onto every node of your Spark cluster by hand, perhaps using rsync to make that easier.
NFS or some other network file system makes sure all your machines can access the same files without requiring you to copy the files around. But NFS isn't fault tolerant to machine failures, and if your dataset is too big to fit on one NFS volume, you'd have to store the data on multiple volumes and figure out which volume a particular file is on, which could get cumbersome.
HDFS and S3 are great file systems for massive datasets - built to store a lot of data and give all the machines on the
cluster access to those files, while still being fault tolerant. We give a few more tips on running Spark with these file
systems since they are recommended.
S3 is an Amazon AWS solution for storing files in the cloud, easily accessible to anyone who signs up for an account.
HDFS is a distributed file system that is part of Hadoop and can be installed on your own datacenters.
The good news is that regardless of which of these file systems you choose, you can run the same code to read from
them - these file systems are all "Hadoop compatible" file systems.
In this section, you should try running LogAnalyzerBatchImport.java on any files on your file system of choice. There is
nothing new in this code - it's just a refactor of the First Log Analyzer from Chapter One. Try passing in "*" or "?" for the
textFile path, and Spark will read in all the files that match that pattern to create the RDD.
S3
S3 is Amazon Web Services's solution for storing large files in the cloud. On a production system, you want your Amazon EC2 compute nodes in the same region as your S3 files, for speed as well as cost reasons. While S3 files can be read from other machines, it would take a long time and be expensive (Amazon S3 data transfer prices differ depending on whether you read data within AWS or from somewhere else on the internet).
See running Spark on EC2 if you want to launch a Spark cluster on AWS - charges apply.
If you choose to run this example with a local Spark cluster on your machine rather than EC2 compute nodes to read the
files in S3, use a small data input source!
1. Sign up for an Amazon Web Services Account.
2. Load example log files to s3.
Log into the AWS console for S3
Create an S3 bucket.
Upload a couple of example log files to that bucket.
Your files will be at the path: s3n://YOUR_BUCKET_NAME/YOUR_LOGFILE.log
3. Configure your security credentials for AWS:
Create and download your security credentials
Set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to the correct
values on all machines on your cluster. These can also be set in your SparkContext object programmatically like
this:
sc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", YOUR_ACCESS_KEY);    // sc is the JavaSparkContext
sc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", YOUR_SECRET_KEY);
HDFS
HDFS is a file system that is meant for storing large data sets and being fault tolerant. In a production system, your Spark
cluster should ideally be on the same machines as your Hadoop cluster to make it easy to read files. The Spark binary
you run on your clusters must be compiled with the same HDFS version as the one you wish to use.
There are many ways to install HDFS, but heading to the Hadoop homepage is one way to get started and run HDFS locally on your machine.
Run LogAnalyzerBatchImport.java on any file pattern on your hdfs directory.
This example uses textFileStream instead of socketTextStream. The textFileStream method monitors a filesystem directory for new files and, when it detects a new file, reads it into Spark Streaming. Just replace the call to socketTextStream with textFileStream, and pass in the directory to monitor for new log files.
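A sketch of the change (the directory to monitor is a placeholder):

JavaDStream<String> logDataDStream =
    jssc.textFileStream("/tmp/logs");  // watch this directory for new log files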
Try running LogAnalyzerStreamingImportDirectory.java by specifying a directory. You'll also need to drop or copy some
new log files into that directory while the program is running to see the calculated values update.
There are more built-in input methods for streaming - check them out in the reference API documents for the
StreamingContext.
Kafka
While the previous example picks up new log files right away, the log files aren't copied over until long after the HTTP requests in the logs actually occurred. That enables auto-refresh of log data, but it's still not real-time processing. For real-time log processing, we need a way to send over log lines immediately. Kafka is a high-throughput distributed messaging system that is perfect for that use case. Spark contains an external module for importing data from Kafka.
Here is some useful documentation to set up Kafka for Spark Streaming:
Kafka Documentation
KafkaUtils class in the external module of the Spark project - This is the class in the external module that imports data from Kafka into Spark Streaming.
Spark Streaming Example of using Kafka - This is an example that demonstrates how to call KafkaUtils.
Recall that we called the take(N) action instead of the collect() action so that the output fits in memory - no matter how big the input data set may be - this is good practice. This section walks through example code where you'll write the log statistics to a file.
It may not be that useful to have these stats output to a file - in practice, you might write these statistics to a database for
your presentation layer to access.
Now, run LogAnalyzerExportSmallData.java. Try modifying it to write to a database of your own choosing.
If your dataset is large, you can't call collect() or a similar action to read all the data from the RDD onto the single driver program - that could trigger out-of-memory problems. Instead, you have to be careful about saving a large RDD. See these two sections for more information.
Save the RDD to Files - There are built in methods in Spark for saving a large RDD to files.
Save the RDD to a Database - This section contains recommended best practices for saving a large RDD to a
database.
When .saveAsTextFile() is called on an RDD, the .toString() method is called on each RDD element and one element is written per line. The number of files output is equal to the number of partitions of the RDD being saved. In this sample, the RDD is repartitioned to control the number of output files.
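A sketch of that pattern (the RDD name and output path are placeholders):

statsRDD
    .repartition(1)                      // control the number of output part files
    .saveAsTextFile("/tmp/log-stats");   // one element per line, via toString()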
Run LogAnalyzerExportRDD.java now. Notice that the number of files output is the same as the number of partitions of the RDD.
Refer to the API documentation for other built-in methods for saving to files. There are different built-in methods for saving RDDs to files in various formats, so skim the whole RDD package to see if there is something to suit your needs.
Sqoop is a very useful tool that can import Hadoop files into various databases, and is thus a good way to get data that Spark has written out to files into your production database. Alternatively, you can write to a database directly from Spark: use foreachPartition so that one database connection is opened per partition, and batch the writes rather than inserting elements one at a time.
You can use this simple application as a skeleton and combine features from the chapters to produce your own custom
logs analysis application. The main class is LogAnalyzerAppMain.java.
Here are 5 typical stages for creating a production-ready classifier - oftentimes each stage is done with a different set of tools and even by different engineering teams:
1. Scrape/collect a dataset.
2. Clean and explore the data, doing feature extraction.
3. Build a model on the data and iterate/improve it.
4. Improve the model using more and more data, perhaps upgrading your infrastructure to support building larger
models. (Such as migrating over to Hadoop.)
5. Apply the model in real time.
Spark can be used for all of the above, and it is simple to use for all these purposes. We've chosen to break up the language classifier into 3 parts, with one simple Spark program to accomplish each part:
1. Collect a Dataset of Tweets - Spark Streaming is used to collect a dataset of tweets and write them out to files.
2. Examine the Tweets and Train a Model - Spark SQL is used to examine the dataset of Tweets. Then Spark MLLib is used to apply the KMeans algorithm to train a model on the data.
3. Apply the Model in Real-time - Spark Streaming and Spark MLLib are used to filter a live stream of Tweets for those
that match the specified cluster.
% ${YOUR_SPARK_HOME}/bin/spark-submit \
--class "com.databricks.apps.twitter_classifier.Collect" \
--master ${YOUR_SPARK_MASTER:-local[4]} \
target/scala-2.10/spark-twitter-lang-classifier-assembly-1.0.jar \
${YOUR_OUTPUT_DIR:-/tmp/tweets} \
${NUM_TWEETS_TO_COLLECT:-10000} \
${OUTPUT_FILE_INTERVAL_IN_SECS:-10} \
${OUTPUT_FILE_PARTITIONS_EACH_INTERVAL:-1} \
--consumerKey ${YOUR_TWITTER_CONSUMER_KEY} \
--consumerSecret ${YOUR_TWITTER_CONSUMER_SECRET} \
--accessToken ${YOUR_TWITTER_ACCESS_TOKEN} \
--accessTokenSecret ${YOUR_TWITTER_ACCESS_SECRET}
Spark SQL can load JSON files and infer the schema from the data. Here is the code to load the JSON files, register the data in a temp table called "tweetTable", and print out the schema based on that data. Then, take a look at the text of a few sample tweets:
sqlContext.sql(
"SELECT text FROM tweetTable LIMIT 10")
.collect().foreach(println)
View the user language, user name, and text for 10 sample tweets.
sqlContext.sql(
"SELECT user.lang, user.name, text FROM tweetTable LIMIT 10")
.collect().foreach(println)
Finally, show the count of tweets by user language. This can help determine how many clusters would be ideal for this dataset of tweets.
sqlContext.sql(
"SELECT user.lang, COUNT(*) as cnt FROM tweetTable " +
"GROUP BY user.lang ORDER BY cnt DESC limit 1000")
.collect.foreach(println)
object Utils {
...
val numFeatures = 1000
val tf = new HashingTF(numFeatures)
/**
* Create feature vectors by turning each tweet into bigrams of
* characters (an n-gram model) and then hashing those to a
* length-1000 feature vector that we can pass to MLlib.
* This is a common way to decrease the number of features in a
* model while still getting excellent accuracy (otherwise every
* pair of Unicode characters would potentially be a feature).
*/
def featurize(s: String): Vector = {
tf.transform(s.sliding(2).toSeq)
}
...
}
This is the code that actually grabs the tweet text from the tweetTable and featurizes it. KMeans is called to create the specified number of clusters, and the algorithm is run for the specified number of iterations. Finally, the trained model is persisted so it can be loaded later.
Last, here is some code to take a sample set of tweets and print them out by cluster, so that we can see which language clusters our model contains. Pick your favorite cluster to use for part 3.
% ${YOUR_SPARK_HOME}/bin/spark-submit \
--class "com.databricks.apps.twitter_classifier.ExamineAndTrain" \
--master ${YOUR_SPARK_MASTER:-local[4]} \
target/scala-2.10/spark-twitter-lang-classifier-assembly-1.0.jar \
"${YOUR_TWEET_INPUT:-/tmp/tweets/tweets*/part-*}" \
${OUTPUT_MODEL_DIR:-/tmp/tweets/model} \
${NUM_CLUSTERS:-10} \
${NUM_ITERATIONS:-20}
% ${YOUR_SPARK_HOME}/bin/spark-submit \
--class "com.databricks.apps.twitter_classifier.Predict" \
--master ${YOUR_SPARK_MASTER:-local[4]} \
target/scala-2.10/spark-twitter-lang-classifier-assembly-1.0.jar \
${YOUR_MODEL_DIR:-/tmp/tweets/model} \
${CLUSTER_TO_FILTER:-7} \
--consumerKey ${YOUR_TWITTER_CONSUMER_KEY} \
--consumerSecret ${YOUR_TWITTER_CONSUMER_SECRET} \
--accessToken ${YOUR_TWITTER_ACCESS_TOKEN} \
--accessTokenSecret ${YOUR_TWITTER_ACCESS_SECRET}