Databricks Spark Reference Applications
1. Introduction
2. Log Analysis with Spark
i. Section 1: Introduction to Apache Spark
i. First Log Analyzer in Spark
ii. Spark SQL
iii. Spark Streaming
i. Windowed Calculations: window()
ii. Cumulative Calculations: updateStateByKey()
iii. Reusing Code from Batching: transform()
ii. Section 2: Importing Data
i. Batch Import
i. Importing from Files
i. S3
ii. HDFS
ii. Importing from Databases
ii. Streaming Import
i. Built In Methods for Streaming Import
ii. Kafka
iii. Section 3: Exporting Data
i. Small Datasets
ii. Large Datasets
i. Save the RDD to Files
ii. Save the RDD to a Database
iv. Section 4: Log Analyzer Application
3. Twitter Streaming Language Classifier
i. Collect a Dataset of Tweets
ii. Examine the Tweets and Train a Model
i. Examine with Spark SQL
ii. Train with Spark MLLib
iii. Run Examine And Train
iii. Apply the Model in Real-time
More to come...
While that's all for now, there's definitely more to come over time.
A map transformation parses each log line into an ApacheAccessLog object. The resulting ApacheAccessLog RDD is cached in memory, since multiple transformations and actions will be called on it.
It's useful to define a sum reducer - this is a function that takes in two integers and returns their sum. This is used all over
our example.
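A minimal sketch of such a reducer (the name SUM_REDUCER matches how it is used later in this walkthrough; Long is used because the counts in the example are longs):

// import org.apache.spark.api.java.function.Function2;
private static final Function2<Long, Long, Long> SUM_REDUCER = (a, b) -> a + b;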
Next, let's calculate the average, minimum, and maximum content size of the response returned. A map transformation extracts the content sizes, and then different actions (reduce, count, min, and max) are called to compute these statistics. Again, call cache on the content size RDD to avoid recalculating those values for each action called on it.
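For reference, the batch computation looks roughly like this (essentially the same snippet reappears in the streaming example later in this chapter):

JavaRDD<Long> contentSizes =
    accessLogs.map(ApacheAccessLog::getContentSize).cache();
System.out.println(String.format("Content Size Avg: %s, Min: %s, Max: %s",
    contentSizes.reduce(SUM_REDUCER) / contentSizes.count(),
    contentSizes.min(Comparator.naturalOrder()),
    contentSizes.max(Comparator.naturalOrder())));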
To compute the response code counts, we have to work with key-value pairs - by using mapToPair and reduceByKey. Notice that we call take(100) instead of collect() to gather the final output of the response code counts. Use extreme caution before calling collect() on an RDD, since all that data will be sent to a single Spark driver and can cause the driver to run out of memory. Even in this case, where there are only a limited number of response codes and it seems safe, if there are malformed lines in the Apache access log or a bug in the parser, there could be enough invalid response codes to cause an out-of-memory error.
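A sketch of that computation (assuming ApacheAccessLog exposes a getResponseCode() getter like the other fields used in this walkthrough):

List<Tuple2<Integer, Long>> responseCodeToCount =
    accessLogs.mapToPair(log -> new Tuple2<>(log.getResponseCode(), 1L))
        .reduceByKey(SUM_REDUCER)
        .take(100);
System.out.println(String.format("Response code counts: %s", responseCodeToCount));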
To compute any IPAddress that has accessed this server more than 10 times, we call the filter transformation on the counts and then map to retrieve only the IPAddress and discard the count. Again we use take(100) rather than collect():
List<String> ipAddresses =
accessLogs.mapToPair(log -> new Tuple2<>(log.getIpAddress(), 1L))
.reduceByKey(SUM_REDUCER)
.filter(tuple -> tuple._2() > 10)
.map(Tuple2::_1)
.take(100);
System.out.println(String.format("IPAddresses > 10 times: %s", ipAddresses));
Last, let's calculate the top endpoints requested in this log file. We define an inner class, ValueComparator, to help with that. This function tells us, given two tuples, which one comes first in ordering. The key of the tuple is ignored, and ordering is based just on the values. Then, we use the ValueComparator with the top action to retrieve the top endpoints accessed on this server.
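A sketch of how those two pieces might fit together (getEndpoint() is assumed to be a getter on ApacheAccessLog; the exact code lives in LogAnalyzer.java):

// Orders tuples by value only, ignoring the key.
private static class ValueComparator<K, V> implements Comparator<Tuple2<K, V>>, Serializable {
    private final Comparator<V> comparator;
    public ValueComparator(Comparator<V> comparator) {
        this.comparator = comparator;
    }
    @Override
    public int compare(Tuple2<K, V> o1, Tuple2<K, V> o2) {
        return comparator.compare(o1._2(), o2._2());
    }
}

List<Tuple2<String, Long>> topEndpoints = accessLogs
    .mapToPair(log -> new Tuple2<>(log.getEndpoint(), 1L))
    .reduceByKey(SUM_REDUCER)
    .top(10, new ValueComparator<>(Comparator.<Long>naturalOrder()));
System.out.println(String.format("Top Endpoints: %s", topEndpoints));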
These code snippets are from LogAnalyzer.java. Now that we've walked through the code, try running that example. See
the README for language specific instructions for building and running.
Spark SQL
You should go through the Spark SQL Guide before beginning this section.
This section requires an additional dependency on Spark SQL:
For those of you who are familiar with SQL, the same statistics we calculated in the previous example can be done using
Spark SQL rather than calling Spark transformations and actions directly. We walk through how to do that here.
First, we need to create a Spark SQL context. Note how we create one SparkContext and then use it to instantiate the different flavors of Spark contexts; you should not initialize multiple Spark contexts from the SparkConf in one process.
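A minimal sketch of that setup (the Spark 1.x Java API is assumed; the exact context class used in the example may differ slightly by Spark version):

SparkConf conf = new SparkConf().setAppName("Log Analyzer SQL");
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlContext = new SQLContext(sc.sc());  // reuse the one SparkContext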
Next, we need a way to register our logs data into a table. In Java, Spark SQL can infer the table schema on a standard
Java POJO - with getters and setters as we've done with ApacheAccessLog.java. (Note: if you are using a different
language besides Java, there is a different way for Spark to infer the table schema. The examples in this directory work
out of the box. Or you can also refer to the Spark SQL Guide on Data Sources for more details.)
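For illustration, registering an RDD of ApacheAccessLog objects as a table might look like the following (this assumes the Spark 1.3+ DataFrame API and a table name of "logs"; older releases used applySchema on a JavaSQLContext instead, and logFile is a placeholder path):

JavaRDD<ApacheAccessLog> accessLogs =
    sc.textFile(logFile).map(ApacheAccessLog::parseFromLogLine);
DataFrame logsTable = sqlContext.createDataFrame(accessLogs, ApacheAccessLog.class);
logsTable.registerTempTable("logs");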
Now, we are ready to start running some SQL queries on our table. Here's the code to compute the identical statistics in
the previous section - it should look very familiar for those of you who know SQL:
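As a sketch, the content size statistics from the previous section could be computed with a single query (assuming the "logs" table registered above and a contentSize column inferred from the POJO):

Row stats = sqlContext.sql(
    "SELECT SUM(contentSize), COUNT(*), MIN(contentSize), MAX(contentSize) FROM logs")
    .collectAsList().get(0);
System.out.println(String.format("Content Size Avg: %s, Min: %s, Max: %s",
    stats.getLong(0) / stats.getLong(1),
    stats.getLong(2),
    stats.getLong(3)));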
Note that the default SQL dialect does not allow using reserved keywords as alias names. In other words, SELECT COUNT(*) AS count will cause errors, but the same query with an alias that is not a reserved keyword runs fine. If you use the HiveQL parser, though, you should be able to use anything as an identifier.
Spark Streaming
Go through the Spark Streaming Programming Guide before beginning this section. In particular, it covers the concept of
DStreams.
This section requires another dependency on the Spark Streaming library:
The earlier examples demonstrate how to compute statistics on an existing log file - but not how to do real-time monitoring of logs. Spark Streaming enables that functionality.
To run the streaming examples, you will tail a log file into netcat to send the data to Spark. This is not the ideal way to get data into Spark in a production system, but it is an easy workaround for a first Spark Streaming example. We will cover best practices for how to import data for Spark Streaming in Chapter 2.
In a terminal window, just run this command on a logfile which you will append to:
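A typical invocation looks like this (the log file name is a placeholder; the port must match the one the streaming example listens on):

% tail -f YOUR_LOG_FILE.log | nc -lk 9999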
If you don't have a live log file that is being updated on the fly, you can add lines manually with the included data file or another log file of your own:
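For example (both file names are placeholders):

% cat YOUR_SAMPLE_DATA_FILE.log >> YOUR_LOG_FILE.log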
When data is streamed into Spark, these common use cases are covered:
1. Windowed Calculations means that you only care about data received in the last N amount of time. When monitoring
your web servers, perhaps you only care about what has happened in the last hour.
Spark Streaming conveniently splits the input data into the desired time windows for easy processing, using the window function of the streaming library. The foreachRDD function then allows you to access the RDDs created for each time interval.
2. Cumulative Calculations means that you want to keep cumulative statistics, while streaming in new data to refresh
those statistics. In that case, you need to maintain the state for those statistics.
The Spark Streaming library has some convenient functions for maintaining state to support this use case - namely, updateStateByKey.
3. Reusing Code from Batching covers how you should organize the business logic code from the batch examples so that code can be reused in Spark Streaming. The Spark Streaming library's transform function supports this, and it composes with window like any other DStream operation.
The first step is to initialize the SparkConf and context objects - in particular, a streaming context. Note how only one SparkContext is created from the conf, and the streaming and SQL contexts are created from it. Next, the main body should be written. Finally, the example calls start() on the streaming context and awaitTermination() to keep the application running and processing new data.
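A minimal sketch of that structure (SLIDE_INTERVAL is a Duration constant assumed to be defined elsewhere in the class):

SparkConf conf = new SparkConf().setAppName("Log Analyzer Streaming");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaStreamingContext jssc = new JavaStreamingContext(sc, SLIDE_INTERVAL);

// ... main body: build the DStreams and register the computations ...

jssc.start();              // start the streaming computation
jssc.awaitTermination();   // block so the application keeps running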
The first step of the main body is to create a DStream from reading the socket.
JavaReceiverInputDStream<String> logDataDStream =
jssc.socketTextStream("localhost", 9999);
Then, call the map transformation to parse each line into an ApacheAccessLog object, caching the resulting DStream since it will be reused:
JavaDStream<ApacheAccessLog> accessLogDStream =
logDataDStream.map(ApacheAccessLog::parseFromLogLine).cache();
Next, call window on the accessLogDStream to create a windowed DStream. The window function nicely packages the input data that is being streamed into RDDs containing a window length of data, and creates a new RDD every SLIDE_INTERVAL of time.
input data that is being streamed into RDDs containing a window length of data, and creates a new RDD every
SLIDE_INTERVAL of time.
JavaDStream<ApacheAccessLog> windowDStream =
accessLogDStream.window(WINDOW_LENGTH, SLIDE_INTERVAL);
Then, call foreachRDD on the windowDStream. The function passed into foreachRDD is called on each new RDD in the windowDStream as the RDD is created - in other words, every SLIDE_INTERVAL. The RDD passed into the function contains all the input for the last WINDOW_LENGTH of time. Now that there is an RDD of ApacheAccessLogs, simply reuse code from either of the two batch examples (regular or SQL). In this example, the code was just copied and pasted, but you could refactor it into one place for reuse in your production code base - you can reuse all your batch processing code for streaming!
windowDStream.foreachRDD(accessLogs -> {
if (accessLogs.count() == 0) {
System.out.println("No access logs in this time interval");
return null;
}
// Insert code verbatim from LogAnalyzer.java or LogAnalyzerSQL.java here.
// Calculate statistics based on the content size.
JavaRDD<Long> contentSizes =
accessLogs.map(ApacheAccessLog::getContentSize).cache();
System.out.println(String.format("Content Size Avg: %s, Min: %s, Max: %s",
contentSizes.reduce(SUM_REDUCER) / contentSizes.count(),
contentSizes.min(Comparator.naturalOrder()),
contentSizes.max(Comparator.naturalOrder())));
//...Won't copy the rest here...
  return null;
});
Now that we've walked through the code, run LogAnalyzerStreaming.java and/or LogAnalyzerStreamingSQL.java. Once you have your program up, use the cat command as explained before to add data to the log file periodically.
updateStateByKey
To use updateStateByKey, checkpointing must be enabled on the streaming context: call checkpoint on the streaming context with a directory in which to write the checkpoint data. Here is part of the main function of a streaming application that maintains state for all of time:
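A sketch of that call (the checkpoint directory is a placeholder, and sc and SLIDE_INTERVAL are as before):

JavaStreamingContext jssc = new JavaStreamingContext(sc, SLIDE_INTERVAL);
jssc.checkpoint("/tmp/log-analyzer-streaming");  // required before calling updateStateByKey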
To compute the content size statistics, simply use static variables to save the current running sum, count, min and max of
the content sizes.
To update those values, first call map on the accessLogDStream to retrieve a contentSizeDStream. Then just update the values for the static variables by calling foreachRDD on the contentSizeDStream, and calling actions on the RDD:
JavaDStream<Long> contentSizeDStream =
accessLogDStream.map(ApacheAccessLog::getContentSize).cache();
contentSizeDStream.foreachRDD(rdd -> {
if (rdd.count() > 0) {
runningSum.getAndAdd(rdd.reduce(SUM_REDUCER));
runningCount.getAndAdd(rdd.count());
runningMin.set(Math.min(runningMin.get(), rdd.min(Comparator.naturalOrder())));
runningMax.set(Math.max(runningMax.get(), rdd.max(Comparator.naturalOrder())));
System.out.print("Content Size Avg: " + runningSum.get() / runningCount.get());
System.out.print(", Min: " + runningMin.get());
System.out.println(", Max: " + runningMax.get());
}
return null;
});
For the other statistics, since they make use of key-value pairs, static variables can't be used anymore; the amount of state that needs to be maintained is potentially too big to fit in memory. So for those stats, we'll make use of updateStateByKey, so Spark Streaming will maintain a value for every key in our dataset.
But before we can call updateStateByKey, we need to create a function to pass into it. updateStateByKey takes in a different kind of reduce function. While our previous sum reducer just took in two values and output their sum, this reduce function takes in a current value and an iterator of values, and outputs one new value.
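A sketch of such an update function (the name COMPUTE_RUNNING_SUM is illustrative; note that the Spark 1.x Java API uses Guava's Optional here):

// Adds the newly arrived values for a key to its running total.
private static final Function2<List<Long>, Optional<Long>, Optional<Long>> COMPUTE_RUNNING_SUM =
    (nums, current) -> {
        long sum = current.or(0L);
        for (long num : nums) {
            sum += num;
        }
        return Optional.of(sum);
    };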
Finally, we can compute the keyed statistics for all of time with this code:
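A sketch of what that might look like for the response code counts (names follow the earlier snippets; this is not verbatim from the example):

JavaPairDStream<Integer, Long> responseCodeCountDStream = accessLogDStream
    .mapToPair(log -> new Tuple2<>(log.getResponseCode(), 1L))
    .reduceByKey(SUM_REDUCER)
    .updateStateByKey(COMPUTE_RUNNING_SUM);
responseCodeCountDStream.foreachRDD(rdd -> {
    System.out.println("Response code counts: " + rdd.take(100));
    return null;
});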
transform
The transform functions allow you to apply any arbitrary RDD-to-RDD function to the RDDs in a DStream. These functions are perfect for reusing any RDD-to-RDD functions that you may have written in batch code and want to port over to streaming. Let's look at some code to illustrate this point.
Let's say we have separated out a function, responseCodeCount, from the batch example that computes the response code counts given an RDD of ApacheAccessLogs. A response code count DStream can then be created by applying the responseCodeCount function to the accessLogDStream with transformToPair. After that, call updateStateByKey to keep a running count for all of time, and use foreachRDD to print the values out, as sketched below.
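A sketch of the refactor (the method placement and names here are illustrative; see LogAnalyzerStreamingTotalRefactored.java for the real version):

// A batch-style function: RDD of logs in, RDD of (responseCode, count) out.
public static JavaPairRDD<Integer, Long> responseCodeCount(JavaRDD<ApacheAccessLog> accessLogs) {
    return accessLogs
        .mapToPair(log -> new Tuple2<>(log.getResponseCode(), 1L))
        .reduceByKey(SUM_REDUCER);
}

// Reuse it on the DStream via transformToPair, then keep a running total.
JavaPairDStream<Integer, Long> responseCodeCountDStream = accessLogDStream
    .transformToPair(LogAnalyzerStreamingTotalRefactored::responseCodeCount)
    .updateStateByKey(COMPUTE_RUNNING_SUM);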
It is possible to combine transform and updateStateByKey as well.
Take a closer look at LogAnalyzerStreamingTotalRefactored.java now to see how that code has been refactored to reuse
code from the batch example.
If your dataset is small enough to fit on one machine, you could copy the files onto every node of your Spark cluster by hand, perhaps using rsync to make that easier.
NFS or some other network file system makes sure all your machines can access the same files without requiring you to copy the files around. But NFS isn't fault tolerant to machine failures, and if your dataset is too big to fit on one NFS volume, you'd have to store the data on multiple volumes and figure out which volume a particular file is on, which could get cumbersome.
HDFS and S3 are great file systems for massive datasets - built to store a lot of data and give all the machines on the
cluster access to those files, while still being fault tolerant. We give a few more tips on running Spark with these file
systems since they are recommended.
S3 is an Amazon AWS solution for storing files in the cloud, easily accessible to anyone who signs up for an account.
HDFS is a distributed file system that is part of Hadoop and can be installed on your own datacenters.
The good news is that regardless of which of these file systems you choose, you can run the same code to read from
them - these file systems are all "Hadoop compatible" file systems.
In this section, you should try running LogAnalyzerBatchImport.java on any files on your file system of choice. There is
nothing new in this code - it's just a refactor of the First Log Analyzer from Chapter One. Try passing in "*" or "?" for the
textFile path, and Spark will read in all the files that match that pattern to create the RDD.
S3
S3 is Amazon Web Services's solution for storing large files in the cloud. On a production system, you want your Amazon EC2 compute nodes in the same region as your S3 files, for speed as well as cost reasons. While S3 files can be read from other machines, it would take a long time and be expensive (Amazon S3 data transfer prices differ depending on whether you read data within AWS or from somewhere else on the internet).
See running Spark on EC2 if you want to launch a Spark cluster on AWS - charges apply.
If you choose to run this example with a local Spark cluster on your machine rather than EC2 compute nodes to read the
files in S3, use a small data input source!
1. Sign up for an Amazon Web Services Account.
2. Load example log files to s3.
Log into the AWS console for S3
Create an S3 bucket.
Upload a couple of example log files to that bucket.
Your files will be at the path: s3n://YOUR_BUCKET_NAME/YOUR_LOGFILE.log
3. Configure your security credentials for AWS:
Create and download your security credentials
Set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to the correct
values on all machines on your cluster. These can also be set in your SparkContext object programmatically like
this:
sc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", YOUR_ACCESS_KEY);    // sc is the JavaSparkContext
sc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", YOUR_SECRET_KEY);
HDFS
HDFS is a file system that is meant for storing large data sets and being fault tolerant. In a production system, your Spark
cluster should ideally be on the same machines as your Hadoop cluster to make it easy to read files. The Spark binary
you run on your clusters must be compiled with the same HDFS version as the one you wish to use.
There are many ways to install HDFS, but heading to the Hadoop homepage is one way to get started and run HDFS locally on your machine.
Run LogAnalyzerBatchImport.java on any file pattern on your hdfs directory.
This example uses textFileStream instead of socketTextStream. The textFileStream method monitors a filesystem directory for new files and, when it detects a new file, reads it into Spark Streaming. Just replace the call to socketTextStream with textFileStream, and pass in the directory to monitor for new log files.
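A sketch of the change (the directory to monitor is a placeholder):

JavaDStream<String> logDataDStream =
    jssc.textFileStream("/tmp/logs");  // watch this directory for new log files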
Try running LogAnalyzerStreamingImportDirectory.java by specifying a directory. You'll also need to drop or copy some
new log files into that directory while the program is running to see the calculated values update.
There are more built-in input methods for streaming - check them out in the reference API documents for the
StreamingContext.
Kafka
While the previous example picks up new log files right away, the log files aren't copied over until long after the HTTP requests in the logs actually occurred. That enables auto-refresh of log data, but it's still not real-time processing. For real-time log processing, we need a way to send over log lines immediately. Kafka is a high-throughput distributed messaging system that is perfect for that use case. Spark contains an external module for importing data from Kafka.
Here is some useful documentation to set up Kafka for Spark Streaming:
Kafka Documentation
KafkaUtils class in the external module of the Spark project - This is the class in the external module that imports data from Kafka into Spark Streaming.
Spark Streaming Example of using Kafka - This is an example that demonstrates how to call KafkaUtils.
Recall that we called the take(N) action instead of the collect() action so that the output fits in memory - no matter how big the input data set may be - this is good practice. This section walks through example code where you'll write the log statistics to a file.
It may not be that useful to have these stats output to a file - in practice, you might write these statistics to a database for
your presentation layer to access.
Now, run LogAnalyzerExportSmallData.java. Try modifying it to write to a database of your own choosing.
If your dataset is large, you can't call collect() or a similar action to read all the data from the RDD onto the single driver program - that could trigger out-of-memory problems. Instead, you have to be careful about saving a large RDD. See these two sections for more information.
Save the RDD to Files - There are built in methods in Spark for saving a large RDD to files.
Save the RDD to a Database - This section contains recommended best practices for saving a large RDD to a
database.
When .saveAsTextFile() is called on an RDD, the .toString() method is called on each RDD element and one element is written per line. The number of files output is equal to the number of partitions of the RDD being saved. In this sample, the RDD is repartitioned to control the number of output files.
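A sketch of that pattern (the RDD name and output path are placeholders):

statsRDD
    .repartition(1)                      // control the number of output part files
    .saveAsTextFile("/tmp/log-stats");   // one element per line, via toString()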
Run LogAnalyzerExportRDD.java now. Notice that the number of files output is the same as the number of partitions of the RDD.
Refer to the API documentation for other built-in methods for saving to files. There are different built-in methods for saving RDDs to files in various formats, so skim the whole RDD package to see if there is something to suit your needs.
Sqoop is a very useful tool that can import Hadoop files into various databases, and is thus a good way to get data that Spark has written out to files into your production database. Alternatively, you can write to a database directly from Spark: use foreachPartition so that one database connection is opened per partition, and batch the writes rather than inserting elements one at a time.
You can use this simple application as a skeleton and combine features from the chapters to produce your own custom
logs analysis application. The main class is LogAnalyzerAppMain.java.
Here are 5 typical stages for creating a production-ready classifier - oftentimes each stage is done with a different set of tools and even by different engineering teams:
1. Scrape/collect a dataset.
2. Clean and explore the data, doing feature extraction.
3. Build a model on the data and iterate/improve it.
4. Improve the model using more and more data, perhaps upgrading your infrastructure to support building larger
models. (Such as migrating over to Hadoop.)
5. Apply the model in real time.
Spark can be used for all of the above, and it is simple to use for all these purposes. We've chosen to break up the language classifier into 3 parts, with one simple Spark program to accomplish each part:
1. Collect a Dataset of Tweets - Spark Streaming is used to collect a dataset of tweets and write them out to files.
2. Examine the Tweets and Train a Model - Spark SQL is used to examine the dataset of Tweets. Then Spark MLLib is used to apply the KMeans algorithm to train a model on the data.
3. Apply the Model in Real-time - Spark Streaming and Spark MLLib are used to filter a live stream of Tweets for those
that match the specified cluster.
% ${YOUR_SPARK_HOME}/bin/spark-submit \
--class "com.databricks.apps.twitter_classifier.Collect" \
--master ${YOUR_SPARK_MASTER:-local[4]} \
target/scala-2.10/spark-twitter-lang-classifier-assembly-1.0.jar \
${YOUR_OUTPUT_DIR:-/tmp/tweets} \
${NUM_TWEETS_TO_COLLECT:-10000} \
${OUTPUT_FILE_INTERVAL_IN_SECS:-10} \
${OUTPUT_FILE_PARTITIONS_EACH_INTERVAL:-1} \
--consumerKey ${YOUR_TWITTER_CONSUMER_KEY} \
--consumerSecret ${YOUR_TWITTER_CONSUMER_SECRET} \
--accessToken ${YOUR_TWITTER_ACCESS_TOKEN} \
--accessTokenSecret ${YOUR_TWITTER_ACCESS_SECRET}
Spark SQL can load JSON files and infer the schema from the data. Here is the code to load the JSON files, register the data in a temp table called "tweetTable", and print out the schema based on that data. Then, take a look at the text of a few sample tweets:
sqlContext.sql(
"SELECT text FROM tweetTable LIMIT 10")
.collect().foreach(println)
View the user language, user name, and text for 10 sample tweets.
sqlContext.sql(
"SELECT user.lang, user.name, text FROM tweetTable LIMIT 10")
.collect().foreach(println)
Finally, show the count of tweets by user language. This can help determine how many clusters would be ideal for this dataset of tweets.
sqlContext.sql(
"SELECT user.lang, COUNT(*) as cnt FROM tweetTable " +
"GROUP BY user.lang ORDER BY cnt DESC limit 1000")
.collect.foreach(println)
object Utils {
...
val numFeatures = 1000
val tf = new HashingTF(numFeatures)
/**
* Create feature vectors by turning each tweet into bigrams of
* characters (an n-gram model) and then hashing those to a
* length-1000 feature vector that we can pass to MLlib.
* This is a common way to decrease the number of features in a
* model while still getting excellent accuracy (otherwise every
* pair of Unicode characters would potentially be a feature).
*/
def featurize(s: String): Vector = {
tf.transform(s.sliding(2).toSeq)
}
...
}
This is the code that actually grabs the tweet text from the tweetTable and featurizes it. KMeans is called to create the specified number of clusters, and the algorithm is run for the specified number of iterations. Finally, the trained model is persisted so it can be loaded later.
Last, here is some code to take a sample set of tweets and print them out by cluster, so that we can see which language clusters our model contains. Pick your favorite cluster to use for part 3.
% ${YOUR_SPARK_HOME}/bin/spark-submit \
--class "com.databricks.apps.twitter_classifier.ExamineAndTrain" \
--master ${YOUR_SPARK_MASTER:-local[4]} \
target/scala-2.10/spark-twitter-lang-classifier-assembly-1.0.jar \
"${YOUR_TWEET_INPUT:-/tmp/tweets/tweets*/part-*}" \
${OUTPUT_MODEL_DIR:-/tmp/tweets/model} \
${NUM_CLUSTERS:-10} \
${NUM_ITERATIONS:-20}
% ${YOUR_SPARK_HOME}/bin/spark-submit \
--class "com.databricks.apps.twitter_classifier.Predict" \
--master ${YOUR_SPARK_MASTER:-local[4]} \
target/scala-2.10/spark-twitter-lang-classifier-assembly-1.0.jar \
${YOUR_MODEL_DIR:-/tmp/tweets/model} \
${CLUSTER_TO_FILTER:-7} \
--consumerKey ${YOUR_TWITTER_CONSUMER_KEY} \
--consumerSecret ${YOUR_TWITTER_CONSUMER_SECRET} \
--accessToken ${YOUR_TWITTER_ACCESS_TOKEN} \
--accessTokenSecret ${YOUR_TWITTER_ACCESS_SECRET}