Spark: Building an application from Start to Finish

Confidential and Proprietary to Daugherty Business Solutions
Spark: Building an application from Start
to Finish
Adam Doyle
STLHUG August 2017

Confidential & Proprietary to Daugherty Business Solutions.
EIM and Analytics
Data Science
• Predictive and Prescriptive Analytics
• Social, Text and Sentiment Analytics
• Natural Language Processing
• Machine Learning, Artificial Intelligence
• SPSS, SAS, R, IBM Watson™
Strategy and Competency Building
• Build the right, comprehensive solution blueprint across
12 Domains
• Establish, specific, actionable plan and ROIs
• Protecting your investments
• Organization, Talent, Competency
• Processes, Methods, Techniques, Tools
• Speed – Agile EIM Transformation
• Governance processes
Customer and Business Analytics
• Customer/Buyer/Channel Segmentation
• Persona Development, Customer Scoring (Value, Potential)
• Attrition Modeling, Engagement and Response Modeling
• Inventory Management, Marketing Campaigns
• Product Design Analytics, Workforce Planning, Location
Based Advertising
• Data Monetization
Traditional Data Warehouse and Business
Intelligence
• EDW, ODS, Data Mart and Integration
• Master Data Management
• Data Governance
• Dashboards, Scorecards,
• Reports , Alerts
• Multidimensional Analysis
• Ad hoc slicing and dicing
• Self Service Enablement
• Cloud Migration and Agile EIM
ANALYTICS
STRATEGY
EIM and
ANALYTICS
400+ employees
strong
Digital Engagement/Analytics
• Customer Engagement Strategies
• Omni-channel and Integrated Marketing
• Strategic Planning, Building and Executing
Digital and Customer Engagement Solutions.
Big Data and Next
Generation Technologies
• Data Lab Development Centers
• Data Lakes, Analytic Platforms
• Hadoop (Cloudera, Hortonworks)
• NoSQL / Graph DB (MongoDB, DataStax
• Cloud platforms (AWS, Google, Azure)
• Spark, Sqoop, Hive, Pig, Kafka, etc.

• 20 year veteran of the St. Louis
IT community
• Co-Organizer, St. Louis Hadoop
User Group
• Big Data Community Lead,
Daugherty Business Solutions
• Formerly Big Data Solution
Architect at Amitech, Lead Big
Data developer at Mercy
• Speaker at local and national
Big Data conferences
Meet Adam Doyle
3

• GDPR
• Why Spark
• Infrastructure
• Language
• Components
• Development
• Debugging
• Deployment
• Monitoring
• Questions
4
Agenda

• If your company stores data about citizens from any of the countries in
the European Union, you should be preparing for GDPR.
• Here is the official link http://www.eugdpr.org/ Here is why:
(Financial) Penalties: Under GDPR organizations in breach of GDPR
can be fined up to 4% of annual global turnover (Revenue) or €20
Million (whichever is greater). This is the maximum fine that can
be imposed for the most serious infringements e.g. not having
sufficient customer consent to process data or violating the core of
Privacy by Design concepts.
5
GDPR

• Privacy by Design is an approach to systems engineering which
takes privacy into account throughout the whole engineering process.
• Privacy by Design is not about data protection but designing so data
doesn't need protection. The root principle therefore is based on
enabling service without data control transfer from the citizen to the
system (the citizen become identifiable or recognizable).
• How Privacy by Design is achieved depends on the application,
technologies and choice of approach.
• Whether the methodology actually achieve Privacy by Design is not to
be evaluated based on intent or approach, but outcome. I.e. if data
do not need protection to not represent a risk to the citizens, the
principle of Privacy by Design can be said to be achieved.
• Even "anonymized" data are still personal data if data are derived from
personal data outside the control of the individual citizen in question
or any means to re-identify or recognize citizens or citizen devices
exist. Anonymous isn’t anonymous if you can reverse it.
6
Privacy by Design

• One simple example is Dynamic Host Configuration Protocol (DHCP) where
devices based on random identifiers gets an IP from the server and thus is enabled
to communicate without having leaked personal identifiers per se.
• A more advanced example is Global Positioning System where devices client-side
can detect their geographical location without leaking identity or location.
• Another example in Internet of Things is RFID where citizens' ability to
communicate with their devices without leaking identifiers can be achieved
using Zero-knowledge proof.
7
Good Privacy by Design

• Is your company doing business with European entities?
• Have you analyzed your IT infrastructure for privacy leakage?
• Can someone reverse your anonymity algorithms?
• How can you better address privacy concerns?
• “The cloud industry is aware of GDPR. We're actually scrambling to find
more IT consulting firms/partners who could handle this to come on
board with us. Many of the U.S. companies, some in St. Louis, that do
business with EU, have data stored in EU, or are acquired by/acquiring E
• https://www.datanami.com/this-just-in/mapr-talend-collaborate-deliver-
governed-gdpr-data-lake-solution/
• https://hortonworks.com/webinar/apache-atlas-ranger-can-help-become-
gdpr-compliant/
8
Discussion

Confidential and Proprietary to Daugherty Business Solutions 9
Why Spark in Pictures
https://www.sigmoid.com/apache-spark-internals/

• The client wants to get a real-time view of where the tweets about
them are coming from.
10
Problem statement

When deploying Spark for use, you have a couple of options
11
Infrastructure

• Scala
• Java
• Python
12
Language
val lines = sc.textFile("data.txt")
val lineLengths = lines.map(s => s.length)
val totalLength = lineLengths.reduce((a, b) => a + b)
JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
int totalLength = lineLengths.reduce((a, b) -> a + b);
lines = sc.textFile("data.txt")
lineLengths = lines.map(lambda s: len(s))
totalLength = lineLengths.reduce(lambda a, b: a + b)

Components

• Spark Batch (or Core) is used to perform batch computations on sets
of data.
• Becoming the base language for much of the current stack of Hadoop
processing
– Mahout moves from MapReduce engine to Spark
• Expressed as a series of transformations and actions on your RDDs.
• Transformations are lazily applied once an action is invoked.
14
Spark Batch

• Module for structured data processing.
• Interaction modes include
– SQL
– DataFrames API
– Datasets API
• Can join sets of objects with tables
• Can be used to expose data sets to external applications
15
Spark SQL

Spark Streaming

• Combination of Spark SQL and Spark Streaming.
• Provides
– Fast
– Scalable
– Fault-tolerant
– End to end
– Exactly One
stream processing without the user having to reason about streaming.
• New in v2.1
• Hadoop vendors may not have implemented this functionality
17
Structured Streaming

• Spark’s Machine Learning Library
• ML Algorithms
– Classification
– Regression
– Clustering
– Collaborative Filtering
• Featurization
• Pipelines
• Persistence
• Data Science Utilities
18
Spark MLIB

• Graph processing
• Includes
– Graph abstraction
– Graph operations
– Pregel API
– Graph algorithms
– Graph Builders
19
GraphX

• Transformations
• Actions
20
Spark Development
T:filter
T:map
T:flatMap
A:forEach
T:map
A:save

• Start the REPL from the command line:
– spark-shell
• Creates Interactive Scala interpreter with Spark libraries
• Can add additional libraries into REPL on Launch
21
REPL

// Create a local StreamingContext with two working thread and batch
// interval of 60 second
SparkConf conf = new
SparkConf().setMaster("local[2]").setAppName("SparkDemo");
JavaStreamingContext jssc = new JavaStreamingContext(conf,
Durations.seconds(60));
SparkSession spark = SparkSession
.builder()
.appName("Java Spark SQL basic example")
.getOrCreate();
// Get Data
// Process Data
// Act on the Data
jssc.start();
try {
jssc.awaitTermination();
} catch (InterruptedException e) {
e.printStackTrace();
}
22
Shell of our application

Get the Data
// Get Data
JavaReceiverInputDStream<Status> stream =
TwitterUtils.createStream(jssc, filters);

Transformations Actions
filter(func) reduce(func)
flatMap(func) collect()
map(func) count()
mapPartitions(func) first()
sample(withReplacement, fraction, seed) take(n)
union(otherDataset) takeSample(withReplacement, num, [seed])
intersection(otherDataset) takeOrdered(n, [ordering])
pipe(command, [envVars]) saveAsSequenceFile(path) (Java and Scala)
coalesce(numPartitions) saveAsObjectFile(path) (Java and Scala)
repartition(numPartitions) saveAsTextFile(path)
repartitionAndSortWithinPartitions(partitioner) countByKey()
join(otherDataset, [numTasks]) foreach(func)
cogroup(otherDataset, [numTasks])
groupByKey([numTasks])
reduceByKey(func, [numTasks])
aggregateByKey(zeroValue)(seqOp, combOp,
[numTasks])
sortByKey([ascending], [numTasks])
24
Transformations and Actions

// Get Data
JavaReceiverInputDStream<Status> stream =
TwitterUtils.createStream(jssc, filters);
// Process Data
JavaDStream<Status> filteredStream = stream.filter(new
FilterNullGeoLocation());
JavaPairDStream<String, Long> geoHashCounts =
filteredStream
.mapToPair(new StatusToGeoHashCountPair());
// Act on the Data
geoHashCounts = geoHashCounts.reduceByKey(new
CombineCounts());
geoHashCounts.print();
geoHashCounts.foreachRDD(new SaveAsPlaces(spark));
25
Process the Data

Bring the Func
JavaPairDStream<String, Long> geoHashCounts =
filteredStream.mapToPair(new StatusToGeoHashCountPair());
public class StatusToGeoHashCountPair implements
PairFunction<Status, String, Long> {
@Override
public Tuple2<String, Long> call(Status status) throws
Exception {
return new Tuple2<String,
Long>(GeoHash.geoHashStringWithCharacterPrecision(
status.getGeoLocation().getLatitude(),
status.getGeoLocation().getLongitude(), 5),
1L);
}
}

public class StatusToGeoHashCountPair implements PairFunction<Status, String,
Long> {
@Override
public Tuple2<String, Long> call(Status status) throws Exception {
return new Tuple2<String, Long>
( GeoHash.geoHashStringWithCharacterPrecision(
status.getGeoLocation().getLatitude(),
status.getGeoLocation().getLongitude(), 5)
, 1L);
}
}
public class StatusToGeoHashCountPairTest {
@Test
public void getsExpectedResult() throws Exception {
Status status = mock(Status.class);
when(status.getGeoLocation()).thenReturn(new GeoLocation(10, -20));
Tuple2<String, Long> tuple = new StatusToGeoHashCountPair().call(status);
assertEquals("e9cbb", tuple._1());
assertEquals(new Long(1L), tuple._2());
}
}
27
Testing Your functions example

• Understanding closures
• ForEach; Connections
• Stream Start
28
Lessons Learned

• Non-deterministic run-time
• Testing a distributed application
29
Debugging

• Deploy each jar to each client
• Create a Maven Uberjar using Shade
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>3.0.0</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
30
Deployment

• Finding errors
• Keeping streaming timeline below processing line
31
Monitoring
#Trump
#McDonalds

Join Our Team
Contact:
Adam.doyle@daugherty.com

Spark: Building an application from Start to Finish

More Related Content

Spark: Building an application from Start to Finish