Slides from August 2017 STLHUG User Group. Adam Doyle discussed GDPR and building Spark applications..
Report
Share
Report
Share
1 of 33
More Related Content
Spark: Building an application from Start to Finish
1. Confidential and Proprietary to Daugherty Business Solutions
Spark: Building an application from Start
to Finish
Adam Doyle
STLHUG August 2017
2. Confidential & Proprietary to Daugherty Business Solutions.
EIM and Analytics
Data Science
• Predictive and Prescriptive Analytics
• Social, Text and Sentiment Analytics
• Natural Language Processing
• Machine Learning, Artificial Intelligence
• SPSS, SAS, R, IBM Watson™
Strategy and Competency Building
• Build the right, comprehensive solution blueprint across
12 Domains
• Establish, specific, actionable plan and ROIs
• Protecting your investments
• Organization, Talent, Competency
• Processes, Methods, Techniques, Tools
• Speed – Agile EIM Transformation
• Governance processes
Customer and Business Analytics
• Customer/Buyer/Channel Segmentation
• Persona Development, Customer Scoring (Value, Potential)
• Attrition Modeling, Engagement and Response Modeling
• Inventory Management, Marketing Campaigns
• Product Design Analytics, Workforce Planning, Location
Based Advertising
• Data Monetization
Traditional Data Warehouse and Business
Intelligence
• EDW, ODS, Data Mart and Integration
• Master Data Management
• Data Governance
• Dashboards, Scorecards,
• Reports , Alerts
• Multidimensional Analysis
• Ad hoc slicing and dicing
• Self Service Enablement
• Cloud Migration and Agile EIM
ANALYTICS
STRATEGY
EIM and
ANALYTICS
400+ employees
strong
Digital Engagement/Analytics
• Customer Engagement Strategies
• Omni-channel and Integrated Marketing
• Strategic Planning, Building and Executing
Digital and Customer Engagement Solutions.
Big Data and Next
Generation Technologies
• Data Lab Development Centers
• Data Lakes, Analytic Platforms
• Hadoop (Cloudera, Hortonworks)
• NoSQL / Graph DB (MongoDB, DataStax
• Cloud platforms (AWS, Google, Azure)
• Spark, Sqoop, Hive, Pig, Kafka, etc.
3. Confidential and Proprietary to Daugherty Business Solutions
• 20 year veteran of the St. Louis
IT community
• Co-Organizer, St. Louis Hadoop
User Group
• Big Data Community Lead,
Daugherty Business Solutions
• Formerly Big Data Solution
Architect at Amitech, Lead Big
Data developer at Mercy
• Speaker at local and national
Big Data conferences
Meet Adam Doyle
3
4. Confidential and Proprietary to Daugherty Business Solutions
• GDPR
• Why Spark
• Infrastructure
• Language
• Components
• Development
• Debugging
• Deployment
• Monitoring
• Questions
4
Agenda
5. Confidential and Proprietary to Daugherty Business Solutions
• If your company stores data about citizens from any of the countries in
the European Union, you should be preparing for GDPR.
• Here is the official link http://www.eugdpr.org/ Here is why:
(Financial) Penalties: Under GDPR organizations in breach of GDPR
can be fined up to 4% of annual global turnover (Revenue) or €20
Million (whichever is greater). This is the maximum fine that can
be imposed for the most serious infringements e.g. not having
sufficient customer consent to process data or violating the core of
Privacy by Design concepts.
5
GDPR
6. Confidential and Proprietary to Daugherty Business Solutions
• Privacy by Design is an approach to systems engineering which
takes privacy into account throughout the whole engineering process.
• Privacy by Design is not about data protection but designing so data
doesn't need protection. The root principle therefore is based on
enabling service without data control transfer from the citizen to the
system (the citizen become identifiable or recognizable).
• How Privacy by Design is achieved depends on the application,
technologies and choice of approach.
• Whether the methodology actually achieve Privacy by Design is not to
be evaluated based on intent or approach, but outcome. I.e. if data
do not need protection to not represent a risk to the citizens, the
principle of Privacy by Design can be said to be achieved.
• Even "anonymized" data are still personal data if data are derived from
personal data outside the control of the individual citizen in question
or any means to re-identify or recognize citizens or citizen devices
exist. Anonymous isn’t anonymous if you can reverse it.
6
Privacy by Design
7. Confidential and Proprietary to Daugherty Business Solutions
• One simple example is Dynamic Host Configuration Protocol (DHCP) where
devices based on random identifiers gets an IP from the server and thus is enabled
to communicate without having leaked personal identifiers per se.
• A more advanced example is Global Positioning System where devices client-side
can detect their geographical location without leaking identity or location.
• Another example in Internet of Things is RFID where citizens' ability to
communicate with their devices without leaking identifiers can be achieved
using Zero-knowledge proof.
7
Good Privacy by Design
8. Confidential and Proprietary to Daugherty Business Solutions
• Is your company doing business with European entities?
• Have you analyzed your IT infrastructure for privacy leakage?
• Can someone reverse your anonymity algorithms?
• How can you better address privacy concerns?
• “The cloud industry is aware of GDPR. We're actually scrambling to find
more IT consulting firms/partners who could handle this to come on
board with us. Many of the U.S. companies, some in St. Louis, that do
business with EU, have data stored in EU, or are acquired by/acquiring E
• https://www.datanami.com/this-just-in/mapr-talend-collaborate-deliver-
governed-gdpr-data-lake-solution/
• https://hortonworks.com/webinar/apache-atlas-ranger-can-help-become-
gdpr-compliant/
8
Discussion
9. Confidential and Proprietary to Daugherty Business Solutions 9
Why Spark in Pictures
https://www.sigmoid.com/apache-spark-internals/
10. Confidential and Proprietary to Daugherty Business Solutions
• The client wants to get a real-time view of where the tweets about
them are coming from.
10
Problem statement
11. Confidential and Proprietary to Daugherty Business Solutions
When deploying Spark for use, you have a couple of options
11
Infrastructure
12. Confidential and Proprietary to Daugherty Business Solutions
• Scala
• Java
• Python
12
Language
val lines = sc.textFile("data.txt")
val lineLengths = lines.map(s => s.length)
val totalLength = lineLengths.reduce((a, b) => a + b)
JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
int totalLength = lineLengths.reduce((a, b) -> a + b);
lines = sc.textFile("data.txt")
lineLengths = lines.map(lambda s: len(s))
totalLength = lineLengths.reduce(lambda a, b: a + b)
14. Confidential and Proprietary to Daugherty Business Solutions
• Spark Batch (or Core) is used to perform batch computations on sets
of data.
• Becoming the base language for much of the current stack of Hadoop
processing
– Mahout moves from MapReduce engine to Spark
• Expressed as a series of transformations and actions on your RDDs.
• Transformations are lazily applied once an action is invoked.
14
Spark Batch
15. Confidential and Proprietary to Daugherty Business Solutions
• Module for structured data processing.
• Interaction modes include
– SQL
– DataFrames API
– Datasets API
• Can join sets of objects with tables
• Can be used to expose data sets to external applications
15
Spark SQL
17. Confidential and Proprietary to Daugherty Business Solutions
• Combination of Spark SQL and Spark Streaming.
• Provides
– Fast
– Scalable
– Fault-tolerant
– End to end
– Exactly One
stream processing without the user having to reason about streaming.
• New in v2.1
• Hadoop vendors may not have implemented this functionality
17
Structured Streaming
18. Confidential and Proprietary to Daugherty Business Solutions
• Spark’s Machine Learning Library
• ML Algorithms
– Classification
– Regression
– Clustering
– Collaborative Filtering
• Featurization
• Pipelines
• Persistence
• Data Science Utilities
18
Spark MLIB
19. Confidential and Proprietary to Daugherty Business Solutions
• Graph processing
• Includes
– Graph abstraction
– Graph operations
– Pregel API
– Graph algorithms
– Graph Builders
19
GraphX
20. Confidential and Proprietary to Daugherty Business Solutions
• Transformations
• Actions
20
Spark Development
T:filter
T:map
T:flatMap
A:forEach
T:map
A:save
21. Confidential and Proprietary to Daugherty Business Solutions
• Start the REPL from the command line:
– spark-shell
• Creates Interactive Scala interpreter with Spark libraries
• Can add additional libraries into REPL on Launch
21
REPL
22. Confidential and Proprietary to Daugherty Business Solutions
// Create a local StreamingContext with two working thread and batch
// interval of 60 second
SparkConf conf = new
SparkConf().setMaster("local[2]").setAppName("SparkDemo");
JavaStreamingContext jssc = new JavaStreamingContext(conf,
Durations.seconds(60));
SparkSession spark = SparkSession
.builder()
.appName("Java Spark SQL basic example")
.getOrCreate();
// Get Data
// Process Data
// Act on the Data
jssc.start();
try {
jssc.awaitTermination();
} catch (InterruptedException e) {
e.printStackTrace();
}
22
Shell of our application
23. Confidential and Proprietary to Daugherty Business Solutions 23
Get the Data
// Get Data
JavaReceiverInputDStream<Status> stream =
TwitterUtils.createStream(jssc, filters);
25. Confidential and Proprietary to Daugherty Business Solutions
// Get Data
JavaReceiverInputDStream<Status> stream =
TwitterUtils.createStream(jssc, filters);
// Process Data
JavaDStream<Status> filteredStream = stream.filter(new
FilterNullGeoLocation());
JavaPairDStream<String, Long> geoHashCounts =
filteredStream
.mapToPair(new StatusToGeoHashCountPair());
// Act on the Data
geoHashCounts = geoHashCounts.reduceByKey(new
CombineCounts());
geoHashCounts.print();
geoHashCounts.foreachRDD(new SaveAsPlaces(spark));
25
Process the Data
26. Confidential and Proprietary to Daugherty Business Solutions 26
Bring the Func
JavaPairDStream<String, Long> geoHashCounts =
filteredStream.mapToPair(new StatusToGeoHashCountPair());
public class StatusToGeoHashCountPair implements
PairFunction<Status, String, Long> {
@Override
public Tuple2<String, Long> call(Status status) throws
Exception {
return new Tuple2<String,
Long>(GeoHash.geoHashStringWithCharacterPrecision(
status.getGeoLocation().getLatitude(),
status.getGeoLocation().getLongitude(), 5),
1L);
}
}
27. Confidential and Proprietary to Daugherty Business Solutions
public class StatusToGeoHashCountPair implements PairFunction<Status, String,
Long> {
@Override
public Tuple2<String, Long> call(Status status) throws Exception {
return new Tuple2<String, Long>
( GeoHash.geoHashStringWithCharacterPrecision(
status.getGeoLocation().getLatitude(),
status.getGeoLocation().getLongitude(), 5)
, 1L);
}
}
public class StatusToGeoHashCountPairTest {
@Test
public void getsExpectedResult() throws Exception {
Status status = mock(Status.class);
when(status.getGeoLocation()).thenReturn(new GeoLocation(10, -20));
Tuple2<String, Long> tuple = new StatusToGeoHashCountPair().call(status);
assertEquals("e9cbb", tuple._1());
assertEquals(new Long(1L), tuple._2());
}
}
27
Testing Your functions example
28. Confidential and Proprietary to Daugherty Business Solutions
• Understanding closures
• ForEach; Connections
• Stream Start
28
Lessons Learned
29. Confidential and Proprietary to Daugherty Business Solutions
• Non-deterministic run-time
• Testing a distributed application
29
Debugging
30. Confidential and Proprietary to Daugherty Business Solutions
• Deploy each jar to each client
• Create a Maven Uberjar using Shade
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>3.0.0</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
30
Deployment
31. Confidential and Proprietary to Daugherty Business Solutions
• Finding errors
• Keeping streaming timeline below processing line
31
Monitoring
#Trump
#McDonalds