©2015 IBM Corporation
Spark + Watson + Twitter
DataPalooza SF 2015
David Taieb
STSM - IBM Cloud Data Services
Agenda
• Introduction
• Quick Introduction to Spark
• Set up development environment and create the hello world application
• Notebook Walk-through
• Spark Streaming
• Deep dive: Sentiment analysis with Twitter and Watson Tone Analyzer
• Architectural Overview
• Set up the Bluemix services: Watson Tone Analyzer, Message Hub and Event Hub
• Create the Streaming Receiver to connect to Kafka (Scala)
• Create analytics using Jupyter Notebook (Python)
• Create Real-time Web Dashboard (Nodejs)
Introduction
Our mission:
We are here to help developers realize their most ambitious projects.
Goals for today’s session:
• Introduction to real-time analytics using Spark Streaming
• Technical deep dive on the Spark + Watson + Twitter sample application
• At the end of this session, you should be able to download the source code and run the application on IBM Analytics for Apache Spark
What is Spark
Spark is an open source, in-memory computing framework for distributed data processing and iterative analysis on massive data volumes
Spark Core Libraries
• Spark Core: general compute engine; handles distributed task dispatching, scheduling and basic I/O functions
• Spark SQL: executes SQL statements
• Spark Streaming: performs streaming analytics using micro-batches
• MLlib (machine learning): common machine learning and statistical algorithms
• GraphX (graph): distributed graph processing framework
Key reasons for interest in Spark
• Fast
- In-memory storage greatly reduces disk I/O
- Up to 100x faster than Hadoop MapReduce in memory, 10x faster on disk
• Open Source
- Largest project and one of the most active on Apache
- Vibrant, growing community of developers continuously improving the code base and extending its capabilities
- Fast adoption in the enterprise (IBM, Databricks, etc.)
• Distributed data processing
- Fault tolerant: seamlessly recomputes data lost to hardware failure
- Scalable: easily increase the number of worker nodes
- Flexible job execution: batch, streaming, interactive
• Web scale
- Easily handles petabytes of data without special code handling
- Compatible with the existing Hadoop ecosystem
• Productive
- Unified programming model across a range of use cases
- Rich and expressive APIs hide the complexities of parallel computing and worker node management
- Support for Java, Scala, Python and R: less code written
- Includes a set of core libraries that enable various analytic methods: Spark SQL, MLlib, GraphX
High level architecture
[Diagram: two ways of running a Spark application against a Spark cluster, which consists of a Master (cluster manager) and Worker Nodes]
• Batch job (spark-submit): Spark Application (driver) → Master (cluster manager) → Worker Nodes
• Interactive notebook: Browser ↔ Notebook Server over HTTP/WebSockets; Notebook Server ↔ Kernel over the kernel protocol (e.g. ZeroMQ); Kernel → Master (cluster manager) → Worker Nodes
• Responsibilities shown in the diagram: RDD partitioning; task packaging and dispatching; worker node scheduling
Spark programming model lifecycle
1. Load data into RDDs
• In-memory collection:
- sc.parallelize
• Unstructured data:
- Text: sc.textFile
- HDFS: sc.hadoopFile
• Structured data:
- JSON: sqlCtxt.jsonFile
- Parquet: sqlCtxt.parquetFile
- JDBC: sqlCtxt.load
- Custom data source: 1.4+
• Streaming data:
- TwitterUtils.createStream
- KafkaUtils.createStream
- FlumeUtils.createStream
- MQTTUtils.createStream
- Custom DStream
• sc: SparkContext entry point, created by the application or automatically provided by the notebook shell
• sqlCtxt: SQLContext entry point for working with DataFrames and executing SQL queries
2. Apply transformations to create new RDDs
• Create new RDDs by applying transformations to existing ones
- map(fn): apply fn to all elements in the RDD
- flatMap(fn): same as map, but fn can return 0 or more elements
- filter(fn): select only the elements for which fn returns true
- reduceByKey
- sortByKey
- sample: sample a fraction of the data
- union: combine the elements of 2 RDDs
- intersection: intersect 2 RDDs
- distinct: remove duplicate elements
- ….
3. Apply actions (analytics) to produce results
• Produce results from running analytics against RDDs
- reduce(fn): perform a summary operation on the elements
- collect(): return all elements in an Array
- count(): count the number of elements in the RDD
- take(n): return the first n elements in an Array
- foreach(fn): execute fn on all the elements in the RDD
- saveAsTextFile: persist the elements in a text file
- ….
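The three lifecycle stages can be mimicked without a cluster. The sketch below is a plain-Python analogy, not Spark code: a list stands in for an RDD, and the sample data and variable names (`lines`, `words`, `pairs`, `counts`) are illustrative assumptions.

```python
from collections import defaultdict

# 1. Load: a plain list stands in for sc.parallelize([...])
lines = ["spark streaming", "spark sql", "", "graphx"]

# 2. Transform: flatMap -> filter -> map, each step yields a new collection
words = [w for line in lines for w in line.split()]   # flatMap(fn): 0 or more items per line
words = [w for w in words if w != "graphx"]           # filter(fn)
pairs = [(w, 1) for w in words]                       # map(fn)

# reduceByKey analogue: merge the values that share a key
counts = defaultdict(int)
for key, value in pairs:
    counts[key] += value

# 3. Action: count() and collect() analogues produce the final results
total = len(pairs)
result = sorted(counts.items())
print(total, result)
```

In Spark the same chain would run against partitions spread over the worker nodes; the list-based version only illustrates the order of the steps.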
Job Scheduling
Ecosystem of the IBM Analytics for Apache Spark as a service
Setup local development Environment
• Prerequisites
- Scala runtime 2.10.4 http://www.scala-lang.org/download/2.10.4.html
- Homebrew http://brew.sh/
- Scala sbt http://www.scala-sbt.org/download.html
- Spark 1.3.1 http://www.apache.org/dyn/closer.lua/spark/spark-1.3.1/spark-1.3.1.tgz
• Detailed instructions here: https://developer.ibm.com/clouddataservices/start-developing-with-spark-and-notebooks/
Setup local development Environment contd..
• Create a Scala project using sbt
• Create directories to start from scratch
mkdir helloSpark && cd helloSpark
mkdir -p src/main/scala
mkdir -p src/main/java
mkdir -p src/main/resources
• Create a subdirectory under the src/main/scala directory
mkdir -p com/ibm/cds/spark/samples
• GitHub URL for the same project: https://github.com/ibm-cds-labs/spark.samples
Setup local development Environment contd..
• Create HelloSpark.scala using an IDE or a text editor
• Copy and paste this code snippet
package com.ibm.cds.spark.samples
import org.apache.spark._
object HelloSpark {
    //main method invoked when running as a standalone Spark Application
    def main(args: Array[String]) {
        val conf = new SparkConf().setAppName("Hello Spark")
        val spark = new SparkContext(conf)

        println("Hello Spark Demo. Compute the mean and variance of a collection")
        val stats = computeStatsForCollection(spark);
        println(">>> Results: ")
        println(">>>>>>>Mean: " + stats._1 );
        println(">>>>>>>Variance: " + stats._2);
        spark.stop()
    }

    //Library method that can be invoked from Jupyter Notebook
    def computeStatsForCollection( spark: SparkContext, countPerPartitions: Int = 100000, partitions: Int=5): (Double, Double) = {
        val totalNumber = math.min( countPerPartitions * partitions,
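The listing above is cut off before the statistics are computed. As a rough illustration of how a mean and variance can be assembled from partitioned data, here is a plain-Python sketch. It is an assumption about one possible approach, not the author's implementation: the helper names (`partition_stats`, `merge_stats`, `mean_and_variance`) are hypothetical, and each partition is summarized as (count, sum, sum of squares).

```python
def partition_stats(values):
    """Summarize one partition as (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))

def merge_stats(a, b):
    """Combine two partition summaries element by element."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def mean_and_variance(partitions):
    """Return (mean, population variance) over all partitions."""
    total = (0, 0.0, 0.0)
    for part in partitions:
        total = merge_stats(total, partition_stats(part))
    count, s, sq = total
    mean = s / count
    variance = sq / count - mean * mean
    return mean, variance

partitions = [[1.0, 2.0], [3.0, 4.0], [5.0]]
mean, variance = mean_and_variance(partitions)
print(mean, variance)
```

The sum-of-squares formula is simple to merge across partitions but can lose precision on large values, which is one reason a real application may prefer a built-in statistics routine.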
Setup local development Environment contd..
• Create a file build.sbt under the project root directory:
name := "helloSpark"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies ++= {
    val sparkVersion =  "1.3.1"
    Seq(
        "org.apache.spark" %% "spark-core" % sparkVersion,
        "org.apache.spark" %% "spark-sql" % sparkVersion,
        "org.apache.spark" %% "spark-repl" % sparkVersion
    )
}
• Under the project root directory run:
- Download all dependencies: $ sbt update
- Compile: $ sbt compile
- Package an application jar file: $ sbt package
Hello World application on Bluemix Apache Starter
Introduction to Notebooks
‣ Notebooks allow creation of interactive executable documents that include rich text with Markdown, executable code in Scala, Python or R, and graphics with matplotlib
‣ Apache Spark provides APIs in multiple languages that can be executed from a REPL shell: Scala, Python (PySpark), R
‣ Multiple open-source implementations available:
- Jupyter: https://jupyter.org
- Apache Zeppelin: http://zeppelin-project.org
Notebook walkthrough
‣ Sign up on Bluemix: https://console.ng.bluemix.net/registration/
‣ Getting started with Analytics for Apache Spark: https://www.ng.bluemix.net/docs/services/AnalyticsforApacheSpark/index.html
‣ You can also follow the tutorial here: https://developer.ibm.com/clouddataservices/start-developing-with-spark-and-notebooks/
Spark Streaming
‣ “Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams” (http://spark.apache.org/docs/latest/streaming-programming-guide.html)
‣ Breaks down the streaming data into smaller pieces (micro-batches), which are then sent to the Spark engine
‣ Provides connectors for multiple data sources:
- Kafka
- Flume
- Twitter
- MQTT
- ZeroMQ
‣ Provides an API to create custom connectors. Lots of examples are available on GitHub and spark-packages.org
Spark + Twitter + Watson application
‣ Use Spark Streaming in combination with IBM Watson to perform sentiment analysis and track how a conversation is trending on Twitter.
‣ Use Spark Streaming to create a feed that captures live tweets from Twitter. You can optionally filter the tweets that contain the hashtag(s) of your choice.
‣ The tweet data is then enriched in real time with various sentiment scores provided by the Watson Tone Analyzer service (available on Bluemix). This service provides insight into sentiment, or how the author feels.
‣ The data is then loaded and analyzed by the data scientist within a notebook.
‣ We can also use streaming analytics to feed a real-time web app dashboard.
About this sample application
• Github: https://github.com/ibm-cds-labs/spark.samples/tree/master/streaming-twitter
• Tutorial: https://developer.ibm.com/clouddataservices/sentiment-analysis-of-twitter-hashtags
• A word about Scala
• Scala is object-oriented but also supports a functional programming style
• Bi-directional interoperability with Java
• Resources:
• Official web site: http://scala-lang.org
• Excellent first steps site: http://www.artima.com/scalazine/articles/steps.html
• Free e-books: http://readwrite.com/2011/04/30/5-free-b-books-and-tutorials-o
Spark Streaming with “IBM Insight for Twitter” and “Watson Tone Analyzer”
[Architecture diagram. Labels recovered from the figure:]
• Data source: Full Archive Search API
• Bluemix services: Message Hub, Watson Tone Analyzer, Event Hub
• Spark Streaming: producer stream and consumer stream around Message Hub; data is enriched with emotion tone scores from Watson Tone Analyzer
• Processed data is explored in a Scala notebook and an IPython notebook
• Topics are published from the Spark analytics results; a consumer of the Spark topics feeds the Real-Time Dashboard
• Personas: Data Engineer, Data Scientist, Business Analyst, C-Suite
Building a Spark Streaming application
Sentiment analysis with Twitter and Watson Tone Analyzer
‣Configure Twitter and Watson Tone Analyzer
1. Configure OAuth credentials for Twitter
2. Create a Watson Tone Analyzer Service on Bluemix
3. Configure MessageHub Service on Bluemix (Kafka)
4. Configure EventHub Service on Bluemix
Configure OAuth credentials for Twitter
‣ You can follow along the steps in https://developer.ibm.com/clouddataservices/sentiment-analysis-of-twitter-hashtags/#twitter
Create a Watson Tone Analyzer Service on Bluemix
‣ You can follow along the steps in https://developer.ibm.com/clouddataservices/sentiment-analysis-of-twitter-hashtags/#bluemix
Building a Spark Streaming application
Sentiment analysis with Twitter and Watson Tone Analyzer
‣Work with Twitter data
1. Create a Twitter Stream
2. Enrich the data with sentiment analysis from
Watson Tone Analyzer
3. Aggregate data into RDD with enriched Data model
4. Create SparkSQL DataFrame and register Table
Create a Twitter Stream
//Hold configuration key/value pairs:
//a map that stores the credentials for the Twitter and Watson services
val config = Map[String, String](
    ("twitter4j.oauth.consumerKey", Option(System.getProperty("twitter4j.oauth.consumerKey")).orNull ),
    ("twitter4j.oauth.consumerSecret", Option(System.getProperty("twitter4j.oauth.consumerSecret")).orNull ),
    ("twitter4j.oauth.accessToken", Option(System.getProperty("twitter4j.oauth.accessToken")).orNull ),
    ("twitter4j.oauth.accessTokenSecret", Option(System.getProperty("twitter4j.oauth.accessTokenSecret")).orNull ),
    ("tweets.key", Option(System.getProperty("tweets.key")).getOrElse("")),
    ("watson.tone.url", Option(System.getProperty("watson.tone.url")).orNull ),
    ("watson.tone.username", Option(System.getProperty("watson.tone.username")).orNull ),
    ("watson.tone.password", Option(System.getProperty("watson.tone.password")).orNull )
)
//Twitter4j requires credentials to be stored in System properties
//Skip unset (null) values: System.setProperty rejects a null value
config.foreach( (t:(String,String)) =>
    if ( t._1.startsWith( "twitter4j") && t._2 != null ) System.setProperty( t._1, t._2 )
)
//Requires: import com.google.common.base.CharMatcher
//Filter the tweets to keep only the ones with English as the language
//twitterStream is a discretized stream of twitter4j Status objects
var twitterStream = org.apache.spark.streaming.twitter.TwitterUtils.createStream( ssc, None )
    .filter { status =>
        //Allow only tweets that use "en" as the language
        Option(status.getUser).flatMap[String] {
            u => Option(u.getLang)
        }.getOrElse("").startsWith("en") &&
        //Only pick text that is ASCII
        CharMatcher.ASCII.matchesAllOf(status.getText) &&
        //If the user specified #hashtags to monitor
        ( keys.isEmpty || keys.exists{status.getText.contains(_)})
    }
Initial DStream of Status Objects
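The three filter conditions can be checked in isolation. The following plain-Python sketch mirrors them on simple values instead of twitter4j Status objects; the function name `keep_tweet` and the sample tweets are hypothetical and only serve to show which inputs pass.

```python
def keep_tweet(user_lang, text, keys):
    """Mirror the three filter conditions applied to each tweet."""
    lang_ok = (user_lang or "").startswith("en")        # a missing language counts as ""
    ascii_ok = all(ord(ch) < 128 for ch in text)        # ASCII-only text
    key_ok = not keys or any(k in text for k in keys)   # no keys means keep everything
    return lang_ok and ascii_ok and key_ok

tweets = [
    ("en", "Watching the #SuperBloodMoon tonight", ["#SuperBloodMoon"]),
    ("fr", "#SuperBloodMoon", ["#SuperBloodMoon"]),
    (None, "no language set", []),
    ("en-gb", "caf\u00e9 time", []),
    ("en", "nothing to see", ["#SuperBloodMoon"]),
]
kept = [keep_tweet(lang, text, keys) for lang, text, keys in tweets]
print(kept)
```

Only the first tweet passes: the others fail on language, missing language, non-ASCII text and missing hashtag respectively.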
Enrich the data with sentiment analysis from Watson
Tone Analyzer
//Broadcast the config to each worker node
val broadcastVar = sc.broadcast(config)
Initial DStream of Status Objects
Data Model
|-- author: string (nullable = true)
|-- date: string (nullable = true)
|-- lang: string (nullable = true)
|-- text: string (nullable = true)
|-- lat: integer (nullable = true)
|-- long: integer (nullable = true)
|-- Cheerfulness: double (nullable = true)
|-- Negative: double (nullable = true)
|-- Anger: double (nullable = true)
|-- Analytical: double (nullable = true)
|-- Confident: double (nullable = true)
|-- Tentative: double (nullable = true)
|-- Openness: double (nullable = true)
|-- Agreeableness: double (nullable = true)
|-- Conscientiousness: double (nullable = true)
DStream of key, value pairs
Aggregate data into RDD with enriched Data model
…..
//Aggregate the data from each DStream into the working RDD
rowTweets.foreachRDD( rdd => {
    if ( rdd.count() > 0 ){
        workingRDD = sc.parallelize( rdd.map( t => t._1 ).collect()).union( workingRDD )
    }
})
[Diagram: each micro-batch of the initial rowTweets DStream is appended as rows (Row 1 … Row n) to workingRDD]
Create SparkSQL DataFrame and register Table
//Create a SparkSQL DataFrame from the aggregate workingRDD
val df = sqlContext.createDataFrame( workingRDD, schemaTweets )
//Register a temporary table using the name "tweets"
df.registerTempTable("tweets")
println("A new table named tweets with " + df.count() + " records has been correctly created and can be accessed through the SQLContext variable")
println("Here's the schema for tweets")
df.printSchema()
(sqlContext, df)
[Diagram: the rows of workingRDD (Row 1 … Row n) become a relational SparkSQL table]
author | date | lang | … | Cheerfulness | Negative | … | Conscientiousness
John Smith | 10/11/2015 – 20:18 | en | … | 0.0 | 65.8 | … | 25.5
Alfred | … | en | … | 34.5 | 0.0 | … | 100.0
… | … | … | … | … | … | … | …
Chris | … | en | … | 85.3 | 22.9 | … | 0.0
Building a Spark Streaming application:
Sentiment analysis with Twitter and Watson Tone Analyzer
‣IPython Notebook analysis
1. Load the data into an IPython Notebook
2. Analytic 1: Compute the distribution of tweets by sentiment scores greater than 60%
3. Analytic 2: Compute the top 10 hashtags contained in the tweets
4. Analytic 3: Visualize aggregated sentiment scores for the top 5 hashtags
Load the data into an IPython Notebook
‣ You can follow along the steps here: https://github.com/ibm-cds-labs/spark.samples/blob/master/streaming-twitter/notebook/Twitter%20%2B%20Watson%20Tone%20Analyzer%20Part%202.ipynb
Create a SQLContext from a SparkContext
Load from a Parquet file and create a DataFrame
Create a SQL table and start executing SQL queries
Analytic 1 - Compute the distribution of tweets by sentiment
scores greater than 60%
#Create an array that will hold the count for each sentiment
sentimentDistribution=[0] * 9
#For each sentiment, run a SQL query that counts the number of tweets for which the sentiment score is greater than 60%
#Store the data in the array
for i, sentiment in enumerate(tweets.columns[-9:]):
    sentimentDistribution[i]=sqlContext.sql("SELECT count(*) as sentCount FROM tweets where " + sentiment + " > 60")\
        .collect()[0].sentCount
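To try the same counting query without a Spark cluster, the sketch below runs it against an in-memory SQLite table from the Python standard library. This is a stand-in for the SparkSQL table, not the notebook's code: the three sample rows and the reduced two-column schema are invented for illustration.

```python
import sqlite3

# In-memory table standing in for the SparkSQL "tweets" table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (author TEXT, Anger REAL, Cheerfulness REAL)")
conn.executemany(
    "INSERT INTO tweets VALUES (?, ?, ?)",
    [("a", 70.0, 10.0), ("b", 61.5, 95.0), ("c", 0.0, 60.0)],
)

sentiments = ["Anger", "Cheerfulness"]   # fixed column names, never user input
distribution = [0] * len(sentiments)
for i, sentiment in enumerate(sentiments):
    # Same query shape as the notebook: one count per sentiment column
    query = "SELECT count(*) AS sentCount FROM tweets WHERE " + sentiment + " > 60"
    distribution[i] = conn.execute(query).fetchone()[0]
print(distribution)
```

Because the comparison is strict, a score of exactly 60 is not counted.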
Use matplotlib to create a bar chart
Bar Chart Visualization
Analytic 2: Compute the top 10 hashtags contained in
the tweets
[Diagram: RDD pipeline]
• Stages: Initial Tweets RDD, Filter hashtags, Key/value pair RDD, Reduced map with counts, Sorted map by key
• Operations, in order: flatMap, filter, map, reduceByKey, sortByKey
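The pipeline can be traced step by step on a handful of strings. The sketch below is a plain-Python analogy of the five operations, with invented sample tweets; the final ordering by count is an assumption about how the top 10 list is ranked, since the notebook cell itself is not reproduced in the slides.

```python
from collections import Counter

texts = [
    "Explains my disappointment #SuperBloodMoon",
    "hihi #ALDUBThisMustBeLove #SuperBloodMoon",
    "Good morning everyone",
    "excited #ALDUBThisMustBeLove",
    "wow #SuperBloodMoon",
]

words = [w for text in texts for w in text.split(" ")]   # flatMap: tweets -> words
hashtags = [w for w in words if w.startswith("#")]       # filter: keep hashtags only
pairs = [(tag, 1) for tag in hashtags]                   # map: (tag, 1) pairs

counts = Counter()                                       # reduceByKey: sum per tag
for tag, n in pairs:
    counts[tag] += n

# sortByKey analogue: highest count first, then keep at most 10 entries
top10tags = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:10]
print(top10tags)
```

The result is a list of (tag, count) tuples, the same shape that the later steps read through `top10tags`.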
Analytic 3 - Visualize aggregated sentiment scores
for the top 5 hashtags
‣ Problem:
- Compute the mean of each emotion score for each of the top 10 hashtags
- Format the data in a way that can be consumed by the plot script
#Step 1: Create RDD from tweets dataframe
tagsRDD = tweets.map(lambda t: t )
tweets (Type: DataFrame)
author | … | Cheerfulness
Jake | … | 0.0
Scrad | … | 23.5
Nittya Indika | … | 84.0
… | … | …
Madison | … | 93.0
tagsRDD (Type: RDD)
Row(author=u'Jake', …, text=u'@sarahwag…', Cheerfulness=0.0, …)
Row(author=u'Scrad', …, text=u' #SuperBloodMoon https://t…', Cheerfulness=23.5, …)
Row(author=u' Nittya Indika', …, text=u' Good mornin! http://t.…', Cheerfulness=84.0, …)
…
Row(author=u' Madison', …, text=u' how many nights…', Cheerfulness=93.0, …)
#Step 2: Filter to only keep the entries that are in top10tags
tagsRDD = tagsRDD.filter( lambda t: any(s in t.text for s in [i[0] for i in top10tags] ) )
Rows kept by the filter:
Row(author=u'Mike McGuire', text=u'Explains my disappointment #SuperBloodMoon https://t.co/Gfg7vWws5W', …, Conscientiousness=0.0)
Row(author=u'Meng_tisoy', text=u'…hihi #ALDUBThisMustBeLove https://t….', …, Conscientiousness=68.0)
Row(author=u'Kevin Contreras', text=u'…SILA! #ALDUBThisMustBeLove', …, Conscientiousness=68.0)
…
Row(author=u'abbi', text=u'…excited #ALDUBThisMustBeLove https://t…', …, Conscientiousness=100.0)
#Step 3: Create a flatMap using the expand function defined below; this will be used to collect all the scores
#for a particular tag with the following format: Tag-Tone:ToneScore
cols = tweets.columns[-9:]
def expand( t ):
    ret = [ ]
    for s in [i[0] for i in top10tags]:
        if ( s in t.text ):
            for tone in cols:
                ret += [s + u"-" + unicode(tone) + ":" + unicode(getattr(t, tone))]
    return ret
tagsRDD = tagsRDD.flatMap( expand )
FlatMap of encoded values:
u'#SuperBloodMoon-Cheerfulness:0.0'
u'#SuperBloodMoon-Negative:100.0'
u'#SuperBloodMoon-Negative:23.5'
…
u'#ALDUBThisMustBeLove-Analytical:85.0'
#Step 4: Create a map indexed by Tag-Tone keys
tagsRDD = tagsRDD.map( lambda fullTag : (fullTag.split(":")[0], float( fullTag.split(":")[1]) ))
Each encoded value is mapped to a (Tag-Tone, score) pair:
u'#SuperBloodMoon-Cheerfulness:0.0' → (u'#SuperBloodMoon-Cheerfulness', 0.0)
u'#SuperBloodMoon-Negative:100.0' → (u'#SuperBloodMoon-Negative', 100.0)
u'#SuperBloodMoon-Negative:23.5' → (u'#SuperBloodMoon-Negative', 23.5)
…
u'#ALDUBThisMustBeLove-Analytical:85.0' → (u'#ALDUBThisMustBeLove-Analytical', 85.0)
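Steps 3 and 4 can be exercised together on a couple of hand-made rows. The sketch below is written for Python 3 (so `str` replaces the notebook's `unicode`) and uses a namedtuple as a hypothetical stand-in for a DataFrame Row with only two tone columns; `to_pair` is an illustrative name for the Step 4 lambda.

```python
from collections import namedtuple

# Hypothetical stand-in for a DataFrame Row with two tone columns
Tweet = namedtuple("Tweet", ["text", "Cheerfulness", "Negative"])
cols = ["Cheerfulness", "Negative"]
top10tags = [("#SuperBloodMoon", 3), ("#ALDUBThisMustBeLove", 2)]

def expand(t):
    """Emit one 'Tag-Tone:Score' string per matching tag and tone."""
    ret = []
    for tag in [i[0] for i in top10tags]:
        if tag in t.text:
            for tone in cols:
                ret.append(tag + "-" + tone + ":" + str(getattr(t, tone)))
    return ret

def to_pair(full_tag):
    """Split 'Tag-Tone:Score' into a (Tag-Tone, float score) pair."""
    key, score = full_tag.split(":")
    return (key, float(score))

rows = [Tweet("so sad #SuperBloodMoon", 0.0, 100.0), Tweet("no tags", 50.0, 50.0)]
encoded = [s for row in rows for s in expand(row)]   # flatMap
pairs = [to_pair(s) for s in encoded]                # map
print(pairs)
```

A row without any of the top tags contributes nothing, which is why a flatMap rather than a map is used for the first step.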
#Step 5: Call combineByKey to format the data as follows
#Key=Tag-Tone, Value=(sum_of_all_scores_for_this_tone, count)
tagsRDD = tagsRDD.combineByKey((lambda x: (x,1)),
    (lambda x, y: (x[0] + y, x[1] + 1)),
    (lambda x, y: (x[0] + y[0], x[1] + y[1])))
Result, keyed by Tag-Tone with a (sum, count) value:
u'#Supermoon-Confident' (0.0, 3)
u'#HajjStampede-Tentative' (0.0, 3)
u'#KiligKapamilya-Conscientiousness' (290.0, 6)
…
u'#LunarEclipse-Tentative' (92.0, 4)
createCombiner: creates the initial (sum, count) tuple from the first value seen for a key
mergeValue: called for each new value, updates (sum, count)
mergeCombiners: reduce part, merges 2 combiners
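The roles of the three functions are easier to see when the partitions are explicit. The following plain-Python sketch imitates the combineByKey contract on a list of lists; `combine_by_key` is a hypothetical helper written for this illustration, not a Spark API, and the four sample scores are chosen so that the total matches the (92.0, 4) entry above.

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Plain-Python analogue of combineByKey over a list of partitions."""
    merged = {}
    for partition in partitions:
        local = {}   # combiners built inside a single partition
        for key, value in partition:
            if key in local:
                local[key] = merge_value(local[key], value)
            else:
                local[key] = create_combiner(value)
        for key, combiner in local.items():   # merge across partitions
            if key in merged:
                merged[key] = merge_combiners(merged[key], combiner)
            else:
                merged[key] = combiner
    return merged

partitions = [
    [("#LunarEclipse-Tentative", 40.0), ("#LunarEclipse-Tentative", 12.0)],
    [("#LunarEclipse-Tentative", 30.0), ("#LunarEclipse-Tentative", 10.0)],
]
result = combine_by_key(
    partitions,
    lambda x: (x, 1),                          # createCombiner: first value -> (sum, count)
    lambda acc, y: (acc[0] + y, acc[1] + 1),   # mergeValue: fold in one more value
    lambda a, b: (a[0] + b[0], a[1] + b[1]),   # mergeCombiners: join two partial results
)
print(result)
```

Keeping (sum, count) instead of a running average is what makes the partial results mergeable in any order.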
#Step 6 : ReIndex the map to have the key be the Tag and value be (Tone, Average_score) tuple
#Key=Tag
#Value=(Tone, average_score)
tagsRDD = tagsRDD.map(lambda (key, ab): (key.split("-")[0], (key.split("-")[1], round(ab[0]/ab[1],2))))
Each (Tag-Tone, (sum, count)) entry is mapped to a (Tag, (Tone, average_score)) entry:
u'#Supermoon-Confident' (0.0, 3) → u'#Supermoon' (u'Confident', 0.0)
u'#HajjStampede-Tentative' (0.0, 3) → u'#HajjStampede' (u'Tentative', 0.0)
u'#KiligKapamilya-Conscientiousness' (290.0, 6) → u'#KiligKapamilya' (u'Conscientiousness', 48.33)
…
u'#LunarEclipse-Tentative' (92.0, 4) → u'#LunarEclipse' (u'Tentative', 23.0)
#Step 7: Reduce the map on the Tag key, value becomes a list of (Tone,average_score) tuples
tagsRDD = tagsRDD.reduceByKey( lambda x, y : makeList(x) + makeList(y) )
Result, one entry per tag:
u'#HajjStampede' [(u'Tentative', 0.0), (u'Agreeableness', 3.67), …, (u'Cheerfulness', 100.0)]
u'#Supermoon' [(u'Confident', 0.0), (u'Openness', 91.0), …, (u'Agreeableness', 20.33)]
u'#bloodmoon' [(u'Anger', 0.0), (u'Negative', 0.0), …, (u'Openness', 38.0)]
…
u'#KiligKapamilya' [(u'Conscientiousness', 48.33), (u'Anger', 0.0), …, (u'Agreeableness', 10.83)]
#Step 8 : Sort the (Tone,average_score) tuples alphabetically by Tone
tagsRDD = tagsRDD.mapValues( lambda x : sorted(x) )
Result, with the tuples of each tag sorted by tone:
u'#HajjStampede' [(u'Agreeableness', 3.67), (u'Cheerfulness', 100.0), …, (u'Tentative', 0.0)]
u'#Supermoon' [(u'Agreeableness', 20.33), (u'Confident', 0.0), …, (u'Openness', 91.0)]
u'#bloodmoon' [(u'Anger', 0.0), (u'Negative', 0.0), …, (u'Openness', 38.0)]
…
u'#KiligKapamilya' [(u'Agreeableness', 10.83), (u'Anger', 0.0), (u'Conscientiousness', 48.33), …]
#Step 9 : Format the data as expected by the plotting code in the next cell.
#map the Values to a tuple as follow: ([list of tone], [list of average score])
tagsRDD = tagsRDD.mapValues( lambda x : ([elt[0] for elt in x],[elt[1] for elt in x]) )
Result, where each value is a tuple of 2 arrays (tones, scores):
u'#HajjStampede' ([u'Agreeableness', u'Cheerfulness', …, u'Tentative'], [3.67, 100.0, …, 0.0])
u'#Supermoon' ([u'Agreeableness', u'Confident', …, u'Openness'], [20.33, 0.0, …, 91.0])
u'#bloodmoon' ([u'Anger', u'Negative', …, u'Openness'], [0.0, 0.0, …, 38.0])
…
u'#KiligKapamilya' ([u'Agreeableness', u'Anger', u'Conscientiousness', …], [10.83, 0.0, 48.33, …])
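Steps 6 through 9 reshape the (sum, count) pairs into the structure the plotting code expects. The sketch below walks through the same four reshaping steps on a small dictionary in plain Python; the three input entries are sample values, and a dictionary stands in for the key/value RDD.

```python
# (Tag-Tone) -> (sum, count), the shape produced by Step 5
combined = {
    "#KiligKapamilya-Conscientiousness": (290.0, 6),
    "#KiligKapamilya-Anger": (0.0, 6),
    "#LunarEclipse-Tentative": (92.0, 4),
}

grouped = {}
for key, (total, count) in combined.items():
    tag, tone = key.split("-")                                            # Step 6: re-key by tag
    grouped.setdefault(tag, []).append((tone, round(total / count, 2)))   # Step 7: one list per tag

formatted = {}
for tag, tone_scores in grouped.items():
    ordered = sorted(tone_scores)                                         # Step 8: alphabetical by tone
    formatted[tag] = ([t for t, _ in ordered], [s for _, s in ordered])   # Step 9: (tones, scores)
print(formatted)
```

Sorting by tone before splitting keeps the two lists aligned, so every tag presents its tones in the same order on the chart.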
#Step 10 : Use custom sort function to sort the entries by order of appearance in top10tags
def customCompare( key ):
    for (k,v) in top10tags:
        if k == key:
            return v
    return 0
tagsRDD = tagsRDD.sortByKey(ascending=False, numPartitions=None, keyfunc = customCompare)
Result after sorting:
u'#Superbloodmon' ([u'Agreeableness', u'Cheerfulness', …, u'Tentative'], [33.97, 19.38, …, 12.85])
u'#BBWLA' ([u'Agreeableness', u'Confident', …, u'Openness'], [38.33, 12.34, …, 21.43])
u'#ALDUBThisMustBeLove' ([u'Anger', u'Negative', …, u'Openness'], [0.0, 0.0, …, 62.0])
…
u'#Newmusic' ([u'Agreeableness', u'Anger', u'Conscientiousness', …], [0.0, 0.0, 68.33, …])
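The effect of the key function can be reproduced with Python's built-in `sorted`. In this sketch the tag counts and the entries are made-up sample values, and a dictionary lookup replaces the loop in customCompare; the ordering logic (descending by the count stored in `top10tags`, 0 for unknown tags) is the same.

```python
top10tags = [("#Superbloodmoon", 50), ("#BBWLA", 30), ("#Newmusic", 5)]
counts = dict(top10tags)

def custom_compare(key):
    """Return the tag's count in top10tags, or 0 for an unknown tag."""
    return counts.get(key, 0)

entries = [
    ("#Newmusic", (["Anger"], [0.0])),
    ("#Superbloodmoon", (["Anger"], [1.0])),
    ("#BBWLA", (["Anger"], [2.0])),
]
# sortByKey(ascending=False, keyfunc=...) analogue: highest count first
ordered = sorted(entries, key=lambda kv: custom_compare(kv[0]), reverse=True)
ordered_tags = [tag for tag, _ in ordered]
print(ordered_tags)
```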
Real-Time Web app Dashboard
‣ Pie chart showing top hashtags distribution
‣ Bar chart showing distribution of tone scores for each of the top hashtags
Create a Receiver that subscribes to Kafka topics
Store new record into DStream
Get batch of new records
MessageHub on Bluemix requires Kafka 0.9
Create Kafka DStream
Implicit conversion to synthetically add a method to StreamingContext
Enrich Tweets with Watson Scores
Get Tone scores
Map to new EnrichedTweet Object
Streaming analytics
Prepare for Map/Reduce
Map tag-tone to corresponding score
Compute Count + Average for each score
Map each tag to count + list of score averages
Reduce
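
The map/reduce steps listed above can be sketched without Spark in plain Python 3. This is a single-machine illustration under assumptions, not the streaming code from the talk: the function names are hypothetical, tweets are modelled as `(text, {tone: score})` pairs, and every tweet is assumed to carry every tone.

```python
from collections import defaultdict


def tag_tone_pairs(tweets, tags):
    """Map: emit ((tag, tone), score) for each tracked tag found in a tweet."""
    for text, scores in tweets:
        for tag in tags:
            if tag in text:
                for tone, score in scores.items():
                    yield (tag, tone), score


def count_and_average(pairs):
    """Reduce: (tag, tone) -> (count, average score)."""
    totals = defaultdict(lambda: [0, 0.0])      # key -> [count, sum]
    for key, score in pairs:
        totals[key][0] += 1
        totals[key][1] += score
    return {key: (count, total / count) for key, (count, total) in totals.items()}


def by_tag(stats):
    """Regroup: tag -> (count, list of (tone, average) sorted by tone)."""
    grouped = defaultdict(list)
    counts = {}
    for (tag, tone), (count, average) in stats.items():
        grouped[tag].append((tone, round(average, 2)))
        counts[tag] = count   # same for every tone, given the assumption above
    return {tag: (counts[tag], sorted(tones)) for tag, tones in grouped.items()}


tweets = [
    ("Explains my disappointment #SuperBloodMoon", {"Cheerfulness": 0.0, "Negative": 100.0}),
    ("wow #SuperBloodMoon", {"Cheerfulness": 50.0, "Negative": 23.5}),
    ("excited #ALDUBThisMustBeLove", {"Cheerfulness": 90.0, "Negative": 0.0}),
]
tags = ["#SuperBloodMoon", "#ALDUBThisMustBeLove"]
result = by_tag(count_and_average(tag_tone_pairs(tweets, tags)))
```

Keeping the count next to the averages is what later allows results from separate micro-batches to be recombined.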
Maintain State between micro-batch RDDs
Maintain state between micro-batches by recomputing the count and list of averages
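
Recomputing the count and the averages when a new micro-batch arrives amounts to a weighted merge of the old and new averages. The Python 3 sketch below shows that arithmetic for one hashtag; `update_state` and its data layout are assumptions made for illustration. The `(new values, previous state)` argument shape is the one PySpark's `updateStateByKey` passes to its update function, but the slide does not say which operator the talk's Scala code uses.

```python
def update_state(batch_values, previous):
    """Merge micro-batch results into the running state for one hashtag.

    batch_values: list of (count, {tone: average}) from the current micro-batch.
    previous:     running (count, {tone: average}), or None for a new hashtag.
    Assumes every batch reports every tone.
    """
    count, averages = previous if previous is not None else (0, {})
    averages = dict(averages)                  # do not mutate the old state
    for batch_count, batch_averages in batch_values:
        if batch_count == 0:
            continue                           # nothing to merge
        total = count + batch_count
        for tone, batch_average in batch_averages.items():
            old_average = averages.get(tone, 0.0)
            # Weight each average by the number of tweets behind it.
            averages[tone] = (old_average * count + batch_average * batch_count) / total
        count = total
    return (count, averages)


state = update_state([(2, {"Negative": 60.0})], None)
state = update_state([(1, {"Negative": 0.0})], state)
```

After the second call the running state covers three tweets, with the average pulled down by the one new tweet that scored zero.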
Produce Streaming analytics topic data
Can’t call the Kafka Producer from the streaming analytic because it is not serializable
Post message to queue
Process message queue from separate Thread
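
The workaround on this slide is a producer/consumer hand-off: the analytic only posts messages to a queue, and a separate thread that owns the producer drains it. A minimal Python 3 sketch of that pattern follows. `FakeProducer`, `post`, and `publisher_loop` are hypothetical names, the payloads are made up, and where the queue lives in the real Spark application is not shown on the slide; only the two topic names come from the deck.

```python
import queue
import threading


class FakeProducer:
    """Hypothetical stand-in for a Kafka producer; it just records what is sent."""

    def __init__(self):
        self.sent = []

    def send(self, topic, message):
        self.sent.append((topic, message))


outbox = queue.Queue()
_STOP = object()          # sentinel that tells the publisher thread to exit


def post(topic, message):
    """Called from the analytic: enqueue only, never touch the producer."""
    outbox.put((topic, message))


def publisher_loop(producer):
    """Runs on its own thread and is the only code that uses the producer."""
    while True:
        item = outbox.get()
        if item is _STOP:
            break
        topic, message = item
        producer.send(topic, message)


producer = FakeProducer()
worker = threading.Thread(target=publisher_loop, args=(producer,), daemon=True)
worker.start()

post("topHashTags", '{"#SuperBloodMoon": 3}')
post("topHashTags.toneScores", '{"#SuperBloodMoon": {"Negative": 61.75}}')

outbox.put(_STOP)         # shut down cleanly once everything has been posted
worker.join()
```

Because the analytic only ever enqueues tuples of strings, the producer object never has to be captured by, or serialized with, the streaming code.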
Real-time web app dashboard
‣ Technology used:
- Mozaik (https://github.com/plouc/mozaik)
- ReactJS
- WebSocket
- D3JS/C3JS
‣ Consume topics generated by Spark Streaming analytics
[Diagram labels: Consumer Spark Topics, Real-Time Dashboard]
Topics:
•topHashTags
•topHashTags.toneScores
Access MessageHub API through message-hub-rest node module
React Components for Mozaik framework
Demo!
Thank You

More Related Content

Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

  • 1. ©2015 IBM Corporation Spark + Watson + Twitter DataPalooza SF 2015 David Taieb STSM - IBM Cloud Data Services
  • 2. ©2015 IBM Corporation Agenda • Introduction • Quick Introduction to Spark • Set up development environment and create the hello world application • Notebook Walk-through • Spark Streaming • Deep dive: Sentiment analysis with Twitter and Watson Tone Analyzer • Architectural Overview • Set up the Bluemix services: Watson Tone Analyzer, Message Hub and Event Hub • Create the Streaming Receiver to connect to Kafka (Scala) • Create analytics using Jupyter Notebook (Python) • Create Real-time Web Dashboard (Nodejs)
  • 4. ©2015 IBM Corporation Introduction Our mission: We are here to help developers realize their most ambitious projects. Goals for today’s session: •Introduction to real time analytics using Spark Streaming •Technical Deep dive on the Spark + Watson + Twitter sample application •At the end of this session, you should be able to download the source code and run the application on IBM Analytics for Apache Spark
  • 5. ©2015 IBM Corporation Agenda • Introduction • Quick Introduction to Spark • Set up development environment and create the hello world application • Notebook Walk-through • Spark Streaming • Deep dive: Sentiment analysis with Twitter and Watson Tone Analyzer • Architectural Overview • Set up the Bluemix services: Watson Tone Analyzer, Message Hub and Event Hub • Create the Streaming Receiver to connect to Kafka (Scala) • Create analytics using Jupyter Notebook (Python) • Create Real-time Web Dashboard (Nodejs)
  • 6. ©2015 IBM Corporation What is spark Spark is an open source in-memory computing framework for distributed data processing and iterative analysis on massive data volumes
  • 7. ©2015 IBM Corporation Spark Core Libraries Spark CoreSpark Core general compute engine, handles distributed task dispatching, scheduling and basic I/O functions Spark SQL Spark SQL Spark Streaming Spark Streaming Mllib (machine learning) Mllib (machine learning) GraphX (graph) GraphX (graph) executes SQL statements performs streaming analytics using micro-batches common machine learning and statistical algorithms distributed graph processing framework
  • 8. ©2015 IBM Corporation Key reasons for interest in Spark Open SourceOpen Source FastFast distributed data processing distributed data processing ProductiveProductive Web ScaleWeb Scale •In-memory storage greatly reduces disk I/O •Up to 100x faster in memory, 10x faster on disk •Largest project and one of the most active on Apache •Vibrant growing community of developers continuously improve code base and extend capabilities •Fast adoption in the enterprise (IBM, Databricks, etc…) •Fault tolerant, seamlessly recompute lost data from hardware failure •Scalable: easily increase number of worker nodes •Flexible job execution: Batch, Streaming, Interactive •Easily handle Petabytes of data without special code handling •Compatible with existing Hadoop ecosystem •Unified programming model across a range of use cases •Rich and expressive apis hide complexities of parallel computing and worker node management •Support for Java, Scala, Python and R: less code written •Include a set of core libraries that enable various analytic methods: Saprk SQL, Mllib, GraphX
  • 9. ©2015 IBM Corporation High level architecture Spark Application (driver) Master (cluster Manager) Worker Node Worker Node Worker Node Worker Node … Spark Cluster Kernel Master (cluster Manager) Worker Node Worker Node … Spark Cluster Notebook Server Browser Http/WebSockets Kernel Protocol (e.g ZeroMQ) Batch Job (Spark-Submit) Interactive Notebook • RDD Partitioning • Task packaging and dispatching • Worker node scheduling
  • 10. ©2015 IBM Corporation Spark programming model lifecycle Load data into RDDs Apply transformation into new RDDs Apply Actions (analytics) to produce results • In memory collection: • sc.parallelize • Unstructured data: • Text: sc.textFile • HDFS: sc.hadoopFile • Structured data: • Json: sqlCtxt.jsonFile • Parquet: sqlCtxt.parquetFile • Jdbc: sqlCtxt.load • Custom data source: 1.4+ • Streaming data: • TwitterUtils.createStream • KafkaUtils.createStream • FlumeUtils.createStream • MQTTUtils.createStream • Custom DStream • Sc: SparkContext entry point: created by the application or automatically provided by Notebook shell • sqlCtxt: SQLContext entry point for working with DataFrames and execute SQLQueries • Create new RDDs by applying transformations to existing one • map(fn): apply fn to all elements in RDD • flatMap(fn): Same as map, fn can return 0 or more elements • filter(fn): select only elements for which fn returns true • reduceByKey • sortByKey • Sample: sample a fraction of data • Union: combine elements of 2 RDDs • Intersection: intersect 2 RDDS • Distinct: remove duplicate elements • …. • Produce results from running analytics against RDDs • reduce(fn): perform summary operation on the elements • collect(): return all elements in an Array • count(): count the number of elements in the RDD • take(n): return the first n elements in an Array • foreach(fn): execute the fn on all the elements in the RDD • saveAsTextFile: persist the elements in a text file • ….
  • 12. ©2015 IBM Corporation Ecosystem of the IBM Analytics for Apache Spark as service
  • 13. ©2015 IBM Corporation Agenda • Introduction • Quick Introduction to Spark • Set up development environment and create the hello world application • Notebook Walk-through • Spark Streaming • Deep dive: Sentiment analysis with Twitter and Watson Tone Analyzer • Architectural Overview • Set up the Bluemix services: Watson Tone Analyzer, Message Hub and Event Hub • Create the Streaming Receiver to connect to Kafka (Scala) • Create analytics using Jupyter Notebook (Python) • Create Real-time Web Dashboard (Nodejs)
  • 14. ©2015 IBM Corporation Setup local development Environment • Pre-requisites - Scala runtime 2.10.4 http://www.scala- lang.org/download/2.10.4.html - Homebrew http://brew.sh/ - Scala sbt http://www.scala-sbt.org/download.html - Spark 1.3.1 http://www.apache.org/dyn/closer.lua/spark/spark- 1.3.1/spark-1.3.1.tgz • Detailled instructions here: https://developer.ibm.com/clouddataservic es/start-developing-with-spark-and- notebooks/
  • 15. ©2015 IBM Corporation Setup local development Environment contd.. • Create scala project using sbt • Create directories to start from scratch mkdir helloSpark && cd helloSpark mkdir -p src/main/scala mkdir -p src/main/java mkdir -p src/main/resources Create a subdirectory under src/main/scala directory mkdir -p com/ibm/cds/spark/sample • Github URL for the same project https://github.com/ibm-cds- labs/spark.samples
  • 16. ©2015 IBM Corporation Setup local development Environment contd.. • Create HelloSpark.scala using an IDE or a text editor • Copy paste this code snippetpackage com.ibm.cds.spark.samples import org.apache.spark._ object HelloSpark {     //main method invoked when running as a standalone Spark Application     def main(args: Array[String]) {         val conf = new SparkConf().setAppName("Hello Spark")         val spark = new SparkContext(conf)           println("Hello Spark Demo. Compute the mean and variance of a collection")         val stats = computeStatsForCollection(spark);         println(">>> Results: ")         println(">>>>>>>Mean: " + stats._1 );         println(">>>>>>>Variance: " + stats._2);         spark.stop()     }       //Library method that can be invoked from Jupyter Notebook     def computeStatsForCollection( spark: SparkContext, countPerPartitions: Int = 100000, partitions: Int=5): (Double, Double) = {            val totalNumber = math.min( countPerPartitions * partitions,
  • 17. ©2015 IBM Corporation Setup local development Environment contd.. • Create a file build.sbt under the project root directory: • Under the project root directory run name := "helloSpark"   version := "1.0"   scalaVersion := "2.10.4"   libraryDependencies ++= {     val sparkVersion =  "1.3.1"     Seq(         "org.apache.spark" %% "spark-core" % sparkVersion,         "org.apache.spark" %% "spark-sql" % sparkVersion,         "org.apache.spark" %% "spark-repl" % sparkVersion     ) } Download all dependencies $sbt update Compile $sbt compile Package an application jar file $sbt package
  • 18. ©2015 IBM Corporation Hello World application on Bluemix Apache Starter
  • 19. ©2015 IBM Corporation Agenda • Introduction • Quick Introduction to Spark • Set up development environment and create the hello world application • Notebook Walk-through • Spark Streaming • Deep dive: Sentiment analysis with Twitter and Watson Tone Analyzer • Architectural Overview • Set up the Bluemix services: Watson Tone Analyzer, Message Hub and Event Hub • Create the Streaming Receiver to connect to Kafka (Scala) • Create analytics using Jupyter Notebook (Python) • Create Real-time Web Dashboard (Nodejs)
  • 20. ©2015 IBM Corporation Introduction to Notebooks ‣ Notebooks allow creation of interactive executable documents that include rich text with Markdown, executable code with Scala, Python or R, graphics with matplotlib ‣ Apache Spark provides multiple flavor APIs that can be executed with a REPL shell: Scala, Python (PYSpark), R ‣ Multiple open-source implementations available: - Jupyter: https://jupyter.org - Apache Zeppelin: http://zeppelin-project.org
  • 21. ©2015 IBM Corporation Notebook walkthrough ‣ Sign up on Bluemix https://console.ng.bluemix.net/registration/ ‣ Getting started with Analytics for Apache Spark: https://www.ng.bluemix.net/docs/services/Ana lyticsforApacheSpark/index.html ‣ You can also follow tutorial here: https://developer.ibm.com/clouddataservices/ start-developing-with-spark-and-notebooks/
  • 23. ©2015 IBM Corporation Agenda • Introduction • Quick Introduction to Spark • Set up development environment and create the hello world application • Notebook Walk-through • Spark Streaming • Deep dive: Sentiment analysis with Twitter and Watson Tone Analyzer • Architectural Overview • Set up the Bluemix services: Watson Tone Analyzer, Message Hub and Event Hub • Create the Streaming Receiver to connect to Kafka (Scala) • Create analytics using Jupyter Notebook (Python) • Create Real-time Web Dashboard (Nodejs)
  • 24. ©2015 IBM Corporation Spark Streaming ‣ “Spark Streaming is an extension of the core Spark API that enables scalable, high- throughput, fault-tolerant stream processing of live data streams” (http://spark.apache.org/docs/latest/streami ng-programming-guide.html) ‣ Breakdown the Streaming data into smaller pieces which are then sent to the Spark Engine
  • 25. ©2015 IBM Corporation Spark Streaming ‣ Provides connectors for multiple data sources: - Kafka - Flume - Twitter - MQTT - ZeroMQ ‣ Provides API to create custom connectors. Lots of examples available on Github and spark-packages.org
  • 26. ©2015 IBM Corporation Agenda • Introduction • Quick Introduction to Spark • Set up development environment and create the hello world application • Notebook Walk-through • Spark Streaming • Deep dive: Sentiment analysis with Twitter and Watson Tone Analyzer • Architectural Overview • Set up the Bluemix services: Watson Tone Analyzer, Message Hub and Event Hub • Create the Streaming Receiver to connect to Kafka (Scala) • Create analytics using Jupyter Notebook (Python) • Create Real-time Web Dashboard (Nodejs)
  • 27. ©2015 IBM Corporation Spark + Twitter + Watson application ‣ Use Spark Streaming in combination with IBM Watson to perform sentiment analysis and track how a conversation is trending on Twitter. ‣ Use Spark Streaming to create a feed that captures live tweets from Twitter. You can optionally filter the tweets that contain the hashtag(s) of your choice. ‣ The tweet data is then enriched in real time with various sentiment scores provided by the Watson Tone Analyzer service (available on Bluemix). This service provides insight into sentiment, or how the author feels. ‣ The data is then loaded and analyzed by the data scientist within Notebook. ‣ We can also use streaming analytics to feed a real-time web app dashboard
  • 28. ©2015 IBM Corporation About this sample application • Github: https://github.com/ibm-cds-labs/spark.samples/tree/master/streaming-twitter • Tutorial: https://developer.ibm.com/clouddataservices/sentiment-analysis-of-twitter-hashtags • A word about Scala • Scala is Object oriented but also support functional programming style • Bi-directional interoperability with Java • Resources: • Official web site: http://scala-lang.org • Excellent first steps site: http://www.artima.com/scalazine/articles/steps.html • Free e-books: http://readwrite.com/2011/04/30/5-free-b-books-and-tutorials-o
  • 29. ©2015 IBM Corporation Agenda • Introduction • Quick Introduction to Spark • Set up development environment and create the hello world application • Notebook Walk-through • Spark Streaming • Deep dive: Sentiment analysis with Twitter and Watson Tone Analyzer • Architectural Overview • Set up the Bluemix services: Watson Tone Analyzer, Message Hub and Event Hub • Create the Streaming Receiver to connect to Kafka (Scala) • Create analytics using Jupyter Notebook (Python) • Create Real-time Web Dashboard (Nodejs)
  • 30. ©2015 IBM Corporation Spark Streaming with “IBM Insight for Twitter” and “Watson Tone Analyzer” Watson Tone Analyzer Service Bluemix Producer Stream Enrich data with Emotion Tone Scores Processed data Scala Notebook IPython Notebook Consumer Stream Message Hub Service Bluemix Full Archive Search API Consumer Spark Topics Publish topics from Spark analytics results Event Hub Service Bluemix Real-Time Dashboard Data Engineer Business Analyst C(Suite) Data Scientist
  • 31. ©2015 IBM Corporation Building a Spark Streaming application Sentiment analysis with Twitter and Watson Tone Analyzer ‣Configure Twitter and Watson Tone Analyzer 1. Configure OAuth credentials for Twitter 2. Create a Watson Tone Analyzer Service on Bluemix 3. Configure MessageHub Service on Bluemix (Kafka) 4. Configure EventHub Service on Bluemix
  • 32. ©2015 IBM Corporation Configure OAuth credentials for Twitter ‣You can follow along the steps in https://developer.ib m.com/clouddataser vices/sentiment- analysis-of-twitter- hashtags/#twitter
  • 33. ©2015 IBM Corporation Create a Watson Tone Analyzer Service on Bluemix ‣You can follow along the steps in https://developer.ibm.com/clouddataservices/ sentiment-analysis-of-twitter- hashtags/#bluemix
  • 34. ©2015 IBM Corporation Building a Spark Streaming application Sentiment analysis with Twitter and Watson Tone Analyzer ‣Work with Twitter data 1. Create a Twitter Stream 2. Enrich the data with sentiment analysis from Watson Tone Analyzer 3. Aggregate data into RDD with enriched Data model 4. Create SparkSQL DataFrame and register Table
  • 35. ©2015 IBM Corporation Create a Twitter Stream //Hold configuration key/value pairs val config = Map[String, String]( ("twitter4j.oauth.consumerKey", Option(System.getProperty("twitter4j.oauth.consumerKey")).orNull ), ("twitter4j.oauth.consumerSecret", Option(System.getProperty("twitter4j.oauth.consumerSecret")).orNull ), ("twitter4j.oauth.accessToken", Option(System.getProperty("twitter4j.oauth.accessToken")).orNull ), ("twitter4j.oauth.accessTokenSecret", Option(System.getProperty("twitter4j.oauth.accessTokenSecret")).orNull ), ("tweets.key", Option(System.getProperty("tweets.key")).getOrElse("")), ("watson.tone.url", Option(System.getProperty("watson.tone.url")).orNull ), ("watson.tone.username", Option(System.getProperty("watson.tone.username")).orNull ), ("watson.tone.password", Option(System.getProperty("watson.tone.password")).orNull ) ) Create a map that stores the credentials for the Twitter and Watson Service config.foreach( (t:(String,String)) => if ( t._1.startsWith( "twitter4j") ) System.setProperty( t._1, t._2 ) ) Twitter4j requires credentials to be store in System properties
  • 36. ©2015 IBM Corporation Create a Twitter Stream //Filter the tweets to only keeps the one with english as the language //twitterStream is a discretized stream of twitter4j Status objects var twitterStream = org.apache.spark.streaming.twitter.TwitterUtils.createStream( ssc, None ) .filter { status => Option(status.getUser).flatMap[String] { u => Option(u.getLang) }.getOrElse("").startsWith("en") //Allow only tweets that use “en” as the language && CharMatcher.ASCII.matchesAllOf(status.getText) //Only pick text that are ASCII && ( keys.isEmpty || keys.exists{status.getText.contains(_)}) //If User specified #hashtags to monitor } Initial DStream of Status Objects
  • 37. ©2015 IBM Corporation Enrich the data with sentiment analysis from Watson Tone Analyzer //Broadcast the config to each worker node val broadcastVar = sc.broadcast(config) Initial DStream of Status Objects
  • 38. ©2015 IBM Corporation Enrich the data with sentiment analysis from Watson Tone Analyzer Initial DStream of Status Objects Data Model |-- author: string (nullable = true) |-- date: string (nullable = true) |-- lang: string (nullable = true) |-- text: string (nullable = true) |-- lat: integer (nullable = true) |-- long: integer (nullable = true) |-- Cheerfulness: double (nullable = true) |-- Negative: double (nullable = true) |-- Anger: double (nullable = true) |-- Analytical: double (nullable = true) |-- Confident: double (nullable = true) |-- Tentative: double (nullable = true) |-- Openness: double (nullable = true) |-- Agreeableness: double (nullable = true) |-- Conscientiousness: double (nullable = true) DStream of key, value pairs
  • 39. ©2015 IBM Corporation Aggregate data into RDD with enriched Data model ….. //Aggregate the data from each DStream into the working RDD rowTweets.foreachRDD( rdd => { if ( rdd.count() > 0 ){ workingRDD = sc.parallelize( rdd.map( t => t._1 ).collect()).union( workingRDD ) } }) Initial DStream RowTweets Initial DStream RowTweets Initial DStream RowTweets …. Microbatches Row 1 Row 2 Row 3 Row 4 … … Row n workingRDD Data Model |-- author: string (nullable = true) |-- date: string (nullable = true) |-- lang: string (nullable = true) |-- text: string (nullable = true) |-- lat: integer (nullable = true) |-- long: integer (nullable = true) |-- Cheerfulness: double (nullable = true) |-- Negative: double (nullable = true) |-- Anger: double (nullable = true) |-- Analytical: double (nullable = true) |-- Confident: double (nullable = true) |-- Tentative: double (nullable = true) |-- Openness: double (nullable = true) |-- Agreeableness: double (nullable = true) |-- Conscientiousness: double (nullable = true)
  • 40. ©2015 IBM Corporation Create SparkSQL DataFrame and register Table //Create a SparkSQL DataFrame from the aggregate workingRDD val df = sqlContext.createDataFrame( workingRDD, schemaTweets ) //Register a temporary table using the name "tweets" df.registerTempTable("tweets") println("A new table named tweets with " + df.count() + " records has been correctly created and can be accessed through the SQLContextvariable") println("Here's the schema for tweets") df.printSchema() (sqlContext, df) Row 1 Row 2 Row 3 Row 4 … … Row n workingRDD author date lang … Cheerfulnes s Negative … Conscientio usness John Smith 10/11/2015 – 20:18 en 0.0 65.8 … 25.5 Alfred … en 34.5 0.0 … 100.0 … … … … … … … … … … … … … … … … … … Chris … en 85.3 22.9 … 0.0 Relational SparkSQL Table
  • 41. ©2015 IBM Corporation Building a Spark Streaming application: Sentiment analysis with Twitter and Watson Tone Analyzer ‣IPython Notebook analysis 1. Load the data into an IPython Notebook 2. Analytic 1: Compute the distribution of tweets by sentiment scores greater than 60% 3. Analytic 2: Compute the top 10 hashtags contained in the tweets 4. Analytic 3: Visualize aggregated sentiment scores for the top 5 hashtags
  • 42. ©2015 IBM Corporation Load the data into an IPython Notebook ‣ You can follow along the steps here: https://github.com/ibm- cds-labs/spark.samples/blob/master/streaming- twitter/notebook/Twitter%20%2B%20Watson%20Tone %20Analyzer%20Part%202.ipynb Create a SQLContext from a SparkContext Load from parquet file and create a DataFrame Create a SQL table and start excuting SQL queries
  • 43. ©2015 IBM Corporation Analytic 1 - Compute the distribution of tweets by sentiment scores greater than 60% #create an array that will hold the count for each sentiment sentimentDistribution=[0] * 9 #For each sentiment, run a sql query that counts the number of tweets for which the sentiment score is greater than 60% #Store the data in the array for i, sentiment in enumerate(tweets.columns[-9:]): sentimentDistribution[i]=sqlContext.sql("SELECT count(*) as sentCount FROM tweets where " + sentiment + " > 60") .collect()[0].sentCount
  • 44. ©2015 IBM Corporation Analytic 1 - Compute the distribution of tweets by sentiment scores greater than 60% Use matplotlib to create a bar chart
  • 45. ©2015 IBM Corporation Analytic 1 - Compute the distribution of tweets by sentiment scores greater than 60% Bar Chart Visualization
  • 46. ©2015 IBM Corporation Analytic 2: Compute the top 10 hashtags contained in the tweets Initial Tweets RDD Filter hashtags Key, value pair RDD Reduced map with counts Sorted Map by key flatMap filter map reduceByKey sortByKey
  • 47. ©2015 IBM Corporation Analytic 2: Compute the top 10 hashtags contained in the tweets
  • 48. ©2015 IBM Corporation Analytic 2: Compute the top 10 hashtags contained in the tweets
  • 49. ©2015 IBM Corporation Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags ‣ Problem: - Compute the mean average all the emotion score for all the top 10 hastags - Format the data in a way that can be consumed by the plot script
  • 50. ©2015 IBM Corporation Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags #Step 1: Create RDD from tweets dataframe tagsRDD = tweets.map(lambda t: t ) author … Cheerfulness Jake … 0.0 Scrad … 23.5 Nittya Indika … 84.0 … … … … … … Madison … 93.0 tweets (Type: DataFrame) Row(author=u'Jake', …, text=u’@sarahwag…’, Cheerfulness=0.0, …) Row(author=u’Scrad', …, text=u’ #SuperBloodMoon https://t…’, Cheerfulness=23.5, …) Row(author=u’ Nittya Indika', …, text=u’ Good mornin! http://t.…’, Cheerfulness=84.0, …) … … Row(author=u’ Madison', …, text=u’ how many nights…’, Cheerfulness=93.0, …) tagsRDD (Type: RDD)
  • 51. ©2015 IBM Corporation Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags #Step 2: Filter to only keep the entries that are in top10tags tagsRDD = tagsRDD.filter( lambda t: any(s in t.text for s in [i[0] for i in top10tags] ) ) Row(author=u'Jake', …, text=u’@sarahwag…’, Cheerfulness=0.0, …) Row(author=u’Scrad', …, text=u’ #SuperBloodMoon https://t…’, Cheerfulness=23.5, …) Row(author=u’ Nittya Indika', …, text=u’ Good mornin! http://t.…’, Cheerfulness=84.0, …) … … Row(author=u’ Madison', …, text=u’ how many nights…’, Cheerfulness=93.0, …) Row(author=u'Mike McGuire', text=u'Explains my disappointment #SuperBloodMoon https://t.co/Gfg7vWws5W', …, Conscientiousness=0.0) Row(author=u'Meng_tisoy', text=u’…hihi #ALDUBThisMustBeLove https://t….’, …,Conscientiousness=68.0) Row(author=u'Kevin Contreras', text=u’…SILA! #ALDUBThisMustBeLove', …Conscientiousness=68.0) … … Row(author=u'abbi', text=u’…excited #ALDUBThisMustBeLove https://t…’,…, Conscientiousness=100.0)
  • 52. ©2015 IBM Corporation Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags #Step 3: Create a flatMap using the expand function defined above, this will be used to collect all the scores #for a particular tag with the following format: Tag-Tone-ToneScore cols = tweets.columns[-9:] def expand( t ): ret = [ ] for s in [i[0] for i in top10tags]: if ( s in t.text ): for tone in cols: ret += [s + u"-" + unicode(tone) + ":" + unicode(getattr(t, tone))] return ret tagsRDD = tagsRDD.flatMap( expand ) Row(author=u'Mike McGuire', text=u'Explains my disappointment #SuperBloodMoon https://t.co/Gfg7vWws5W', …, Conscientiousness=0.0) Row(author=u'Meng_tisoy', text=u’…hihi #ALDUBThisMustBeLove https://t….’, …,Conscientiousness=68.0) Row(author=u'Kevin Contreras', text=u’…SILA! #ALDUBThisMustBeLove', …Conscientiousness=68.0) … Row(author=u'abbi', text=u’…excited #ALDUBThisMustBeLove https://t…’,…, Conscientiousness=100.0) u'#SuperBloodMoon-Cheerfulness:0.0' u'#SuperBloodMoon-Negative:100.0’ u'#SuperBloodMoon-Negative:23.5' … u'#ALDUBThisMustBeLove-Analytical:85.0’ FlatMap of encoded values
  • 53. ©2015 IBM Corporation Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags #Step 4: Create a map indexed by Tag-Tone keys tagsRDD = tagsRDD.map( lambda fullTag : (fullTag.split(":")[0], float( fullTag.split(":")[1]) )) u'#SuperBloodMoon-Cheerfulness:0.0' u'#SuperBloodMoon-Negative:100.0’ u'#SuperBloodMoon-Negativer:23.5' … u'#ALDUBThisMustBeLove-Analytical:85.0’ u'#SuperBloodMoon- Cheerfulness' 0.0 u'#SuperBloodMoon-Negative’ 100.0 u'#SuperBloodMoon-Negative' 23.5 … u'#ALDUBThisMustBeLove’ 85.0 map
  • 54. ©2015 IBM Corporation Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags #Step 5: Call combineByKey to format the data as follow #Key=Tag-Tone, Value=(count, sum_of_all_score_for_this_tone) tagsRDD = tagsRDD.combineByKey((lambda x: (x,1)), (lambda x, y: (x[0] + y, x[1] + 1)), (lambda x, y: (x[0] + y[0], x[1] + y[1]))) u'#SuperBloodMoon- Cheerfulness' 0.0 u'#SuperBloodMoon-Negative’ 100.0 u'#SuperBloodMoon-Negative' 23.5 … u'#ALDUBThisMustBeLove’ 85.0 u'#Supermoon-Confident’ (0.0, 3) u'#HajjStampede-Tentative’ (0.0, 3) u'#KiligKapamilya- Conscientiousness’ (290.0, 6) … u'#LunarEclipse-Tentative’ (92.0, 4) CreateCombiner: Create list of tuples (sum,count) mergeValue: called for each new value (sum, count) MergeCombiner: reduce part, merge 2 combiners
  • 55. ©2015 IBM Corporation Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags #Step 6 : ReIndex the map to have the key be the Tag and value be (Tone, Average_score) tuple #Key=Tag #Value=(Tone, average_score) tagsRDD = tagsRDD.map(lambda (key, ab): (key.split("-")[0], (key.split("-")[1], round(ab[0]/ab[1],2)))) u'#Supermoon-Confident’ (0.0, 3) u'#HajjStampede-Tentative’ (0.0, 3) u'#KiligKapamilya- Conscientiousness’ (290.0, 6) … u'#LunarEclipse-Tentative’ (92.0, 4) u'#Supermoon-Confident’ (u'Confident', 0.0) u'#HajjStampede-Tentative’ (u'Tentative', 0.0) u'#KiligKapamilya- Conscientiousness’ (u'Conscientiousness', 48.33) … u'#LunarEclipse-Tentative’ (u'Tentative', 23.0)
  • 56. ©2015 IBM Corporation Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags #Step 7: Reduce the map on the Tag key, value becomes a list of (Tone,average_score) tuples tagsRDD = tagsRDD.reduceByKey( lambda x, y : makeList(x) + makeList(y) ) u'#Supermoon-Confident’ (u'Confident', 0.0) u'#HajjStampede-Tentative’ (u'Tentative', 0.0) u'#KiligKapamilya- Conscientiousness’ (u'Conscientiousness', 48.33) … u'#LunarEclipse-Tentative’ (u'Tentative', 23.0) u'#HajjStampede' [(u'Tentative', 0.0), (u'Agreeableness', 3.67), …, (u'Cheerfulness', 100.0)] u'#Supermoon' [(u'Confident', 0.0), (u'Openness', 91.0), …, (u'Agreeableness', 20.33)] u'#bloodmoon' [(u'Anger', 0.0), (u'Negative', 0.0), …, (u'Openness', 38.0)] … u'#KiligKapamilya ' [(u'Conscientiousness', 48.33), (u'Anger', 0.0),... (u'Agreeableness', 10.83)]
  • 57. ©2015 IBM Corporation Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags #Step 8 : Sort the (Tone,average_score) tuples alphabetically by Tone tagsRDD = tagsRDD.mapValues( lambda x : sorted(x) ) u'#HajjStampede' [(u'Tentative', 0.0), (u'Agreeableness', 3.67), …, (u'Cheerfulness', 100.0)] u'#Supermoon' [(u'Confident', 0.0), (u'Openness', 91.0), …, (u'Agreeableness', 20.33)] u'#bloodmoon' [(u'Anger', 0.0), (u'Negative', 0.0), …, (u'Openness', 38.0)] … u'#KiligKapamilya ' [(u'Conscientiousness', 48.33), (u'Anger', 0.0),... (u'Agreeableness', 10.83)] u'#HajjStampede' [(u'Agreeableness', 3.67), (u'Cheerfulness', 100.0),… (u'Tentative', 0.0),] u'#Supermoon' [(u'Agreeableness', 20.33), (u'Confident', 0.0),..., (u'Openness', 91.0)] u'#bloodmoon' [(u'Anger', 0.0), (u'Negative', 0.0), …, (u'Openness', 38.0)] … u'#KiligKapamilya' [(u'Agreeableness', 10.83), (u'Anger', 0.0)(u'Conscientiousness', 48.33),,...]
  • 58. ©2015 IBM Corporation Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags #Step 9 : Format the data as expected by the plotting code in the next cell. #map the Values to a tuple as follow: ([list of tone], [list of average score]) tagsRDD = tagsRDD.mapValues( lambda x : ([elt[0] for elt in x],[elt[1] for elt in x]) ) u'#HajjStampede' [(u'Agreeableness', 3.67), (u'Cheerfulness', 100.0),… (u'Tentative', 0.0),] u'#Supermoon' [(u'Agreeableness', 20.33), (u'Confident', 0.0),..., (u'Openness', 91.0)] u'#bloodmoon' [(u'Anger', 0.0), (u'Negative', 0.0), …, (u'Openness', 38.0)] … u'#KiligKapamilya' [(u'Agreeableness', 10.83), (u'Anger', 0.0)(u'Conscientiousness', 48.33),,...] u'#HajjStampede' ([u'Agreeableness’,u'Cheerfulness’,… u'Tentative’], [3.67, 100.0,…0.0]) u'#Supermoon' ([u'Agreeableness’,u'Confident',..., u'Openness’],[20.33, 0.0,… 91.0]) u'#bloodmoon' ([u'Anger’,u'Negative', …, u'Openness’), [0.0, 0.0,…38.0]) … u'#KiligKapamilya' ([u'Agreeableness’,u'Anger’, u'Conscientiousness',...],[10.83, 0.0,48.33,...]) Value is a tuple of 2 arrays: tones-scores
  • 59. ©2015 IBM Corporation Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags #Step 10 : Use custom sort function to sort the entries by order of appearance in top10tags def customCompare( key ): for (k,v) in top10tags: if k == key: return v return 0 tagsRDD = tagsRDD.sortByKey(ascending=False, numPartitions=None, keyfunc = customCompare) u'#HajjStampede' ([u'Agreeableness’,u'Cheerfulness’,… u'Tentative’], [3.67, 100.0,…0.0]) u'#Supermoon' ([u'Agreeableness’,u'Confident',..., u'Openness’],[20.33, 0.0,… 91.0]) u'#bloodmoon' ([u'Anger’,u'Negative', …, u'Openness’), [0.0, 0.0,…38.0]) … u'#KiligKapamilya' ([u'Agreeableness’,u'Anger’, u'Conscientiousness',...],[10.83, 0.0,48.33,...]) u'#Superbloodmon' ([u'Agreeableness’,u'Cheerfulness’,… u'Tentative’], [33.97, 19.38,…12.85]) u'#BBWLA' ([u'Agreeableness’,u'Confident',..., u'Openness’],[38.33, 12.34,… 21.43]) u'#ALDUBThisMust BeLove' ([u'Anger’,u'Negative', …, u'Openness’), [0.0, 0.0,…62.0]) … u'#Newmusic' ([u'Agreeableness’,u'Anger’, u'Conscientiousness',...],[0.0, 0.0,68.33,...])
  • 60. ©2015 IBM Corporation Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags
  • 61. ©2015 IBM Corporation Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags
  • 62. ©2015 IBM Corporation Agenda • Introduction • Quick Introduction to Spark • Set up development environment and create the hello world application • Notebook Walk-through • Spark Streaming • Deep dive: Sentiment analysis with Twitter and Watson Tone Analyzer • Architectural Overview • Set up the Bluemix services: Watson Tone Analyzer, Message Hub and Event Hub • Create the Streaming Receiver to connect to Kafka (Scala) • Create analytics using Jupyter Notebook (Python) • Create Real-time Web Dashboard (Nodejs)
  • 63. Spark Streaming with “IBM Insight for Twitter” and “Watson Tone Analyzer”
    [Architecture diagram. Labels: Watson Tone Analyzer Service (Bluemix); Producer Stream; Enrich data with Emotion Tone Scores; Processed data; Scala Notebook; IPython Notebook; Consumer Stream; Message Hub Service (Bluemix); Full Archive Search API; Consumer Spark Topics; Publish topics from Spark analytics results; Event Hub Service (Bluemix); Real-Time Dashboard; Data Engineer; Business Analyst; C(Suite); Data Scientist]
  • 64. Real-Time Web app Dashboard
    ‣ Pie chart showing the distribution of the top hashtags
    ‣ Bar chart showing the distribution of tone scores for each of the top hashtags
  • 65. Create a Receiver that subscribes to Kafka topics
    ‣ Store new record into DStream
    ‣ Get batch of new records
    ‣ MessageHub on Bluemix requires Kafka 0.9
  • 66. Create Kafka DStream
    ‣ Implicit conversion to synthetically add a method to StreamingContext
  • 67. Enrich Tweets with Watson Scores
    ‣ Get Tone scores
    ‣ Map to new EnrichedTweet Object
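The enrichment step on this slide (get tone scores, then map each tweet to a new object) can be sketched in Python. The deck's actual receiver is written in Scala and its code is not in this transcript, so everything here is an assumption: the `EnrichedTweet` fields, the `enrich` helper, and `fake_tone_scores`, which stands in for the Watson Tone Analyzer call and returns made-up values.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class EnrichedTweet:
    """Hypothetical shape: the original tweet fields plus a tone -> score mapping."""
    author: str
    text: str
    tone_scores: Dict[str, float]


def fake_tone_scores(text: str) -> Dict[str, float]:
    """Placeholder for the Watson Tone Analyzer request; the scores are invented."""
    return {"Cheerfulness": 100.0 if "!" in text else 0.0}


def enrich(tweets: List[Tuple[str, str]],
           get_scores: Callable[[str], Dict[str, float]] = fake_tone_scores) -> List[EnrichedTweet]:
    """Map each (author, text) pair to an EnrichedTweet carrying its tone scores."""
    return [EnrichedTweet(author, text, get_scores(text)) for author, text in tweets]


enriched = enrich([("alice", "Supermoon tonight!"), ("bob", "cloudy here")])
for tweet in enriched:
    print(tweet)
```

Passing the scoring function as a parameter keeps the mapping step testable without network access; a real client would replace `fake_tone_scores`.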
  • 68. Streaming analytics
    ‣ Prepare for Map/Reduce
    ‣ Map tag-tone to corresponding score
    ‣ Compute Count + Average for each score
    ‣ Map each tag to count + List of score averages
    ‣ Reduce
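The map/reduce steps listed on this slide can be illustrated on a single in-memory batch. This is a sketch in plain Python, not the deck's Scala streaming code; the input tweets and the output shape `(count, [(tone, average), ...])` are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical enriched tweets: (list of hashtags, {tone: score}).
tweets = [
    (["#Supermoon"], {"Openness": 90.0, "Anger": 0.0}),
    (["#Supermoon", "#bloodmoon"], {"Openness": 92.0, "Anger": 10.0}),
]

# Map: emit ((tag, tone), (1, score)) for every tag-tone combination.
mapped = [((tag, tone), (1, score))
          for tags, scores in tweets
          for tag in tags
          for tone, score in scores.items()]

# Reduce by key: sum the counts and the scores for each (tag, tone).
totals = defaultdict(lambda: (0, 0.0))
for key, (count, score) in mapped:
    running_count, running_sum = totals[key]
    totals[key] = (running_count + count, running_sum + score)

# Regroup by tag, keeping the count and the average of each tone.
by_tag = defaultdict(list)
for (tag, tone), (count, total) in totals.items():
    by_tag[tag].append((tone, count, total / count))

# Final shape per tag: (tweet count, sorted list of (tone, average score)).
result = {tag: (vals[0][1], sorted((tone, avg) for tone, _, avg in vals))
          for tag, vals in by_tag.items()}
print(result)
```

Carrying the count next to the sum is what makes the average recoverable after the reduce, and it is also the piece of information a later batch needs in order to update that average.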
  • 69. Maintain State between micro-batch RDDs
    ‣ Maintain state between micro-batches by recomputing the count and list of averages
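Recomputing the count and averages across micro-batches amounts to a count-weighted merge. The sketch below assumes the state for one hashtag is `(count, {tone: average})`; the function name `merge_state` is hypothetical, and the slide does not show which Spark Streaming API the original uses to carry the state.

```python
def merge_state(state, batch):
    """Combine the running (count, {tone: average}) with a new micro-batch result."""
    if state is None:
        # First batch seen for this hashtag: nothing to merge yet.
        return batch
    old_count, old_avgs = state
    new_count, new_avgs = batch
    total = old_count + new_count
    merged = {}
    for tone in set(old_avgs) | set(new_avgs):
        # Weight each side's average by its tweet count.
        # Simplification: a tone missing on one side is treated as an average of 0.0.
        weighted_sum = (old_avgs.get(tone, 0.0) * old_count
                        + new_avgs.get(tone, 0.0) * new_count)
        merged[tone] = weighted_sum / total
    return (total, merged)


# Three hypothetical micro-batches for a single hashtag.
batches = [(2, {"Openness": 90.0}), (1, {"Openness": 60.0}), (1, {"Openness": 80.0})]

state = None
for batch in batches:
    state = merge_state(state, batch)
print(state)
```

Averaging the batch averages directly would give the wrong answer whenever batches differ in size, which is why the count has to be kept in the state.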
  • 70. Produce Streaming analytics topic data
    ‣ Can’t call the Kafka Producer from the streaming analytic because it is not serializable
    ‣ Post message to queue
    ‣ Process message queue from separate Thread
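The queue-plus-thread pattern described on this slide can be sketched with Python's standard library. This is an illustration of the pattern only, not the deck's Scala code: `fake_send` stands in for the Kafka producer, the `sent` list stands in for the broker, the payloads are invented, and the topic names are the two listed for the dashboard.

```python
import queue
import threading

outbox = queue.Queue()  # the analytics code only ever touches this queue
sent = []               # stands in for messages delivered to Kafka


def fake_send(topic, payload):
    """Placeholder for a Kafka producer send call."""
    sent.append((topic, payload))


def drain(q, send):
    """Worker loop: forward queued messages until a None sentinel arrives."""
    while True:
        item = q.get()
        if item is None:
            break
        topic, payload = item
        send(topic, payload)


# The producer lives only in this thread, so it never has to be serialized.
worker = threading.Thread(target=drain, args=(outbox, fake_send))
worker.start()

# The analytic side posts results to the queue instead of calling the producer.
outbox.put(("topHashTags", '{"#Supermoon": 2}'))
outbox.put(("topHashTags.toneScores", '{"#Supermoon": {"Openness": 91.0}}'))
outbox.put(None)  # sentinel: stop the worker
worker.join()
print(sent)
```

The queue decouples the two sides: the analytic hands over plain data, and the non-serializable producer object stays confined to the worker thread.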
  • 71. Real-time web app dashboard
    ‣ Technology used:
      - Mozaik (https://github.com/plouc/mozaik)
      - ReactJS
      - WebSocket
      - D3JS/C3JS
    ‣ Consume topics generated by Spark Streaming analytics
    [Diagram labels: Consumer Spark Topics; Real-Time Dashboard]
    Topics:
      • topHashTags
      • topHashTags.toneScores
  • 72. Access MessageHub API through message-hub-rest node module
  • 73. React Components for Mozaik framework