David Taieb
STSM - IBM Cloud Data Services
Developer advocate
david_taieb@us.ibm.com
HANDS-ON SESSION:
DEVELOPING ANALYTIC APPLICATIONS
USING APACHE SPARK™ AND PYTHON

Part 1: Flight Delay Predict with Spark ML
PyCon 2016, Portland
©2016 IBM Corporation
	
Agenda
•  Pre-requisite steps to be completed before the session
•  Flight Predict app description and architecture
•  Train the models in the Notebook
•  Accuracy Analysis and models refinement
•  Deploy and run the models
	
Sign up for Bluemix
•  Access the IBM Bluemix website at https://console.ng.bluemix.net
•  Click on Get Started for Free
•  Complete the form and click Create account
•  Look for the confirmation email and click on the "confirm your account" link
	
Sign up for a free trial at Flightstats.com
•  Sign up at https://developer.flightstats.com/signup
•  Fill out the form and monitor your email for the confirmation link (access to the APIs may take up to 24 hours)
•  Once access is granted, go to https://developer.flightstats.com/admin/applications to view your appId and appKey (you will need them in the simple-data-pipe tool to create the training sets)
•  Optional: get familiar with the various flightstats APIs:
–  https://developer.flightstats.com/api-docs/scheduledFlights/v1
–  https://developer.flightstats.com/api-docs/airports/v1
	
Where to find the FlightStats app id and app key
APP ID
APP Key
	
Create a new space on Bluemix
In preparation for running the project, we create a new space on Bluemix.
Optional: you can skip this step if you already have a space with a Spark instance that you would like to reuse.
	
Create a Spark Instance
	
Create New Spark Instance
	
Flight App Project Description
•  Use case
–  Flight delays are a common disturbance during business trips
–  Being able to predict how likely a flight is to be delayed can remove uncertainty and enable users to plan around it
–  Idea: weather data can be a good explanatory variable for building predictive models
•  Implementation
–  Combine flight statistics from flightstats.com (system of record) with weather data from IBM Insight for Weather (system of operations) to build training, test and blind sets
–  Use Spark MLlib to train predictive models and cross-validate them
–  Create a custom card for Google Now that will automatically notify the user of an impending flight delay
–  Propose alternative flight routes (e.g. Freebird)
	
Get/Build/Analyze methodology
	
Flight Predict App Architecture
[Architecture diagram labels: Weather; Simple Data Pipes; Airports; Flight Schedules; Flight Status; Metadata; Training Set; Test Set; Blind Set; Custom Connector, run every 24 hours; Notebook]
	
Flow Diagram
[Flow diagram: Data Acquisition; Data Preparation (Cleansing, Shaping, Enrichment); Data Annotation (Ground Truth); Model Training; Model Testing; Deploy and Run Model. Training, Test and Blind Sets; iterative cross-validation: Train Model, Evaluate Performance and optimize model]
•  Iterative in nature: we are never done!
•  We will be using this diagram as a roadmap throughout this course
	
Get the data and build the training/test/blind sets
In this step we'll use the Simple Data Pipes open source project to acquire data from Flightstats, combine it with weather data from IBM Insight for Weather, and save the data sets into a Cloudant NoSQL database.
[Flow diagram repeated, see the "Flow Diagram" slide]
	
Acquiring the data
•  In the next section, we show how to acquire the training data by using the simple-data-pipe tool and the flight predict connector.
•  The flight predict connector combines historical flight data from flightstats.com with weather data from IBM Insight for Weather
•  If you want to skip these steps, you can use the already built dataset with the following credentials:
–  cloudantHost: dtaieb.cloudant.com
–  cloudantUserName: weenesserliffircedinvers
–  cloudantPassword: 72a5c4f939a9e2578698029d2bb041d775d088b5
	
Deploy simple-data-pipe with flightstats connector
•  Go to https://github.com/ibm-cds-labs/simple-data-pipe
•  Click on the Deploy to Bluemix button
Clicking the button will take you to Bluemix
	
Complete simple-data-pipe deployment
	
Add an instance of IBM Weather Service on Bluemix
•  Return to the application dashboard
•  The Weather service is required by the flight predict connector and must be installed first
•  From the app dashboard, click on Add a service or API
	
Create an instance of IBM Weather Service on Bluemix
Search for Weather
Make sure to select the "premium plan" to have enough authorized API calls
	
Checkpoint: simple data pipe app dashboard
•  Verify that your app is correctly bound to the right services
Weather Service: used to enrich flight records with weather observations
Cloudant Service: used to store training, test and blind data sets
You'll need to click on this button for the step on the next page
It is recommended to increase the app memory to 1GB
	
Install flight predict connector
•  Click the Edit Code button, then edit package.json to add the flight predict module to the dependencies:
–  "simple-data-pipe-connector-flightstats":"git://github.com/ibm-cds-labs/simple-data-pipe-connector-flightstats.git"
Don't forget to add a comma at the end of the preceding line to keep the JSON valid
Save your changes
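The manual edit above can also be checked programmatically. The following is a minimal sketch, assuming package.json keeps its modules under a top-level "dependencies" key; the sample package.json content is made up for illustration, and only the connector name and URL come from the slide.

```python
import json

# Connector entry taken from the slide.
CONNECTOR_NAME = "simple-data-pipe-connector-flightstats"
CONNECTOR_URL = "git://github.com/ibm-cds-labs/simple-data-pipe-connector-flightstats.git"


def add_connector(package_json_text):
    """Return package.json text with the flight predict connector added."""
    package = json.loads(package_json_text)   # raises ValueError on invalid JSON
    dependencies = package.setdefault("dependencies", {})
    dependencies[CONNECTOR_NAME] = CONNECTOR_URL
    return json.dumps(package, indent=2)


# Hypothetical, shortened package.json used only for illustration.
sample = '{"name": "simple-data-pipe", "dependencies": {"express": "^4.0.0"}}'
updated = add_connector(sample)
print(updated)
```

Because the text is parsed and re-serialized, a missing comma is caught by json.loads rather than at deploy time.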
	
Install flight predict connector
•  Click File/Save to save your changes
	
Redeploy simple data pipe app
•  Use the live edit editor to redeploy the app
	
Verify connector install
•  In this step, we verify through the UI that the flight predict connector is correctly installed
Flight connector correctly installed
	
Create a new FlightStats pipe
•  Follow each screen to create and configure a new pipe
	
Run the pipe
•  Skip over the schedule tab
•  In the activity tab, click on Run Now to start the pipe
Click Run Now, then open the log to monitor the activity
	
Explore the data sets
•  In this step, we take a moment to explore the different data sets that have been created by the simple data pipe tool
•  From the Bluemix dashboard, click on the Cloudant service tile, then on the Launch button
•  From the Cloudant dashboard, open the training database
•  Open a document to look at the data structure
	
Run the pipe again to build the test set
	
Train the Models
•  In the previous section we created the training data; we are now ready to train the models.
•  Steps in this section:
–  Create an IPython Notebook
–  Load the data sets from the Cloudant database into a Spark cluster
–  Explore the data and train the machine learning models
[Flow diagram repeated, see the "Flow Diagram" slide]
	
Create a new IPython Notebook
	
Notebook tour
	
Notebook tour: Notebook Info
	
Notebook tour: Environment
	
Notebook tour: Sharing
	
Before we start building the app…
•  You can optionally follow this tutorial from GitHub by using a fully built notebook:
–  https://github.com/ibm-cds-labs/simple-data-pipe-connector-flightstats/blob/master/notebook/Flight%20Predict%20PyCon%202016.ipynb
	
Optional: use prebuilt notebook
•  Create notebook from URL
•  Use https://github.com/ibm-cds-labs/simple-data-pipe-connector-flightstats/raw/master/notebook/Flight%20Predict%20PyCon%202016.ipynb
	
Using Python Packages
•  Write code inline within cells
•  Encapsulate helper APIs within a Python package
•  Two ways of using helper Python packages
–  Egg distribution package: pip install from a PyPI server or file server (e.g. GitHub)
    •  Persistent install across sessions
    •  Recommended in production
–  SparkContext.addPyFile
    •  Easy addition of a Python module file
    •  Supports multiple module files via zip format
    •  Recommended during development, where frequent code changes occur
	
Flight Predict Python Package on Github
Setup script for installing the Python package
Flight Predict Python library
	
Method 1: Install Flight Predict Package
•  Use pip to install the Flight Predict package
•  Recommended alternative: build an egg distribution package and deploy it to PyPI
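The slide's screenshot is not preserved, so the exact command is unknown. As a hedged sketch, a pip install straight from a Git repository could be assembled as follows; the repository URL suffix and the egg name `flightPredict` are assumptions, and the helper names are hypothetical.

```python
import subprocess
import sys

# Assumed repository URL and egg name; adjust to the package you actually install.
PACKAGE_URL = (
    "git+https://github.com/ibm-cds-labs/"
    "simple-data-pipe-connector-flightstats.git#egg=flightPredict"
)


def pip_install_command(package_url, user_install=True):
    """Build the pip command line as a list of arguments."""
    command = [sys.executable, "-m", "pip", "install"]
    if user_install:
        command.append("--user")   # install without admin rights
    command.append(package_url)
    return command


def pip_install(package_url):
    """Run pip; raises CalledProcessError if the install fails."""
    subprocess.check_call(pip_install_command(package_url))


# Only print the command here; call pip_install(PACKAGE_URL) to execute it.
print(" ".join(pip_install_command(PACKAGE_URL)))
```

In a notebook cell the same command is usually typed directly after a `!` prefix; building it as a list simply keeps the sketch runnable outside a notebook.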
	
Manage Python packages
•  Check status
•  Uninstall package
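One possible way to perform both operations from Python is sketched here, assuming a Python 3.8+ kernel (for `importlib.metadata`); the distribution name `flightPredict` is an assumption and the helper names are hypothetical.

```python
import sys
from importlib import metadata


def installed_version(distribution_name):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None


def pip_uninstall_command(distribution_name):
    """Build the pip command that removes a package without prompting."""
    return [sys.executable, "-m", "pip", "uninstall", "-y", distribution_name]


# "flightPredict" is an assumed distribution name.
print(installed_version("flightPredict"))
print(" ".join(pip_uninstall_command("flightPredict")))
```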
	
Method 2: Install py modules via sc.addPyFile
•  addPyFile installs individual py modules and makes them available to all executor processes
•  Works with modules in zipped files
Module containing APIs for training the models
Module containing APIs for running the models
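A minimal sketch of the zip-based variant follows, assuming an existing SparkContext is passed in as `sc`. The module file names (`training.py`, `run.py`) and the archive name are placeholders, not the project's actual files.

```python
import os
import tempfile
import zipfile


def zip_modules(module_paths, zip_path):
    """Bundle .py files at the top level of a zip archive."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in module_paths:
            # arcname drops the directory so each module imports by file name
            archive.write(path, arcname=os.path.basename(path))
    return zip_path


def ship_modules(sc, zip_path):
    """Distribute the archive to the executors; sc is an existing SparkContext."""
    sc.addPyFile(zip_path)


# Self-contained demo with two placeholder modules (hypothetical names).
work_dir = tempfile.mkdtemp()
paths = []
for name in ("training.py", "run.py"):
    path = os.path.join(work_dir, name)
    with open(path, "w") as handle:
        handle.write("# placeholder module\n")
    paths.append(path)

bundle = zip_modules(paths, os.path.join(work_dir, "flightpredict_modules.zip"))
print(zipfile.ZipFile(bundle).namelist())
```

In the notebook, `ship_modules(sc, bundle)` would then be called with the SparkContext that the environment provides.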
	
Setup credentials and Import required python modules
In this step, we import the Python modules that will be needed throughout the notebook and set up credentials for the various services.
Credentials for the Cloudant NoSQL service
Credentials for the Weather service
	
Get Credentials for Cloudant
From the app dashboard, click on Environment Variables in the left sidebar
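The Environment Variables page typically shows a VCAP_SERVICES JSON document. The sketch below assumes that layout (a service label mapping to a list of instances, each with a "credentials" object) and that the Cloudant label is `cloudantNoSQLDB`; all values in the sample are made up.

```python
import json


def service_credentials(vcap_services_text, service_label):
    """Return the credentials dict of the first instance of a service."""
    services = json.loads(vcap_services_text)
    instances = services.get(service_label, [])
    if not instances:
        raise KeyError("no service bound with label %r" % service_label)
    return instances[0]["credentials"]


# Made-up values in the shape assumed for VCAP_SERVICES.
sample_vcap = json.dumps({
    "cloudantNoSQLDB": [{
        "name": "my-cloudant",
        "credentials": {
            "host": "example.cloudant.com",
            "username": "example-user",
            "password": "example-password",
        },
    }]
})

cloudant = service_credentials(sample_vcap, "cloudantNoSQLDB")
print(cloudant["host"], cloudant["username"])
```

The same helper would work for the Weather service once its label is read off the dashboard.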
	
Get Credentials for Weather
	
Load training set in Spark SQL DataFrame
…	
In this step, we use the cloudant-spark connector (https://github.com/cloudant-labs/spark-cloudant) to load data into Spark.
Make sure to change the db name to match the one created for your training set by your activity (open the Cloudant dashboard to find the name)
	
Loading data: Behind the scene
Use the Spark SQL connector to load data into a DataFrame
Connector id
Options
Cache data for optimized reuse
Create temp SQL table
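The four callouts above map onto a short sequence of Spark SQL calls. This is a sketch rather than the notebook's actual code: it assumes a Spark 1.x `sqlContext` is passed in, uses the `com.cloudant.spark` connector id and `cloudant.*` option names documented by the spark-cloudant project, and the function and table names are hypothetical.

```python
def load_cloudant_table(sql_context, host, username, password, db_name,
                        table_name="training"):
    """Load a Cloudant database into a cached DataFrame and register it."""
    data_frame = (
        sql_context.read
        .format("com.cloudant.spark")               # connector id
        .option("cloudant.host", host)              # options
        .option("cloudant.username", username)
        .option("cloudant.password", password)
        .load(db_name)
    )
    data_frame.cache()                              # cache data for reuse
    data_frame.registerTempTable(table_name)        # temp SQL table (Spark 1.x API)
    return data_frame
```

Once registered, the table can be queried with `sql_context.sql("SELECT ... FROM training")`. On Spark 2.x and later, `createOrReplaceTempView` replaces `registerTempTable`.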
	
Scatter plot visualization
	
Visualization api
	
Transform into an RDD of LabeledPoint
Use the Spark SQL connector to load data into a DataFrame
	
loadLabeledDataRDD api
	
Machine Learning Algorithms
Supervised Learning (requires ground truth)
–  Continuous output
    •  Regression: Linear, Ridge, Lasso, Isotonic
    •  Decision Tree
    •  RandomForest
    •  GradientBoostedTree
–  Discrete output
    •  Classification: Logistic Regression, SVM, NaiveBayes
    •  Decision Tree
    •  RandomForest
    •  GradientBoostedTree
    •  K-NN (available as add-on Spark package)
Unsupervised Learning (no ground-truth data required)
    •  Clustering: KMeans, Gaussian Mixture
    •  Dimensionality Reduction: PCA, SVD
    •  FP-Growth
	
Train Logistic Regression Model
	
Train NaiveBayes Model
	
Train Decision Tree Model
	
Train Random Forest Model
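The code on the four training slides is not preserved. As a rough sketch, the corresponding Spark MLlib (RDD-based) calls look like this, assuming `training_rdd` is an RDD of LabeledPoint; the function name and all hyperparameter values are illustrative assumptions, not the author's settings.

```python
def train_all_models(training_rdd, num_classes=5):
    """Train the four classifiers on an RDD of LabeledPoint.

    Imports are local so this sketch can be loaded where pyspark is absent.
    """
    from pyspark.mllib.classification import (
        LogisticRegressionWithLBFGS, NaiveBayes)
    from pyspark.mllib.tree import DecisionTree, RandomForest

    return {
        "LogisticRegression": LogisticRegressionWithLBFGS.train(
            training_rdd, iterations=100, numClasses=num_classes),
        # MLlib's NaiveBayes expects non-negative feature values
        "NaiveBayes": NaiveBayes.train(training_rdd),
        "DecisionTree": DecisionTree.trainClassifier(
            training_rdd, numClasses=num_classes,
            categoricalFeaturesInfo={}, impurity="gini",
            maxDepth=5, maxBins=32),
        "RandomForest": RandomForest.trainClassifier(
            training_rdd, numClasses=num_classes,
            categoricalFeaturesInfo={}, numTrees=10,
            impurity="gini", maxDepth=5, maxBins=32),
    }
```

Each returned model exposes a `predict` method that accepts a feature vector or an RDD of feature vectors.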
	
Naïve Bayes vs Decision Tree
Naïve Bayes
•  Probabilistic: computes the probability that a data instance belongs to a specific class
•  Assumes that each feature (variable) is independent of the others
•  Performance depends on the predictive nature of the features (non-predictive features will affect the accuracy)
•  Works well with a small amount of training data; doesn't need all the possibilities
•  Doesn't work with categorical features
Decision Tree
•  Non-probabilistic: partitions the data into subsets that best describe the variable
•  The deeper the tree, the better the model fits the data
•  Watch out for overfitting: need to prune the tree
•  Can handle categorical or continuous features
•  No need for input to be scaled or standardized: set your features and go!
•  Requires a lot of data covering all possibilities
	
Accuracy Analysis of the Machine Learning Models
In this section, we will perform accuracy analysis on the test data. We will start by computing the accuracy metrics for each model, including the confusion matrix. We will then use a histogram chart to understand the data distribution and refine how the classes are computed.
[Flow diagram repeated, see the "Flow Diagram" slide]
	
Load Test data
Make sure to change the db name to match the one created for your test set by your activity (open the Cloudant dashboard to find the name)
	
Accuracy Metrics
	
Confusion Matrix
	
Accuracy metrics API
Output HTML
Display results HTML in Notebook cell
Compute metrics from labeled and prediction data
Get the confusion matrix and build HTML table
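To make the callouts concrete, here is a plain-Python sketch of the same steps: count (actual, predicted) pairs into a confusion matrix, derive accuracy from its diagonal, and render an HTML table. It is not the project's API; the function names and sample data are hypothetical, and the Spark version would more likely rely on MLlib's `MulticlassMetrics`.

```python
from collections import Counter


def confusion_matrix(labels, predictions, num_classes):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(labels, predictions))
    return [[counts[(actual, predicted)] for predicted in range(num_classes)]
            for actual in range(num_classes)]


def accuracy(matrix):
    """Fraction of records on the diagonal (correct predictions)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


def matrix_to_html(matrix):
    """Render the matrix as a plain HTML table."""
    rows = "".join(
        "<tr>" + "".join("<td>%d</td>" % cell for cell in row) + "</tr>"
        for row in matrix)
    return "<table>" + rows + "</table>"


labels = [0, 0, 1, 1, 2, 2]
predictions = [0, 1, 1, 1, 2, 0]
matrix = confusion_matrix(labels, predictions, 3)
print(matrix)            # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
print(accuracy(matrix))
print(matrix_to_html(matrix))
```

In a notebook cell, wrapping the string in `IPython.display.HTML(...)` displays the table inline.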
	
Understand the distribution of your data with Histograms
	
Training Handler class
•  Provide flexibility and extensibility to the application
•  Provide a fail-fast, try-something-else mechanism
•  Enable the user to easily customize classes of data based on how the data is distributed
•  Enable the user to easily add training features
	
Default Training Handler class
Return description for each class
Return total number of classes: default is 5
Re-classify a record: default uses the s.classification field in the JSON record
Extra feature names to be added: none by default
Extra features to be added: array must match the one returned by customTrainingFeaturesNames
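The five responsibilities listed above suggest a small class with one method each. The sketch below is an illustrative stand-in, not the library's code: only `customTrainingFeaturesNames` is named in the slides, so every other method name, the class name, the placeholder labels, and the dict-shaped record are assumptions.

```python
class DefaultTrainingHandlerSketch(object):
    """Illustrative handler; names other than customTrainingFeaturesNames are assumed."""

    def numClasses(self):
        return 5                              # default number of classes

    def getClassLabel(self, value):
        return "Class %d" % int(value)        # placeholder description

    def computeClassification(self, record):
        return record["classification"]       # default: reuse the stored class

    def customTrainingFeaturesNames(self):
        return []                             # no extra features by default

    def customTrainingFeatures(self, record):
        return []                             # must align with the names above


def build_features(record, base_features, handler):
    """Append handler-provided features to the base feature vector."""
    extra = handler.customTrainingFeatures(record)
    if len(extra) != len(handler.customTrainingFeaturesNames()):
        raise ValueError("feature values and names must have the same length")
    return list(base_features) + list(extra)


handler = DefaultTrainingHandlerSketch()
record = {"classification": 2}                # hypothetical record shape
print(handler.computeClassification(record))
print(build_features(record, [1.0, 0.5], handler))
```

Keeping the length check in one place is a design choice of this sketch: it surfaces a mismatch between feature names and values early instead of during training.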
	
Customize Training Handler
Provide new classification and add day of departure as a new feature
Inherit from defaultTrainingHandler
Add day of the week using a technique called dummy coding
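Dummy coding turns a categorical value such as the weekday into 0/1 indicator features. A minimal sketch, assuming the departure date is available as a `datetime.date`; the function and feature names are hypothetical. This version keeps all seven indicator columns, whereas a strict dummy coding would drop one as the reference level.

```python
import datetime


def day_of_week_dummies(departure_date):
    """Encode the weekday as seven 0/1 indicators, Monday first."""
    dummies = [0] * 7
    dummies[departure_date.weekday()] = 1   # weekday(): Monday is 0
    return dummies


DAY_FEATURE_NAMES = ["departure_%s" % day for day in
                     ("mon", "tue", "wed", "thu", "fri", "sat", "sun")]

# 2016-05-30 was a Monday.
print(day_of_week_dummies(datetime.date(2016, 5, 30)))   # [1, 0, 0, 0, 0, 0, 0]
```

A custom handler could return `DAY_FEATURE_NAMES` from customTrainingFeaturesNames and the indicator list as the matching extra features.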
	
Re-train the models
	
Re-compute accuracy
Models 1
Models 2
Better accuracy for NaiveBayes and Logistic Regression
Worse for DecisionTree and RandomForest
	
Deploy and Run the models
In the last section, we will simulate deploying and running the models through the notebook by calling APIs from the run package.
[Flow diagram repeated, see the "Flow Diagram" slide]
	
Run the predictive model
	
runModel API
	
Get Weather Predictions
	
Show prediction results
	
Resource
•  https://developer.ibm.com/clouddataservices/
•  https://github.com/ibm-cds-labs/simple-data-pipe
•  https://github.com/ibm-cds-labs/pipes-connector-flightstats
•  http://spark.apache.org/docs/latest/mllib-guide.html
•  https://console.ng.bluemix.net/data/analytics/
	
Thank You
