Introduction To Data Science
Building Recommender Systems
Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
201301
Introduction
Chapter 1
Course Chapters
- Introduction
- Data Science Overview
- Use Cases
- Project Lifecycle
- Data Acquisition
- Evaluating Input Data
- Data Transformation
- Data Analysis and Statistical Methods
- Fundamentals of Machine Learning
- Recommender Overview
- Introduction to Apache Mahout
- Implementing Recommenders with Apache Mahout
- Experimentation and Evaluation
- Production Deployment and Beyond
- Conclusion
Chapter Topics
Introduction
- About this course
- About Cloudera
- Course logistics
- Introductions
Course Objectives
After successfully completing this course, you will be able to:
- Describe the role and responsibilities of a data scientist
- Explain several ways in which data scientists create value for their organizations, using several industries as examples
- Locate and acquire data from diverse sources
- Use transformation and normalization techniques
- Determine the most appropriate type of analysis to perform for a given problem
- Implement an automated recommendation system
- Develop, evaluate, and refine scoring systems for recommenders
- Understand important considerations for working at scale
- Identify meaningful, actionable, and business-oriented results from the analysis performed
About Cloudera
- Cloudera is "The commercial Hadoop company"
- Founded by leading experts on Hadoop from Facebook, Google, Oracle, and Yahoo
- Provides consulting and training services for Hadoop users
- Staff includes committers to virtually all Hadoop projects
Cloudera Software
- CDH
  - A set of easy-to-install packages built from the Apache Hadoop core repository, integrated with several additional open source Hadoop ecosystem projects
  - Includes a stable version of Hadoop, plus critical bug fixes and solid new features from the development version
  - 100% open source
- Cloudera Manager, Free Edition
  - The easiest way to deploy a Hadoop cluster
  - Automates installation of Hadoop software
  - Installation, monitoring, and configuration are performed from a central machine
  - Manages up to 50 nodes
  - Completely free
Cloudera Enterprise
- Cloudera Enterprise
  - Complete package of software and support
  - Built on top of CDH
  - Includes full version of Cloudera Manager
    - Install, manage, and maintain a cluster of any size
    - LDAP integration
  - Includes powerful cluster monitoring and auditing tools
    - Resource consumption tracking
    - Proactive health checks
    - Alerting
    - Configuration change audit trails
    - And more
  - 24 x 7 support
Cloudera Services
- Provides consultancy services to many key users of Hadoop
  - Including AOL Advertising, Samsung, Groupon, NAVTEQ, Trulia, Tynt, RapLeaf, Explorys Medical
- Solutions Architects are experts in Hadoop and related technologies
  - Several are committers to the Apache Hadoop project
- Provides training in key areas of Hadoop administration and development
  - Courses include System Administrator training, Developer training, Hive and Pig training, HBase training, Essentials for Managers
  - Custom course development available
  - Both public and on-site training available
Logistics
- Course start and end times
- Lunch
- Breaks
- Restrooms
- Wi-Fi access
- Virtual machines
- Can I come in early/stay late?
Introductions
- About your instructor
- About you
  - Where do you work and what do you do there?
  - Do you have a scientific or mathematical background?
  - What programming languages have you used?
  - Are you experienced with Apache Hadoop or related tools?
  - What do you expect to gain from this course?
Data Science Overview
Chapter 2
Data Science Overview
In this chapter you will learn
- What data science is
- Why data scientists are in demand
- What data products they help to create
- Which skills a successful data scientist must possess
Chapter Topics
Data Science Overview
- What is data science?
- The growing need for data science
- The role of a data scientist
- Review questions
- Essential points
- Conclusion
What is a Data Scientist?
- There's no single standard definition
- Let's look at how some in the industry describe the role

"They are half hacker, half analyst, they use data to build products and find insights."
Monica Rogati, Senior Data Scientist, LinkedIn

"A data scientist can find patterns in data that you haven't sent them to find."
Tom Wheeler, Senior Vice President, ClickFox

"Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician."
Josh Wills, Sr. Director of Data Science, Cloudera
Our Definition
Data science is a multidisciplinary field that combines skills in software engineering and statistics with domain experience to support the end-to-end analysis of large and diverse data sets, ultimately uncovering value for an organization and then communicating it to stakeholders as actionable results.
Data Products
- One frequent goal of data science is to create data products
- Data products have three important characteristics
  - They are built from data
  - The very act of using them generates new data
  - This new data can be used to improve them
Data Product Examples: Google
- Google wasn't the first Web search engine when it launched
  - Yahoo!, Lycos, AltaVista, and Excite were market leaders
  - But search results were cluttered and often irrelevant
- But Google had an exceptional data product
  - The PageRank algorithm produced highly relevant results
  - PageRank allowed Google to quickly dominate the market
- Google extended this success with additional data products
  - AdSense (targeted ads for Web properties)
  - AdWords (targeted ads for search result pages)
  - Google Analytics (service for analyzing Web site traffic)

Data Product Examples: LinkedIn
- LinkedIn is a popular social network for business
- It was instrumental in the emergence of data science
- The company offers several data products
  - LinkedIn Network Updates (e-mail news summary of your connections)
  - People You May Know (list of potential connections)
  - InMaps (visualization of your connections)

Data Product Examples: Amazon
- Amazon popularized the use of product recommendations
  - Based on collaborative filtering, as we'll see later
- They then added variations on this to offer even more insight
  - "What items do customers buy after viewing this item?"
  - "Customers who bought this item also bought"
- Other data products from Amazon include
  - Find visually similar items
  - Statistically Improbable Phrases
  - Product Advertising API

Data Product Examples: eBay
- eBay offers data products to users, marketers, and developers
  - Intelligent spelling correction for auction titles
  - "Sorry you didn't win, here are auctions for similar items"
  - eBay Market Data
  - Milo Open API (real-time local product availability)
Why is Data Science in Demand?
- The term "data scientist" has only recently become popular
- Is this a new field or just newly important?
  - Data science is the intersection of several fields
  - Each of those fields has existed for years
  - The specific combination of them has recently become important
- Why is this combination important now?
- Data drives the demand for data scientists
  - We're generating more data than ever
  - We're generating new data faster

The Data Deluge
- We're generating more data than ever
  - Financial transactions
  - Sensor networks
  - Application logs
  - E-mail and text messages
  - Social media
- And we're generating new data faster than ever
  - Automation
  - Ubiquitous Internet connectivity
  - User-generated content
- For example, every day
  - Twitter processes 340 million messages
  - Trend Micro processes six TB of data to identify new security threats
  - Facebook users generate 2.7 billion comments and Likes
  - The Large Hadron Collider produces more than 68 TB of data
Cost of Large-Scale Storage
- Fortunately, the size and cost of storage have kept pace
  - Capacity has increased while price has decreased

Year    Capacity (GB)    Price per GB
1992             0.08       $3,827.20
1997             2.10         $157.00
2002            80.00           $3.74
2007           750.00           $0.35
2012         3,000.00           $0.05

- We can now afford to retain what we previously discarded
  - Although storage is important, only analysis will yield value
  - But what value can we produce with all of this data?
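The trend in the storage table can be checked with a few lines of Python. This is only a sketch: it reads the dollar column as a price per gigabyte, which is an assumption (it is the reading under which each row works out to a drive costing a few hundred dollars), and the helper names are made up for illustration.

```python
# Rows copied from the storage table: (year, capacity in GB, price in USD).
# Assumption: the price column is a per-gigabyte figure.
STORAGE = [
    (1992, 0.08, 3827.20),
    (1997, 2.10, 157.00),
    (2002, 80.00, 3.74),
    (2007, 750.00, 0.35),
    (2012, 3000.00, 0.05),
]


def implied_drive_price(capacity_gb, price_per_gb):
    """Approximate cost of one drive of the given capacity."""
    return capacity_gb * price_per_gb


def fold_change(first, last):
    """How many times larger `last` is than `first`."""
    return last / first


if __name__ == "__main__":
    for year, capacity, price in STORAGE:
        drive = implied_drive_price(capacity, price)
        print(f"{year}: {capacity:>8.2f} GB, about ${drive:,.2f} per drive")
    growth = fold_change(STORAGE[0][1], STORAGE[-1][1])
    drop = fold_change(STORAGE[-1][2], STORAGE[0][2])
    print(f"capacity grew about {growth:,.0f}x")
    print(f"price per GB fell about {drop:,.0f}x")
```

Under that assumption, the price of a typical drive stays in the same rough range across the twenty years while its capacity grows by several orders of magnitude, which is why retaining previously discarded data became affordable.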
Value of Large-Scale Analysis
- This data has many valuable applications, including
  - Product recommendations
  - Marketing analysis
  - Demand forecasting
  - Fraud detection
- The more data we have, the more valuable our applications become
- Moving from data collection to data product requires many skills
Data Science Skills
- Data scientists have an unusual combination of skills
- An effective data scientist will possess
  - An analytical mindset
  - A broad mathematical background
  - Database systems experience
  - Skill with software engineering, particularly at scale
  - Ability to communicate effectively with both business and technical audiences
  - Expertise in a particular industry
- How do these map to longstanding roles in an organization?
Data Scientists Are Found at the Confluence
[Figure: the Data Scientist role shown at the overlap of three established roles: Statistician, Software Engineer, and Business Analyst. Skill labels in the diagram: mathematical background, database systems experience, analytical mindset, software development, distributed computing, process automation, industry expertise, business focus, effective communication.]
How Does a Data Scientist Differ
- From a software engineer?
  - Data scientists don't develop general-purpose software
  - They develop software to help them solve problems
  - Data scientists tend to focus more on scripting and automation
- From a data analyst or statistician?
  - These roles usually rely on curated data and predefined tools
  - If a tool is lacking or data is incomplete, questions are left unanswered
  - Data scientists instead collect more data and write new tools
  - They also typically work with huge, disparate, and dirty data
- From a business analyst?
  - Both have a business focus, so they may ask similar questions
  - A data scientist has the technical skills to find answers without help
Data Scientists Within Organizations
- "Data Scientist" is a relatively new term
  - First used in a National Science Board publication in 2005
  - Popularized after teams were created at Facebook and LinkedIn
- The name given to this field often varies by industry
  - Bioinformatics
  - Computational social science
  - Research or statistical science
  - Predictive modeling
- Analysts have traditionally been internal consultants
  - Data scientists often serve a similar role
- Some organizations use data scientists for product development
Review Questions
- How would you define data science?
- What other examples of data products have you seen?
- Imagine that you work for a university. What data products might you create?
Essential Points
- Data science is the combination of several skills, including mathematics, software engineering, communications, and domain experience
- Data products are built from data and produce new data as they're used, allowing them to be further improved
- Defining differences between data analysts and data scientists are the latter's ability to work with massive data sets and to develop new tools to collect, transform, and analyze data
Data Science Overview
In this chapter you have learned
- What data science is
- Why data scientists are in demand
- What data products they help to create
- Which skills a successful data scientist must possess
Bibliography
The following offer more information on topics discussed in this chapter
- The Rise of the Data Scientist
  http://tiny.cloudera.com/dscc02a
- Building Data Science Teams
  http://tiny.cloudera.com/dscc02b
- Data Scientist: The Sexiest Job of the 21st Century
  http://tiny.cloudera.com/dscc02c
- Quora: How do I become a data scientist?
  http://tiny.cloudera.com/dscc02d
- Seismic Data Science
  http://tiny.cloudera.com/dscc02e
Use Cases
Chapter 3
Use Cases
In this chapter you will learn
- How data science is being applied in different industries
- How data science is used to increase revenue
- How data science is used to reduce costs
- How data science is used to save lives
Chapter Topics
Use Cases
- Finance
- Retail
- Advertising
- Defense and Intelligence
- Telecommunications and Utilities
- Healthcare and Pharmaceuticals
- Review questions
- Essential points
- Hands-On Exercise: Parsing Log Data with Python
- Conclusion
Finance

Finance: Fraud Detection
- There are many kinds of financial fraud, including
  - Securities trading
  - Insurance
  - Credit card
- Credit card fraud detection is perhaps most widely known
  - "Please confirm unusual activity we've detected on your account"
- The specific techniques differ and are proprietary
  - They usually have one thing in common: access to lots of data
- Solution overview
  - Log data from multiple sources to HDFS, perhaps via Flume
  - Use machine learning to classify typical customer behavior
  - And to identify deviant behavior worthy of further investigation
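The idea of separating typical from deviant behavior can be illustrated with a deliberately simple stand-in. The slides note that real techniques are proprietary and rely on machine learning, so the sketch below is not one of them: it only flags amounts that lie far from a customer's historical mean, measured in standard deviations. The function name, the threshold, and the sample amounts are all assumptions made for illustration.

```python
from statistics import mean, stdev


def flag_unusual(history, new_amounts, threshold=3.0):
    """Return the new amounts lying more than `threshold` standard
    deviations from the mean of the customer's past amounts.

    `history` needs at least two values so a deviation can be computed.
    """
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Every past amount was identical: anything different is unusual.
        return [amount for amount in new_amounts if amount != mu]
    return [a for a in new_amounts if abs(a - mu) / sigma > threshold]


if __name__ == "__main__":
    # Made-up purchase amounts for one hypothetical customer.
    history = [12.50, 40.00, 23.75, 18.20, 35.10, 27.00, 22.40, 31.60]
    print(flag_unusual(history, [29.99, 950.00]))
```

A production system would look at many more signals than the amount (merchant, location, time of day) and would learn the boundary from labeled data rather than use a fixed threshold.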
Finance: Customer Risk Analysis
- It's more profitable to price according to risk than to deny a sale
  - Higher-risk customers pay more for products
  - Examples include auto loans, credit cards, and insurance policies
- An accurate risk assessment relies on disparate data
  - And lots of it!
  - Past purchases, payment history, clickstream data, call logs, etc.
- Solution overview
  - Ingest data from many sources into Hadoop storage (HDFS)
  - Correct data errors and use consistent representation of attributes
  - Merge data into a single view of the customer
  - Create initial risk model based on historic data
  - Continually refine the model based on actual risk encountered
Retail

Product Recommendations
- Many sites use collaborative filtering (CF) to suggest products
  - Increases sales, attracts customers, and improves satisfaction
  - Well-known examples include Amazon, Netflix, and Last.fm
- We are going to study this in depth during the course
  - Collaborative filtering has many practical applications
- Solution overview
  - Capture customer activity and preference data
  - Find other customers with similar preferences
  - Determine which products these customers rated highly
  - Recommend these products to the customer
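The four steps of the solution overview can be sketched in plain Python. This is a minimal user-based collaborative filter for illustration only, not the Apache Mahout implementation covered later in the course; the cosine similarity measure, the function names, and the sample ratings are assumptions.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity of two {item: rating} dicts (unrated items count as 0)."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[item] * b[item] for item in shared)
    norm_a = sqrt(sum(value * value for value in a.values()))
    norm_b = sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b)


def recommend(ratings, customer, k=2, top_n=3):
    """Suggest items the customer has not rated, scored by similar customers."""
    mine = ratings[customer]
    # Step 2: find the k other customers with the most similar preferences.
    neighbours = sorted(
        ((cosine_similarity(mine, theirs), name)
         for name, theirs in ratings.items() if name != customer),
        reverse=True,
    )[:k]
    # Step 3: accumulate similarity-weighted ratings for items not yet rated.
    scores = {}
    for similarity, name in neighbours:
        for item, rating in ratings[name].items():
            if item not in mine:
                scores[item] = scores.get(item, 0.0) + similarity * rating
    # Step 4: recommend the highest-scoring items.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


if __name__ == "__main__":
    # Step 1: captured preference data (made-up ratings on a 1-5 scale).
    ratings = {
        "ann": {"book": 5, "lamp": 4},
        "bob": {"book": 5, "lamp": 5, "desk": 4},
        "cat": {"book": 1, "tent": 5},
    }
    print(recommend(ratings, "ann", k=1))
```

Comparing every customer with every other customer does not scale to large data sets, which is one reason the course turns to Hadoop-based tooling for the real implementation.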
Predict and Incentivize Purchases
- Instead of simply recommending products, motivate purchases
  - Retailers can increase sales using highly targeted offers
- Could use a variety of historic data for customers, including
  - Products customer has purchased
  - Which products customer later returned or exchanged
- Solution overview
  - Purchase history is an implicit source of preference data
    - Buying something is an "upvote"
    - Returning or exchanging something is a "downvote"
  - Use CF to predict future purchases based on similar customers
  - Offer customer coupons to incentivize these purchases
  - Use redemption data to improve future predictions
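Turning purchase history into implicit preference data might look like the sketch below. The slides only say that a purchase is an upvote and a return or exchange is a downvote; the numeric weights, the event names, and the function name here are assumptions chosen for illustration.

```python
from collections import defaultdict

# Assumed weights. A return follows a purchase, so it is weighted -2
# to leave the net preference for that product negative.
EVENT_WEIGHTS = {"purchase": 1, "return": -2, "exchange": -2}


def implicit_preferences(events):
    """Fold (customer, product, event) tuples into {customer: {product: score}}."""
    prefs = defaultdict(lambda: defaultdict(int))
    for customer, product, event in events:
        prefs[customer][product] += EVENT_WEIGHTS[event]
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {customer: dict(items) for customer, items in prefs.items()}


if __name__ == "__main__":
    # Made-up transaction log.
    events = [
        ("c1", "boots", "purchase"),
        ("c1", "tent", "purchase"),
        ("c1", "tent", "return"),
        ("c2", "boots", "purchase"),
    ]
    print(implicit_preferences(events))
```

The resulting scores have the same shape as explicit ratings, so they can be passed to a collaborative filtering step to predict which products similar customers are likely to buy next.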
Advertising

Serving More Effective Web Ads
- Advertising is a significant source of revenue for Web properties
  - Pay-per-click is much more valuable than pay-per-impression
  - You can therefore boost revenue by increasing clickthrough rate
- Machine learning techniques are common for ad click prediction
- Data typically analyzed includes
  - Relevance of ad to the search query or site visited
  - Overall historical effectiveness of a given ad
  - Time of day and day of the week
- However, all of these factors are independent of the user
  - They essentially predict what an average person would click
  - Collaborative filtering can help predict a specific person's clicks
  - This technique increased Yahoo's prediction accuracy by more than 10%
Defense and Intelligence

Forecasting Insurgent Attacks
- Predicting insurgent attacks is extremely difficult
  - Insurgents tend to be loosely organized and informal
  - They blend in with the local non-combatant population
  - Detecting their movements and activities is a challenge
- Past attack data can reveal patterns that predict future events
- A team of researchers demonstrated how effective this can be
  - Analyzed Afghan War Diary documents from WikiLeaks
  - Used data from 2004-2009 events to train a computer model
  - Forward prediction of 2010 events was surprisingly accurate
    - Expected a 128% increase in events for Baghlan province
    - A 120% increase was actually observed in the 2010 data
- Could be used to provide decision support for mission planning
Telecommunications and Utilities

Predicting Equipment Failure
- Aging power, water, and telecom infrastructure in many locations
  - Too expensive to replace components that still work
  - A failure can be even more expensive
- Solution overview
  - Install sensors to gather data (vibration, temperature, etc.)
  - Store this data in Hadoop (using Flume)
  - Analyze to determine leading indicators of failure
  - Use machine learning to classify components likely to fail soon
  - Replace with new components proactively
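The classification step can be illustrated with one of the simplest possible classifiers, a nearest-centroid model over two sensor features. This is a sketch under stated assumptions, not the method any utility actually uses: the features (vibration and temperature), the labels, the readings, and the function names are all invented for the example.

```python
from math import dist


def centroid(points):
    """Mean point of a list of equal-length numeric tuples."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def train(labelled_readings):
    """Map each label to the centroid of its historical sensor readings."""
    by_label = {}
    for reading, label in labelled_readings:
        by_label.setdefault(label, []).append(reading)
    return {label: centroid(points) for label, points in by_label.items()}


def classify(model, reading):
    """Return the label whose centroid is closest to the new reading."""
    return min(model, key=lambda label: dist(model[label], reading))


if __name__ == "__main__":
    # Made-up history: (vibration in mm/s, temperature in degrees C), label.
    history = [
        ((1.0, 40.0), "healthy"),
        ((1.2, 42.0), "healthy"),
        ((0.9, 39.0), "healthy"),
        ((4.8, 71.0), "failing"),
        ((5.5, 75.0), "failing"),
        ((5.1, 69.0), "failing"),
    ]
    model = train(history)
    print(classify(model, (4.9, 68.0)))
```

The features are left unscaled here, so temperature dominates the distance; real sensor data would normally be normalized first, a topic the Data Transformation chapter addresses.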
Healthcare and Pharmaceuticals

More Accurate Medication Reconciliation
- A physician needs to know what medicine the patient uses
  - Drugs can negate one another or cause deadly interactions
- Unfortunately, this data is not always accurate
  - Some patients may not recall all medications when asked
  - Others might intentionally omit some (e.g., illegal drugs)
  - A 2007 study documented discrepancies in 80.4% of patients
- Overall, quite similar to the product recommendation use case
  - Recommenders can be agnostic to what's being evaluated
    - Customer = patient
    - Product = medication
  - Result is a list of medications that similar patients report using
  - Physician can then specifically ask about / test for use of these
Review Questions
- What are some common themes you've observed in these use cases?
- Imagine you've been hired by an airline to retain its most profitable customers
  - What questions would you want to be able to answer?
  - What data would you need to do this?
Essential Points
- Data science is applicable to many industries
  - Although industries may differ, techniques are often quite similar
- Key themes include
  - Access to large amounts of data
  - Acquisition of several types of data from different sources
  - The ability to analyze this data at scale to find interesting patterns
  - Problem solving focuses on a specific and actionable result
Hands-On Exercise: Parsing Log Data with Python
- In this Hands-On Exercise, you will learn the basics of Python and the command-line utility for accessing Hadoop's Distributed Filesystem (HDFS)
  - You'll use regular expressions in Python to parse a Web server log file, an important source of information for data science projects
  - Please refer to the Hands-On Exercise Manual for instructions on exercise #0
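As a preview of the kind of parsing the exercise involves, here is a minimal sketch using Python's `re` module. It assumes the log uses the Common Log Format written by many Web servers; the actual format, field names, and sample line used in the Hands-On Exercise Manual may differ, so treat the pattern and the sample data as hypothetical.

```python
import re

# Assumed layout (Common Log Format):
#   host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # A size of "-" means no response body was sent.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields


if __name__ == "__main__":
    # Made-up log line for illustration.
    sample = ('192.0.2.10 - - [15/Jan/2013:10:15:32 -0800] '
              '"GET /products/42 HTTP/1.1" 200 5120')
    print(parse_line(sample))
```

Named groups keep the pattern readable and give each field a key in the resulting dictionary. Returning `None` for lines that do not match makes it easy to count or skip malformed entries, which are common in real log data.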
Use Cases
In this chapter you have learned
- How data science is being applied in different industries
- How data science is used to increase revenue
- How data science is used to reduce costs
- How data science is used to save lives
Bibliography
The following offer more information on topics discussed in this chapter
- Fraudsters, Outliers, and Big Data
  http://tiny.cloudera.com/dscc03a
- Learning Causality for News Event Prediction
  http://tiny.cloudera.com/dscc03b
- Personalized Prediction for Sponsored Search
  http://tiny.cloudera.com/dscc03c
- Mining Event Logs in Large Scale Systems
  http://tiny.cloudera.com/dscc03d
- Medication Reconciliation
  http://tiny.cloudera.com/dscc03e
Project Lifecycle
Chapter 4

Project Lifecycle
In this chapter you will learn
- How a data scientist approaches a problem
- Which steps comprise the lifecycle of a data science problem
- About the data science problem we'll address through hands-on exercises

Chapter Topics
Project Lifecycle
- Steps in the project lifecycle
- Hands-On Exercises: scenario explanation
- Review questions
- Essential points
- Conclusion

Overview of the Project Lifecycle
- A typical data science project should follow these steps
  - Define a problem
  - Identify the desired outcome
  - Determine which data is needed
  - Evaluate possible solutions
  - Measure effectiveness
  - Make improvements
  - Communicate results
- We'll discuss each of these in a moment

The Entire Project Lifecycle is Iterative
[Figure: cycle diagram of the lifecycle steps: Define a Problem, Identify Desired Outcome, Determine Which Data is Required, Evaluate Possible Solutions, Measure Effectiveness, Make Improvements, Communicate Results]

Parts of the Project Lifecycle are Also Iterative
[Figure: diagram of the lifecycle steps: Define a Problem, Identify Desired Outcome, Determine Which Data is Required, Evaluate Possible Solutions, Measure Effectiveness, Make Improvements, Communicate Results]

Scale Affects the Solution
- Scale is central to the iterative approach
  - Validate an approach with a small sample of data
  - Test implementation with a moderate amount
  - Use a large amount of data to refine a proven solution
- Implementations and even algorithms themselves may hit limits
- A given algorithm may work fine on 25 MB of data
  - But may require changes in implementation to work with 25 GB
  - And may be impossible to scale to handle 25 TB
- Each iteration may involve a change to the amount of data used
  - And to the code you use to analyze it

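One way to picture why scale changes the implementation: a small sample can be held in memory, while a larger input is usually processed one record at a time. The sketch below is illustrative only. The `status_counts` helper and the simplified line layout are assumptions made for this example, not part of the course material.

```python
from collections import Counter
from typing import Iterable

def status_counts(lines: Iterable[str]) -> Counter:
    """Count the status field (second-to-last token) one line at a time."""
    counts = Counter()
    for line in lines:  # works for a list or for an open file handle
        fields = line.split()
        if len(fields) >= 2:
            counts[fields[-2]] += 1
    return counts

# Made-up sample; with real data, pass an open file object instead of a list
sample = [
    "GET /index.html 200 5120",
    "GET /missing.html 404 312",
    "GET /products.html 200 8472",
]
counts = status_counts(sample)
print(counts.most_common())
```

Because the function only keeps one counter per distinct status value, its memory use depends on the number of distinct keys rather than on the size of the input. Past the size a single machine can read in reasonable time, the same logic would need a distributed implementation.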
Defining a Problem
- The process begins by clearly stating a problem
- This is often directly related to revenue
  - People browse our site but don't buy anything
  - Too many customers abandon their shopping carts
  - Subscribers aren't renewing their service
  - Sponsors are donating less than ever before
- In other cases, the problem is related to costs
  - Our employees spend too much time searching for documents
  - An increase in support calls cost us $400,000 last year

Defining a Problem (cont'd)
- Sometimes the approach is more exploratory
  - What can I learn about our users from this clickstream data?
  - Why are customers abandoning purchases before checkout?
  - How many more finish checkout when offered free shipping?
  - Would offering free shipping on all orders increase profits?

Identifying the Desired Outcome
- Given the problem, what's the preferred resolution?
- Again, these are often tied to revenue or costs
  - Increase subscription renewals by 5% within two months
  - Decrease shopping cart abandonment by 10% in Q3
  - Reduce support call volume by 25% within one year
- Be careful what you wish for
  - Problem: An increase in support calls cost $400,000 last year
  - Goal: Reduce support call volume by 25% within one year
  - A reduction in support calls may not mean fewer problems
  - Could indicate customer frustration due to poor support

Determining Which Data is Needed
- What data must you capture to solve the problem you've defined?
  - And to determine if your solution meets the goals identified
- Further refinements may require additional data
- Consider the source, format, and quality of this data
  - Does your organization have everything you need?
  - If not, is it available from external sources?

Evaluate Possible Solutions
- Consider all solutions that could achieve the desired outcome
- This typically involves a hypothesis about the root cause
  - What prompted the recent increase in support calls?
  - Why are customers abandoning their carts?
  - What causes customers to not renew subscriptions?
- Given several possible solutions, which should you invest in?
  - Most can be discounted quickly
  - Small-scale tests can help you choose
- The simplest solution is usually the best one to pursue first

Measuring Effectiveness
- Measuring effectiveness requires two things
  - Metrics: properties to measure
  - Method: a process for comparing these metrics
- Key point: have you achieved the desired outcome?
  - Focus on measuring what really matters

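As a rough illustration of the metric-plus-method idea, the sketch below treats checkout completion rate as the metric and relative change against a baseline as the method. All counts and the 10% goal are invented for the example, and `rate` and `relative_change` are hypothetical helper names.

```python
def rate(successes: int, total: int) -> float:
    """Metric: the fraction of sessions that ended in a completed checkout."""
    return successes / total

def relative_change(before: float, after: float) -> float:
    """Method: change in the metric relative to its starting value."""
    return (after - before) / before

# Hypothetical counts, invented purely for illustration
baseline = rate(successes=300, total=10_000)
with_offer = rate(successes=360, total=10_000)

change = relative_change(baseline, with_offer)
goal = 0.10  # assumed desired outcome: at least a 10% relative improvement
print(f"relative change: {change:.1%}, goal met: {change >= goal}")
```

Stating the goal as a number before measuring keeps the comparison honest: the result either meets the desired outcome or it does not.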
Making Improvements
- Measurement will illustrate how much improvement is required
- Consider what you might change
  - Was your hypothesis about the root cause correct?
  - Could adding an additional data set give more insight?
  - Should you try one of the solutions you originally discarded?
- Once you've implemented your improvements, test them again
  - Comparing measurements can help to refine your solution

Communicating Results
- Communication is an essential part of data science
- A data scientist must tell the story found within the data
  - Like any good story, it must be compelling
  - Be concise and focus on what's important for the audience
- Dashboards are a common tool for communicating results
  - Statistics
  - Summaries
  - Visualizations
[Figure: example dashboard at https://dashboard.example.com/ showing "Revenue by Period"]

Hands-On Exercises: Scenario Explanation
- We'll practice what we've learned with hands-on exercises
  - We will work through a complete data science problem during the course
- Cloudera Movies is a successful online movie streaming service
  - Our 14 million customers pay a monthly subscription fee
- Unfortunately, our success has started to wane
  - Revenues dropped last quarter by 11%
  - Projections show revenues decreasing this quarter by 17%

Hands-On Exercises: Scenario Explanation (cont'd)
- Why are revenues down?
  - Existing customers aren't renewing their subscriptions
  - Prospective customers aren't joining our service
- Customers were surveyed when they called to cancel
  - Reason cited by 79% of customers
  - "You have lots of movies, just none that interest me."
- Cloudera Movies has hired you to help solve this problem

Hands-On Exercises: Scenario Explanation (cont'd)
- Problem Definition
  - Many customers are choosing not to renew their subscriptions
- Desired Outcome
  - Decrease cancellation rate by 35% during next quarter
- Possible solutions
  - Decrease subscription cost (discard: price is not the problem)
  - Social media integration (discard: may violate privacy laws)
  - Improve movie recommendations (we'll pursue this one)

Hands-On Exercises: Overview
- There are many prerequisites to building a recommender system
  - Acquiring the input data from various sources
  - Identifying and correcting errors in the input data
  - Transforming the data into the desired format for analysis
- The work isn't finished even after you've built the recommender
  - You need to test it
  - You'll likely find ways to improve it
  - These may require additional data sources
  - The project lifecycle begins again
- Here's an overview of hands-on exercises you'll complete in this course

Hands-On Exercises: Overview (cont'd)
- Lab #0: Tool Basics
  - You'll become acquainted with the HDFS command line utility and Python, two tools we'll use extensively in later labs. You'll explore the use of Python's regular expressions for parsing a Web server log file, a valuable source of information for data scientists.
- Lab #1: Data Import
  - You'll acquire data from different sources that will be used to build our recommender system
- Lab #2: Evaluating Input Data
  - You'll identify errors in the input data and correct them using Hadoop MapReduce jobs written in Python
- Lab #3: Data Transformation
  - You'll use Apache Hive to filter and join input data from earlier labs into a single object that represents all information about each customer

Hands-On Exercises: Overview (cont'd)
- Lab #4: Data Analysis
  - You'll learn the basics of the R statistical language, then use it to analyze social media data that will later be used to test an improvement to our recommender
- Lab #5: Basic Recommender
  - You'll build a simple user-based recommender system in Python
- Lab #6: Building a Recommender with Mahout
  - You'll use Apache Mahout to build a scalable item-based recommender system using data you've acquired in earlier labs
- Lab #7: Analyzing Your Recommender
  - You'll evaluate data gathered from two versions of the recommender system in order to determine which is more effective

Review Questions
- Imagine that you're a data scientist working for an online brokerage firm. What metrics do you think would be important?
- Let's say that a major electronics retailer has hired you to decrease the bounce rate on their Web site. What kinds of data would be most helpful to analyze?

Essential Points
- The lifecycle of a data science project is iterative
- It's important to clearly state a problem
- The problem helps to establish the desired outcome
- Success is determined by measuring results

Project Lifecycle
In this chapter you have learned
- How a data scientist approaches a problem
- Which steps comprise the lifecycle of a data science problem
- About the data science problem we'll address through hands-on exercises

Data Acquisition
Chapter 5

Data Acquisition
In this chapter you will learn
- What types of data are used in analysis
- Where you can find these data sets
- What are some common methods of accessing this data

Chapter Topics
Data Acquisition
- Where to source data
- Acquisition techniques
- Review questions
- Essential points
- Hands-On Exercise: Acquiring Data
- Conclusion

Data Quality and Provenance
- These are important issues to consider prior to acquisition
- Establish the trustworthiness of the original source
  - Is the organization collecting the data reputable?
  - Can I trust that the data itself is accurate?
- If acquired through a third party, determine lineage
  - Did they get the data directly from the original source?
  - Was the data modified from the original source?
- Repeatable results matter
  - Use a source control system for your code
  - Keep track of all changes to your data

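One lightweight way to keep track of changes to data is to store a small provenance record next to each file. The sketch below is one possible approach rather than a prescribed one. The `provenance_record` helper, its field names, and the source label are assumptions made for illustration.

```python
import hashlib
import json
import os
import tempfile
from datetime import datetime, timezone

def provenance_record(path: str, source: str) -> dict:
    """Describe a data file: where it came from and a SHA-256 of its bytes."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):  # 1 MB chunks
            digest.update(chunk)
    return {
        "file": os.path.basename(path),
        "source": source,  # hypothetical label for the origin of the data
        "sha256": digest.hexdigest(),
        "bytes": os.path.getsize(path),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }

# Demonstration with a throwaway file
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, newline=""
) as tmp:
    tmp.write("id,zipcode\n1,90210\n")
record = provenance_record(tmp.name, source="internal export (example)")
os.remove(tmp.name)
print(json.dumps(record, indent=2))
```

If the checksum of a file later differs from the recorded one, the data was modified after acquisition, which is exactly the lineage question raised above.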
Internal Data Sources
- Most valuable information comes from your own organization
- There are many sources of data available internally
  - Application databases (OLTP)
  - Data warehouses (OLAP)
  - Log files (Web, e-mail, and other applications)
  - Documents (file servers, intranet, Web site)
  - Sensors and network events

Freely Available Data Sources
- External data is often used to augment a solution
  - Geolocation for IP addresses in Web server logs
  - Demographic information about those locations
- There are many sources of data available at no cost
  - Some are public domain and some are copyrighted
  - Be sure to check the license to verify that your use is allowed

Freely Available Data Sources (cont'd)
- U.S. Census Bureau: http://factfinder2.census.gov/
- U.S. Executive Branch: http://www.data.gov/
- U.K. Government: http://data.gov.uk/
- E.U. Government: http://publicdata.eu/
- The World Bank: http://data.worldbank.org/
- Freebase: http://www.freebase.com/
- Wikidata: http://meta.wikimedia.org/wiki/Wikidata
- Amazon Web Services: http://aws.amazon.com/datasets
- InfoChimps *: http://www.infochimps.com/marketplace
* Most data sets are available at no cost, but some have a fee

Commercial Data Sources
- Many companies also offer data
  - Usually for a fee, but sometimes available at no cost
  - Always be sure to check the license terms
- Gnip (Social Media): http://gnip.com/
- AC Nielsen (Media Usage): http://www.nielsen.com/
- Rapleaf (Demographic): http://www.rapleaf.com/
- ESRI (Geographic, GIS): http://www.esri.com/
- eBay (Auction): https://developer.ebay.com/
- D&B (Business Entity): http://www.dnb.com/
- Trulia (Real Estate): http://www.trulia.com/
- Standard & Poor's (Financial): http://standardandpoors.com/

Database Integration
- Data internal to an organization is often kept in a database
- To access small samples, just export a subset to a local file
  - Can do this programmatically or manually via query tool
  - Can also do this on command line, as shown below

    $ cat 10k_customers.sql
    select id, firstname, lastname, email, zipcode
      into outfile '/user/jsmith/cust10k.csv'
      fields terminated by ','
      optionally enclosed by '"'
      escaped by '\\'
      lines terminated by '\n'
      from customers limit 10000

  - Invocation details will vary depending on database used

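The programmatic route can be sketched in Python as well. The example below uses an in-memory SQLite database as a stand-in for the real customer database, so the table contents, the `export_sample` helper, and the row limit are all assumptions made for illustration. A real script would use the driver for the organization's actual database.

```python
import csv
import io
import sqlite3

# In-memory SQLite database standing in for the real customer database
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers "
    "(id INTEGER, firstname TEXT, lastname TEXT, email TEXT, zipcode TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Jane", "Smith", "jane@example.com", "90210"),
        (2, "John", "O'Neil, Jr.", "john@example.com", "10001"),
        (3, "Ana", "Lopez", "ana@example.com", "60601"),
    ],
)

def export_sample(connection, out, limit):
    """Write up to `limit` customer rows to `out` as CSV; return the count."""
    cursor = connection.execute(
        "SELECT id, firstname, lastname, email, zipcode "
        "FROM customers ORDER BY id LIMIT ?",
        (limit,),
    )
    rows = cursor.fetchall()
    csv.writer(out).writerows(rows)  # quotes fields that contain commas
    return len(rows)

buffer = io.StringIO()  # a real script would open a local file instead
exported = export_sample(conn, buffer, limit=2)
print(buffer.getvalue())
```

Letting the `csv` module do the quoting plays the same role as the `optionally enclosed by` clause in the SQL version: a field that contains the delimiter remains a single field when the file is read back.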
Database Integration (cont'd)
- This approach isn't appropriate for large data sets
  - HDFS (Hadoop) is a better choice once you reach terabyte range
  - HDFS is scalable, resilient, and offers high performance I/O
- Sqoop exchanges data between an RDBMS and HDFS
  - Import all tables, a single table, or a partial table into HDFS
  - Data can be imported in delimited text or Avro file format
  - Sqoop can also export data from HDFS to a database
- Sqoop is compatible with almost any database
  - Some also support high-performance custom connectors
- Sqoop supports incremental imports
  - Can bring all existing data into HDFS during initial import
  - Then import just new records in subsequent imports

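The idea behind an incremental import can be shown without Sqoop itself. The sketch below is not Sqoop's implementation. It only illustrates the general pattern of remembering the highest key already imported and fetching rows beyond it, using an in-memory SQLite table and a hypothetical `fetch_new_rows` helper.

```python
import sqlite3

def fetch_new_rows(connection, last_id):
    """Return rows whose id is greater than the last id already imported."""
    cursor = connection.execute(
        "SELECT id, email FROM customers WHERE id > ? ORDER BY id", (last_id,)
    )
    return cursor.fetchall()

conn = sqlite3.connect(":memory:")  # stand-in for the source RDBMS
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "a@example.com"), (2, "b@example.com")],
)

# Initial import: everything that exists so far
first_batch = fetch_new_rows(conn, last_id=0)
last_seen = first_batch[-1][0]

# New records arrive in the source system
conn.execute("INSERT INTO customers VALUES (3, 'c@example.com')")

# Subsequent import: only rows added since the previous run
second_batch = fetch_new_rows(conn, last_id=last_seen)
print(len(first_batch), second_batch)
```

This pattern assumes the key only ever increases and that existing rows are not updated in place. Rows that change after import need a different strategy, such as comparing a last-modified timestamp.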
Other Internal Sources
- Systems that produce data in the form of files are easily handled
  - For a few small files, just copy them to a local filesystem
  - Larger file sets should be copied to HDFS instead

    $ hadoop fs -put myfile.txt /bigdata/project/myfile.txt

  - This can be done manually or as part of a script
  - HDFS also supports a REST API through WebHDFS
- Alternative: Use Flume to add data to HDFS as it's generated
  - Can tail log files to capture lines as soon as they're written
  - Other sources: program execution, network port, and syslog
  - Write custom sources to integrate with legacy systems

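WebHDFS addresses files with URLs of the form `/webhdfs/v1/<path>?op=<OPERATION>`. The sketch below only builds such a URL and makes no network call. The host name, port, and user name are placeholders, and `webhdfs_url` is a hypothetical helper, so check the values against the actual cluster configuration.

```python
from urllib.parse import quote, urlencode

def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS REST URL of the form /webhdfs/v1/<path>?op=<OP>."""
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"

# Host name, port, and user are placeholders for this example
url = webhdfs_url(
    "namenode.example.com",
    50070,
    "/bigdata/project/myfile.txt",
    "OPEN",
    **{"user.name": "jsmith"},
)
print(url)
```

A URL built this way can be handed to any HTTP client, which is what makes the REST interface convenient from languages that have no native HDFS library.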
Data Archive Downloads
- External data sources are sometimes in the form of archives
  - Delimited and fixed-width text files are the most common type
  - Usually compressed to save storage space and bandwidth
- These are usually hosted on Web or FTP sites
  - Downloading is easy with your browser for a small number of files
- How do you automate download of many files?
  - Use the curl or wget command line utilities

    $ wget -i list_of_urls.txt
    $ curl -O "http://www.example.com/xyz[001-999].zip"
    $ curl -O -u jsmith:mysecret ftp://ftp.example.com/archive/bigfile.gz
    $ wget --mirror http://www.example.com/data/ -P /home/jsmith

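When the file names follow a numeric pattern, the list of URLs can also be generated in a script. The following sketch expands a template into zero-padded URLs. The `numbered_urls` helper and the `{n}` placeholder are conventions invented for this example.

```python
def numbered_urls(template, first, last, width=3):
    """Expand a template like '.../xyz{n}.zip' into zero-padded URLs."""
    return [
        template.format(n=str(i).zfill(width)) for i in range(first, last + 1)
    ]

urls = numbered_urls("http://www.example.com/xyz{n}.zip", 1, 999)
print(urls[0], urls[-1], len(urls))
```

Writing the resulting list to a text file, one URL per line, gives a download list that can be passed to a tool that reads URLs from a file.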
Data APIs
- Many organizations offer data as services rather than downloads
  - Some APIs are read-only, while others support data modification
  - Authentication is often required (register for an ID to use in calls)
- Data APIs have several advantages over archive downloads
  - The service maintains the data and can keep it updated
  - Usually cross-platform and cross-language (REST or SOAP)
  - Price may be based on only what you use
- Access to data through APIs also has disadvantages
  - Price or terms of service may change
  - Your application's availability depends on service availability
- Data returned by an API is typically in XML or JSON format

Data API Example
- Here's an example request to the Twitter API using curl

    $ curl "https://api.twitter.com/1/users/show.json?screen_name=cloudera"

- Here's the JSON response (formatted and excerpted)

    {
      "id":16134540,
      "name":"Cloudera",
      "screen_name":"cloudera",
      "location":"Palo Alto, CA",
      "url":"http:\/\/www.cloudera.com\/",
      "description":"Cloudera is the leading provider of Apache Hadoop-based software and services.",
      "followers_count":11359,
      "created_at":"Thu Sep 04 20:10:22 +0000 2008",
      "time_zone":"Pacific Time (US & Canada)"
    }

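Once a response like the one above has arrived, turning it into native data structures takes a single call. The sketch below builds the same request URL from its parts and parses an excerpt of the response stored as a string. It makes no network call, and the variable names are chosen for the example only.

```python
import json
from urllib.parse import urlencode

# Build the request URL from a base address and a dictionary of parameters
base = "https://api.twitter.com/1/users/show.json"
request_url = f"{base}?{urlencode({'screen_name': 'cloudera'})}"

# Excerpt of the response, stored as a string for the example
response_text = """
{
  "id": 16134540,
  "name": "Cloudera",
  "screen_name": "cloudera",
  "location": "Palo Alto, CA",
  "followers_count": 11359,
  "created_at": "Thu Sep 04 20:10:22 +0000 2008"
}
"""

profile = json.loads(response_text)  # JSON object becomes a Python dict
print(request_url)
print(profile["name"], profile["followers_count"])
```

Numbers in the response arrive as Python integers and strings as `str`, so fields such as `followers_count` can be used in calculations without further conversion.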
Screen Scraping
- Sometimes the data is only available within a Web site
  - You don't have access to the database powering the site
  - You only have access to the rendered pages themselves
- You can acquire the data by screen scraping
  - Programmatic access and parsing of page content
  - Fragile: your script may break when page changes
  - Should be viewed as a last resort

    $ cat scraper.py
    # Print the text of every <h2> heading on a page
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen("http://www.example.com/").read()
    soup = BeautifulSoup(html, "html.parser")
    for heading in soup.find_all("h2"):
        print(heading.get_text())

Review Questions
- How do you currently track changes to your data?
- Where would you look for information on languages spoken by residents of a given ZIP code?
- Imagine that you work for an e-Commerce company
  - How would you fetch the first 1000 lines of log data from your Web servers each day?
  - How would this approach change as traffic increased and you wanted to analyze 5 TB of data each day?

Essential Points
- The most valuable information is found within your organization
- There's a variety of data available from external sources that can help augment your solution
- External data is usually accessed as an archive or via an API
  - Screen scraping is another option, but best avoided

Hands-On Exercise: Acquiring Data
- In this Hands-On Exercise, you will gain practice acquiring data using several of the methods we've discussed
  - The result will be examined, processed, and analyzed in upcoming labs
- Please refer to the Hands-On Exercise Manual for instructions on exercise #1

Data Acquisition
In this chapter you have learned
- What types of data are used in analysis
- Where you can find these data sets
- What are some common methods of accessing this data

Evaluating Input Data
Chapter 6

Evaluating Input Data
In this chapter you will learn
- Which file types are commonly used for input and output
- What are the advantages and disadvantages of these file types
- Several ways to examine data on the command line and at scale
- How sampling and filtering data can improve your processing
- What data quality problems are typical and how you can fix them

Chapter Topics
Evaluating Input Data
- Data formats
- Data quantity
- Data quality
- Review questions
- Essential points
- Hands-On Exercise: Evaluating Input Data
- Conclusion

Data Formats
- Data comes in many formats
  - Format: the structure and encoding used to represent information
- Format is primarily important at three points in the process
  1. Format of the data provided to you or collected by you
  2. Format used as input to the analysis
  3. Format produced as output from the analysis
- Some formats are better suited to certain uses than others
- Data is sometimes converted to other formats during processing
- Some formats map to a relational model more readily than others

Log Files
- Log files are generated by applications and devices
  - Examples: Web servers, mail servers, Hadoop, cell phones
- Data scientists view logs as a valuable source of information
  - Contain data that's too expensive to store in a transactional DB
  - Data is available immediately, with no need to wait for an ETL process
  - Log analysis does not require putting load on production system

    $ head -n1 httpd.log
    192.168.5.137 - - [17/Aug/2012:21:18:36 -0600] "GET /products/widget.jsp?sku=16879 HTTP/1.1" 200 8472 "http://www.example.com" "Mozilla/5.0 (Windows NT 5.1; rv:2.0) Gecko/20100101 Firefox/4.0" "uid=jsmith;usertype=Customer;region=midwest"
    $ cut -f1 -d' ' httpd.log | sort | uniq -c | sort -rn | head -n2
        283 192.168.5.137
         79 10.9.8.47

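The same per-address count can be sketched in Python with a regular expression. The pattern below covers only the leading fields of the log line shown above (address, timestamp, request, status, and size) and ignores the referrer, user agent, and cookie. The second and third sample lines are invented for the example.

```python
import re
from collections import Counter

# Named groups for the leading fields of each log line
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

lines = [
    '192.168.5.137 - - [17/Aug/2012:21:18:36 -0600] '
    '"GET /products/widget.jsp?sku=16879 HTTP/1.1" 200 8472',
    '10.9.8.47 - - [17/Aug/2012:21:18:40 -0600] '
    '"GET /index.jsp HTTP/1.1" 200 1024',
    '192.168.5.137 - - [17/Aug/2012:21:19:02 -0600] '
    '"GET /cart.jsp HTTP/1.1" 404 312',
]

requests_per_ip = Counter()
for line in lines:
    match = LOG_PATTERN.match(line)
    if match:  # skip lines that do not fit the expected layout
        requests_per_ip[match.group("ip")] += 1

print(requests_per_ip.most_common(2))
```

Compared with `cut`, the regular expression also exposes the other fields by name, so the same loop could just as easily count status codes or requested paths.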
Fixed-Width and Delimited Text Files
- Data is sometimes provided as fields in text files
  - Common for data exported from databases or spreadsheets
  - Typically one record per line
- Two main variants
  - Fixed-width: field starts at position M and occupies N characters
  - Delimited: fields separated by characters such as comma or tab
- CSV files can be deceptively difficult to parse
  - There is a specification, but few follow it exactly
  - Variations on quoting, embedded commas, missing fields, etc.

    $ cut -c10-14 fixedwidth.txt
    $ cut -f3,5 mydata.tsv
    $ cut -d, -f2 mydata.csv | sed -e 's/"//g'

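The quoting problem is easy to reproduce. The sketch below contrasts a naive split on commas with Python's `csv` module, then slices a fixed-width record by position. The sample data and the field layout are made up for the example.

```python
import csv
import io

# A quoted field with an embedded comma, as many spreadsheets export it
csv_text = 'id,name,city\n1,"Smith, Jane",Chicago\n'

naive = csv_text.splitlines()[1].split(",")      # splits inside the quotes
rows = list(csv.reader(io.StringIO(csv_text)))   # honors the quoting rules

# Fixed-width record built from padded fields (hypothetical layout)
record = "0001" + "Jane".ljust(10) + "Smith".ljust(10) + "60601"
layout = {"id": (0, 4), "first": (4, 10), "last": (14, 10), "zip": (24, 5)}
fields = {
    name: record[start:start + length].strip()
    for name, (start, length) in layout.items()
}

print(len(naive), rows[1])
print(fields)
```

The naive split yields four pieces for a three-field record, which is the kind of silent error that makes CSV harder to handle than it looks.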
XML and HTML
- Data is commonly made available in XML or HTML format
  - However, neither is an ideal format for analysis at scale
- XML is a self-describing hierarchical text format
  - XML is well-formed and can be validated for compliance
  - Verbose format consumes much storage and memory
- HTML is a closely related type used for Web pages
  - Likelier to deviate from spec and have less structure than XML
  - Content and formatting intertwined, especially in older documents

    $ head -n4 customers.xml
    <customer>
      <id>1234</id>
      <name type="display">Smith, Jane</name>
    </customer>
    $ perl -ne 'print m|<id>\s*(\d+)\s*</id>(\n)|;' < customers.xml

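For anything beyond a quick look, an XML parser is usually more robust than a regular expression. The sketch below uses Python's standard `xml.etree.ElementTree` module. The enclosing `<customers>` element and the second record are assumptions added so that the example is a complete document.

```python
import xml.etree.ElementTree as ET

# The <customers> wrapper and the second record are assumed for the example
xml_text = """
<customers>
  <customer>
    <id>1234</id>
    <name type="display">Smith, Jane</name>
  </customer>
  <customer>
    <id>5678</id>
    <name type="display">Lee, Sam</name>
  </customer>
</customers>
""".strip()

root = ET.fromstring(xml_text)
customers = [
    {"id": int(node.findtext("id")), "name": node.findtext("name")}
    for node in root.iter("customer")
]
print(customers)
```

Parsing the whole document into memory is fine for samples. For very large files, an incremental parser such as `ET.iterparse` avoids holding the full tree at once.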
JSON
- JSON is an alternative to XML
  - It offers many of the benefits, but with fewer drawbacks
  - Format is also hierarchical and self-describing
  - Much less verbose than XML
- Despite its JavaScript origins, it's supported by many languages

    {
      "id":573698,
      "name":"John Smith",
      "address":"123 Hadoop Drive",
      "zipcode":"90210",
      "email":"jsmith@example.com",
      "phone_numbers": [
        { "type":"mobile", "number":"(213) 555-1953" },
        { "type":"work", "number":"(310) 555-2752" },
        { "type":"home", "number":"(310) 555-7419" }
      ]
    }

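Nested JSON maps directly onto dictionaries and lists. The sketch below parses a shortened version of the record above and indexes the phone numbers by type. It also shows that a strict parser rejects a trailing comma before a closing brace, a detail worth checking when JSON is written by hand. The variable names are chosen for the example.

```python
import json

document = """
{
  "id": 573698,
  "name": "John Smith",
  "zipcode": "90210",
  "phone_numbers": [
    {"type": "mobile", "number": "(213) 555-1953"},
    {"type": "work", "number": "(310) 555-2752"}
  ]
}
"""

customer = json.loads(document)
numbers_by_type = {p["type"]: p["number"] for p in customer["phone_numbers"]}
print(numbers_by_type["mobile"])

# A trailing comma before a closing brace is not valid JSON
try:
    json.loads('{"id": 1,}')
    strict = False
except json.JSONDecodeError:
    strict = True
print("trailing comma rejected:", strict)
```

Note that `zipcode` stays a string, which preserves leading zeros that would be lost if the value were stored as a number.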
Binary Input Formats
- Not all data collected is text-based
  - Images
  - Spreadsheets
  - Word processor and PDF documents
  - Audio and video
- These formats are not necessarily ideal for analysis
  - Not natively supported by Hadoop or ecosystem tools
  - Analysis typically involves format-specific custom code
- Often better to convert to a text-based format before processing
  - For example, convert Excel to CSV, or PDF to text
  - This may not be possible for some formats (such as images)
- In some cases, only the metadata is actually needed for analysis

SequenceFiles
- SequenceFiles are a Hadoop-specific format
  - Flat binary file consisting of key/value pairs
  - You are unlikely to receive data in this format, but may produce it
  - Hadoop, Hive and Pig all support this format well
- SequenceFiles offer better performance than text-based formats
  - No need to convert native types to and from String values
  - Compression (optional) may offer still better performance
- Unfortunately, SequenceFiles are closely tied to Java
  - Cannot access data easily from other languages
  - Even changing your Java classes can break compatibility
- SequenceFiles are beneficial for intermediate data
  - Limitations described above affect use for long-term storage

Avro
• Avro is an Apache project for data serialization
  - It addresses many of the shortcomings of SequenceFiles
  - Binary data format is concise and offers good performance
  - Language support (C, C++, C#, Java, Perl, Python, Ruby, PHP)
• Avro works with Hadoop MapReduce
  - In Java via AvroMapper and AvroReducer classes
  - In other languages via AvroAsTextInputFormat and AvroKeyValueOutputFormat in a Streaming job
• Avro also works with both Hive and Pig
• Data can be easily collected in Avro format using Flume

Data Quantity Considerations
• Data quantity is a defining characteristic of data science
• However, preliminary analysis is often best done with less data
  - Smaller data sets mean faster execution times
  - Faster execution times allow for more iterations
  - More iterations provide more opportunities to refine the solution

Filtering
• The goal of filtering is to limit the amount of data
  - Include only certain records
  - Exclude only certain records
  - Isolate only those fields relevant for analysis
  - Can also combine any of these approaches
• This can have a profound impact on performance
  - Eliminating data before it is processed will usually improve performance more than any optimization of your analysis code
$ gunzip -c logfile.gz | grep jsmith > only_jsmith.log
$ gunzip -c logfile.gz | grep -v 'GET /img/' > no_images.log
$ egrep '^63[0-3][0-9]{2}' clients.txt | egrep -vi 'smith|johnson' | less
$ cut -f1,3,9 mydata.tsv > mydata-3cols.txt

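The three kinds of filtering (include, exclude, isolate fields) can also be combined in a short script when a one-line shell pipeline becomes hard to read. This is a minimal sketch; the function name, its parameters, and the sample log lines are hypothetical.

```python
def filter_records(lines, include=None, exclude=None, fields=None, sep="\t"):
    """Yield lines that contain `include`, do not contain `exclude`,
    reduced to the 1-based column numbers in `fields` (like cut -f)."""
    for line in lines:
        line = line.rstrip("\n")
        if include is not None and include not in line:
            continue
        if exclude is not None and exclude in line:
            continue
        if fields is not None:
            columns = line.split(sep)
            line = sep.join(columns[i - 1] for i in fields if i <= len(columns))
        yield line

# Hypothetical tab-separated log lines: user, request, status
sample = [
    "jsmith\tGET /index.html\t200",
    "jsmith\tGET /img/logo.png\t200",
    "adoe\tGET /index.html\t404",
]

kept = list(filter_records(sample, include="jsmith",
                           exclude="GET /img/", fields=[1, 3]))
print(kept)
```

Only the first sample line survives all three filters, reduced to its user and status columns.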
Filtering (cont'd)
• Using grep is convenient, but it doesn't scale well
• How do you filter terabytes of data quickly?
  - Hadoop lets you divide and conquer across many nodes
  - Easy to do in Python, using Hadoop Streaming
$ cat mapper.py
#!/usr/bin/env python
import sys
# Emit only the input lines that mention jsmith (case-insensitive)
for line in sys.stdin:
    line = line.strip()
    if "jsmith" in line.lower():
        print(line)
$ hadoop jar \
    /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-stream*.jar \
    -mapper mapper.py -file mapper.py \
    -input /user/training/input \
    -output /user/training/output

Sampling
• Sampling allows you to capture a subset of your data
  - It's easier to examine and explore this sample than a huge data set
  - Differs from filtering in that eventually you'll want all the data
• There are different sampling strategies
  - Extracting a set of similar records
  - Choosing values at random
  - Intentionally selecting extreme values
$ head -n100 purchases.log
$ gunzip -c httpd.log.gz | tail -n300
$ dd bs=1M count=25 if=/home/training/foo.dat of=/tmp/25_megs_of_foo.dat
$ split -C5000 httpd-big.log httpd_

Sampling (cont'd)
• How can we get a random sample at scale?
  - Also easy to do with Hadoop Streaming and Python
  - The example below has a ~1% chance of selecting any record
  - Job is submitted as in the previous example
$ cat mapper.py
#!/usr/bin/env python
import sys
import random
# Probability of keeping any given record (0.01 = 1%)
include_pct = 0.01
for line in sys.stdin:
    line = line.strip()
    if random.random() < include_pct:
        print(line)

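Because each record is kept independently, the size of the sample is only approximately 1% of the input. The sketch below wraps the same logic in a function so it can be tried on in-memory data; the function name, the seed parameter, and the generated records are assumptions made for illustration.

```python
import random

def sample_lines(lines, include_pct=0.01, seed=None):
    """Keep each line independently with probability include_pct."""
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    for line in lines:
        if rng.random() < include_pct:
            yield line

# Hypothetical input: 100,000 numbered records
records = ("record-%d" % i for i in range(100000))
sample = list(sample_lines(records, include_pct=0.01, seed=42))

# The sample size is close to, but not exactly, 1% of the input
print(len(sample))
```

Without a seed, the sample differs on every run, which is usually what you want for exploration but makes results harder to reproduce.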
Measuring Input Data with Counters
• You might want to track types of records seen while processing
  - How many bad records did you encounter?
  - How many requests were for JPG versus PNG?
  - How many users accessed a certain URL?
• Hadoop supports counters and groups of counters
  - Can be incremented during processing
  - In Streaming, counters are updated by printing to STDERR
  - Result is shown in logs and Web UI once job is complete

Measuring Input Data with Counters (cont'd)
• General format: reporter:counter:<group>,<counter>,<amount>
• Python excerpt below shows how to count image types
  - Group name is FILE_TYPE
  - Counter name is one of JPG, PNG, or OTHER
  - The amount is the value added to the counter (1 per record here)
if ".jpg" in line.lower():
    sys.stderr.write("reporter:counter:FILE_TYPE,JPG,1\n")
elif ".png" in line.lower():
    sys.stderr.write("reporter:counter:FILE_TYPE,PNG,1\n")
else:
    sys.stderr.write("reporter:counter:FILE_TYPE,OTHER,1\n")
• There are limits on the number of counters and counter groups

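The counter excerpt can be made easier to test by separating the classification from the reporting. This is a sketch only: the helper names are hypothetical, and the trailing amount field follows the form reporter:counter:<group>,<counter>,<amount> given in the Hadoop Streaming documentation.

```python
import sys

def counter_line(group, counter, amount=1):
    """Build the line a Streaming task writes to STDERR to update a counter."""
    return "reporter:counter:%s,%s,%d\n" % (group, counter, amount)

def file_type(line):
    """Classify a log line by the image extension it mentions."""
    lowered = line.lower()
    if ".jpg" in lowered:
        return "JPG"
    if ".png" in lowered:
        return "PNG"
    return "OTHER"

def count_file_types(lines, err=sys.stderr):
    """Report one FILE_TYPE counter increment per input line."""
    for line in lines:
        err.write(counter_line("FILE_TYPE", file_type(line)))

# Hypothetical request lines; a real mapper would pass sys.stdin instead
count_file_types(["GET /img/a.JPG", "GET /img/b.png", "GET /index.html"])
```

Keeping file_type free of I/O means it can be checked on a workstation before the job is submitted to the cluster.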
Data Quality Overview
• Quality problems are inevitable in a sufficiently large data set
• Three main types of problem
  - Inconsistent: correct but with minor formatting variations
  - Invalid: incorrect but conforms to expected format
  - Corrupt: doesn't conform to expected format at all

Common Problems
• Inconsistencies in case of string values
  - CA versus ca
  - Recommendation: always convert to one case
• Inconsistencies in date formats
  - 12/31/2012 versus Dec. 31, 2012
  - Recommendation: always convert to one format
  - A string like 20121231 occupies less space and sorts correctly
• Inconsistencies in times
  - Is 12:00:00 noon or midnight? What time zone?
  - Recommendation: use a 24-hour format
  - Recommendation: use a consistent time zone, such as GMT

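Converting dates and times to a single format is straightforward once the incoming layouts are known. The sketch below assumes only the two date layouts named on the slide (plus the target layout itself) and a 12-hour clock with an AM/PM marker; the function names and the list of formats are hypothetical, and month names are assumed to be in English.

```python
from datetime import datetime

# Input layouts we expect to see; extend this list for your own data
KNOWN_DATE_FORMATS = ["%m/%d/%Y", "%b. %d, %Y", "%Y%m%d"]

def normalize_date(text):
    """Return the date as a YYYYMMDD string, or None if no layout matches."""
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y%m%d")
        except ValueError:
            continue
    return None

def normalize_time(text):
    """Convert a 12-hour clock time such as '12:00:00 PM' to 24-hour form."""
    return datetime.strptime(text.strip(), "%I:%M:%S %p").strftime("%H:%M:%S")

print(normalize_date("12/31/2012"))
print(normalize_date("Dec. 31, 2012"))
print(normalize_time("12:00:00 AM"))
```

Returning None for unrecognized input, rather than guessing, lets the caller count or log such records as bad data.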
Common Problems (cont'd)
• Differences in how missing values are represented
  - Does it use NULL or N/A or zero or spaces or an empty string?
  - Recommendation: use as few representations as possible
  - May need one for strings and another for numeric fields
• Variations in free-form input
  - CA versus California (might also be misspelled in various ways)
  - Recommendation: limit free-form input if possible
  - Recommendation: scan to find all variations, then normalize
$ cut -f5 data.tsv | sort | uniq -c | sort -rn
9887 CA
   8 California
   3 CA.
   1 Cailfornia
   1 Caleefornya

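The "scan, then normalize" recommendation can be sketched in two steps: count every variation, then map each known variation to one canonical value. The counts below are scaled down from the slide, and the lookup table is a hand-built, hypothetical example.

```python
from collections import Counter

# Values as they might appear in the state column (counts scaled down)
values = (["CA"] * 20 + ["California"] * 3
          + ["CA.", "Cailfornia", "Caleefornya", " ca "])

# Step 1: scan to find every variation and how often it occurs
variations = Counter(v.strip() for v in values)

# Step 2: map each known variation to one canonical value;
# unknown values are left as-is so they can be reviewed later
CANONICAL = {"ca": "CA", "ca.": "CA", "california": "CA",
             "cailfornia": "CA", "caleefornya": "CA"}

def normalize_state(value):
    key = value.strip().lower()
    return CANONICAL.get(key, value.strip())

normalized = Counter(normalize_state(v) for v in values)
print(variations.most_common())
print(normalized)
```

Rerunning the scan after normalizing is a quick check that no variation was missed.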
Identifying Bad Data
• Small scale strategies
  - Examine columns with UNIX tools like head, cut, and awk
  - Write custom code that inspects records
• Large scale strategies
  - Use counters in a Hadoop job to count bad records
  - Use logging in a Hadoop job to log unexpected data
    - Exercise caution with this approach!
    - Only log bad records, not all records

Resolution Techniques
• How do you fix the bad data once you've identified it?
  - Fix the data upstream to avoid the issue altogether
  - Pre-process the data to fix the problem before analysis
  - Correct bad data on the fly during analysis
  - Ignore bad data as you find it during analysis
• Which is best depends on a few factors
  - Can you control how data is generated?
  - How valuable is the data?
  - Will you analyze this data more than once?

Review Questions
• Why are SequenceFiles a poor choice for archiving data?
• Imagine that you're a data scientist working for a consumer marketing company. The US Census Bureau has just released the data for the 2010 census and you plan to add it to your Hadoop cluster. At what point would you check for inconsistencies in the data and what would you do to correct them?

Essential Points
• Some data formats are best suited for input, some for intermediate data, and others for final output
• Filtering irrelevant data can significantly improve performance
• Data sampling provides a subset that's easier to work with
• Issues with data quality are inevitable at scale; you must identify and correct them

Hands-On Exercise: Evaluating Data
• In this Hands-On Exercise, you will gain practice evaluating the quality of the input data acquired during the previous exercise, determining what problems it contains, and then correcting these problems
  - The result of this will be high-quality data that you'll use in upcoming exercises
• Please refer to the Hands-On Exercise Manual for instructions on exercise #2

Evaluating Input Data
In this chapter you have learned
• Which file types are commonly used for input and output
• What the advantages and disadvantages of these file types are
• Several ways to examine data on the command line and at scale
• How sampling and filtering data can improve your processing
• What data quality problems are typical and how you can fix them

Data Transformation
Chapter 7

Data Transformation
In this chapter you will learn
• Why you might wish to convert file formats prior to analysis
• How you can join both small and large data sets
• What anonymization is and why it's important
• How re-identification can expose an organization to liability

Chapter Topics
Data Transformation
• File format conversion
• Joining data sets
• Anonymization
• Review questions
• Essential points
• Hands-On Exercise: Transforming Data
• Conclusion

File Format Conversion
• Sometimes data isn't provided in the same format you require
  - The format might be suitable for data collection but not analysis
  - It might not be appropriate at expected scale
  - It might not be supported by the tool you need
  - Another format might offer better performance
  - Another format might be better for long-term storage
• The solution is often to convert data to another format

File Format Conversion
• Approaches to file format conversion for small data sets
  - UNIX command-line tools (tr, join, paste, sed, awk, etc.)
  - Use conversion utilities like ImageMagick, tidy or poppler-utils
  - Use an application's export or "Save As" feature
  - Write a script or small program to run on a single machine

File Format Conversion (cont'd)
• Approaches to large-scale conversion
  - Distributed conversion with custom code in a Map-only Hadoop job
  - Use Hadoop to write records in new format
    - SequenceFileOutputFormat
    - AvroOutputFormat
    - Custom subclasses of FileOutputFormat

Brief Introduction to Apache Hive
• Another way of converting file formats involves using Apache Hive
  - Let's first briefly cover what Hive is and what it can do
• Hive is an alternative to writing low-level MapReduce code
  - Users can analyze data stored in Hadoop via HiveQL
  - HiveQL is a declarative language very similar to SQL
• Hive does not turn your Hadoop cluster into a database
  - Instead, the Hive interpreter turns HiveQL into MapReduce jobs
  - Hive tables are simply directories of data stored in HDFS
  - The create table statement instructs Hive how to parse it

Brief Introduction to Apache Hive (cont'd)
• Hive is especially useful for joining data
  - We'll cover this later in the chapter, but here's an example in HiveQL
SELECT
FROM
ON
WHERE
GROUP BY

Hive SerDes
• Hive can read and write data in many file formats
  - Via implementations of the Serializer/Deserializer (SerDe) API
• There are many SerDes available for Hive, including
  - Delimited text file
  - RegexSerDe
  - JSON
  - Avro
• It's also possible to implement your own custom SerDe

Regex SerDe Example
• Load a log file into a three column table
  - Sample input file
30-Nov-2012 23:15:21 "Unusual event detected"
01-Dec-2012 01:33:02 "System shutdown"
01-Dec-2012 01:34:59 "System restarted"
  - Create table example (one capture group per column)
CREATE TABLE LOGDATA (date_str STRING, time_str STRING, msg STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "^(\\d{2}-\\w{3}-\\d{4})\\s+(\\d{2}:\\d{2}:\\d{2})\\s+\\\"(.+)\\\"\\s*"
);

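It is worth testing a pattern against sample lines before putting it in a table definition. The sketch below checks a three-group version of the pattern (date, time, quoted message) with Python's re module; the pattern here is written as a raw string with single backslashes, and although Java and Python regex dialects differ in places, this pattern uses only syntax common to both.

```python
import re

# Three capture groups: date, time, and the quoted message
LOG_PATTERN = re.compile(
    r'^(\d{2}-\w{3}-\d{4})\s+(\d{2}:\d{2}:\d{2})\s+"(.+)"\s*$')

sample_lines = [
    '30-Nov-2012 23:15:21 "Unusual event detected"',
    '01-Dec-2012 01:33:02 "System shutdown"',
    '01-Dec-2012 01:34:59 "System restarted"',
]

rows = []
for line in sample_lines:
    match = LOG_PATTERN.match(line)
    # Keep None for lines that do not match so they are easy to spot
    rows.append(match.groups() if match else None)

print(rows)
```

Each sample line yields exactly three values, matching the three columns of the table.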
Joining Data Sets
• Joins are a common operation with relational data
  - Two distinct data sets are joined based on a common key field
• Data scientists use joins to connect disparate data sets
  - This provides insight that no single data set could
• Joining data is an expensive operation
  - It's important to amortize the cost of doing joins
  - Do joins once, up front
Customers
1  Alice
2  Bob
3  Chuck
4  Darla
5  Eduardo
6  Fran
Orders
1  $29.32
2  $8.57
1  $14.78
3  $21.95
2  $4.92
5  $3.74
3  $16.73
6  $26.52

Joining Data Sets (cont'd)
• The UNIX join command can be used for small data sets
  - Will join on key field in first column by default
  - Options allow key field in a different column for each file
$ cat customers.txt
1 Alice
2 Bob
3 Chuck
$ cat orders.txt
1 $3.97
1 $19.34
2 $8.55
3 $8.22
$ join customers.txt orders.txt
1 Alice $3.97
1 Alice $19.34
2 Bob $8.55
3 Chuck $8.22

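The same inner join can be sketched in a few lines of Python by indexing one data set by key and probing it with the other (a hash join). The function name is hypothetical, and the data mirrors customers.txt and orders.txt.

```python
from collections import defaultdict

customers = [("1", "Alice"), ("2", "Bob"), ("3", "Chuck")]
orders = [("1", "$3.97"), ("1", "$19.34"), ("2", "$8.55"), ("3", "$8.22")]

def hash_join(left, right):
    """Inner join of two lists of (key, value) pairs on the key."""
    # Build an in-memory index of the left side, keyed by the join field
    index = defaultdict(list)
    for key, value in left:
        index[key].append(value)
    # Probe the index once per right-side record
    joined = []
    for key, value in right:
        for left_value in index.get(key, []):
            joined.append((key, left_value, value))
    return joined

for row in hash_join(customers, orders):
    print(" ".join(row))
```

Unlike the join command, this approach does not need sorted input, but the indexed data set must fit in memory, so it only suits cases where at least one side is small.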
Joining Larger Data Sets
• Doing joins in a relational database may be an option
  - If the data originated in the RDBMS
  - Do the join in a SQL statement
  - Export the result to a file for analysis
• It's also possible to join data with Hadoop using MapReduce

Joining Data Sets with Hive
• Hive is an alternative to writing low-level MapReduce code
• Joining data sets with Hive is easy
  - Usually preferable to writing MapReduce code to do joins
• Benefits of using Hive for joins
  - Far less code
  - Much quicker to write
  - Less chance for error
  - Requires far less skill, so it's accessible to more people
• Disadvantages of using Hive
  - Slightly less control

Joining Data Sets with Hive (cont'd)
• The following is an example of a join in Hive
  - This is equivalent to several dozen lines of MapReduce code
SELECT customer.fname, customer.lname, customer.email,
       order.date, order.amount
FROM customer
JOIN order ON (customer.cid = order.cid)
WHERE order.amount >= 50;

Joining Data Sets with Hive (cont'd)
• Recall two important points mentioned previously
  - Joining disparate data sets yields insight
  - Joins are expensive; amortize the cost by doing them only once
• Hive is very scalable
  - Joins may produce dozens or hundreds of columns
• Common to output a huge single file from several data sets
  - Everything a hospital knows about a patient
    - Name, contact info, insurance info, medical history, etc.
  - Everything a company knows about a customer
    - Name, demographics, past orders, Web site sessions, etc.
  - Everything a manufacturer knows about a product
    - Configurations, part numbers, suppliers, defect history, etc.

Anonymization
• Anonymization is the process of removing personally identifiable information (PII) from data
  - IDs
  - Names
  - Addresses
  - Phone numbers
  - Potentially many other kinds of information
• Why anonymize data?
  - Laws may require it, particularly for finance and healthcare data
  - Industry standards
  - Company policies
  - Protects against attack

Anonymization (cont'd)
• Limiting access to non-anonymized data is essential
  - Original data should be readable by as few people as possible
• Typical anonymization process is to remove identifying columns
  - First join all data sets
  - Next remove all ID fields needed only for the join
  - Finally, suppress fields which contain any PII

Re-Identification
• Several companies have been affected by re-identification
  - For example: Researchers found Netflix prize data could re-identify people
  - Past ratings may expose political and sexual orientation
  - $5 billion class action suit filed and later settled
• Include only the minimum data required for the intended purpose
  - A single field might not be PII, but a combination of them might
  - Research has shown that 87% of U.S. population can be uniquely identified from gender, ZIP code and date of birth
• Using partial values can further anonymize data
  - Use only first three digits of ZIP code
  - Retain the year of birth but exclude the month and day

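The two partial-value techniques (three-digit ZIP code, year of birth only) can be sketched as a single function that also drops direct identifiers. The record layout, field names, and the YYYYMMDD date format are assumptions made for illustration.

```python
def generalize(record):
    """Return a copy with direct identifiers dropped and
    quasi-identifiers reduced to partial values."""
    return {
        "gender": record["gender"],
        "zip3": record["zipcode"][:3],              # first three digits only
        "birth_year": record["date_of_birth"][:4],  # year only, no month/day
    }

# Hypothetical record; date_of_birth is assumed to be in YYYYMMDD form
person = {"name": "John Smith", "gender": "M",
          "zipcode": "90210", "date_of_birth": "19700412"}

print(generalize(person))
```

Coarser values reduce the risk of re-identification but do not eliminate it, so the result should still be reviewed against the intended purpose.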
Review Questions
• Which laws, regulations, or policies mandate anonymization?
  - What are some other reasons you may want to anonymize data?
• How would you convert a single Excel spreadsheet to CSV?
  - How would you convert 100,000 Excel spreadsheets to CSV?
• Why should you join data sets early in the project and why would you store the result of data sets you've joined?

Essential Points
• Anonymization removes personally identifiable information (PII) from your data and may be required by laws, regulations, or company policies
• Data isn't always provided in the format you need, so you may have to convert it
• You should join data sets once, as early as possible, and store the result to allow you to amortize the cost of this operation

Hands-On Exercise: Transforming Data
• In this Hands-On Exercise, you will gain practice using Hive to join the disparate data sets you've previously acquired
  - This will produce a JSON object for each user to be further analyzed in upcoming exercises
• Please refer to the Hands-On Exercise Manual for instructions on exercise #3

Data Transformation
In this chapter you have learned
• Why you might wish to convert file formats prior to analysis
• How you can join both small and large data sets
• What anonymization is and why it's important
• How re-identification can expose an organization to liability

Data Analysis and Statistical Methods
Chapter 8

Data Analysis and Statistical Methods
In this chapter you will learn
• How statistics and probability are related
• How scatterplots can help you identify numeric errors
• How to evaluate data distribution
• How extreme values can mislead you
• How regression analysis can help to predict missing values
• Which types of variables are important in regression analysis

Chapter Topics
Data Analysis and Statistical Methods
• Relationship between statistics and probability
• Descriptive statistics
• Inferential statistics
• Review questions
• Essential points
• Conclusion

Comparison between Probability and Statistics
• Probability deals with the prediction of future events
  - This is inherently theoretical, as it's based on an ideal world
  - Example: Which movies is this user likely to enjoy?
• Statistics deals with measurements from past events
  - This is inherently more practical, as it's based on the real world
  - Example: Which movies did this user rate highest?

The Cycle of Prediction and Measurement
• Data scientists regularly employ both prediction and measurement
  - Predict what's likely to happen given a particular set of circumstances
  - Then, conduct an experiment to measure the accuracy of the prediction
• This approach is cyclical
  - Results from past experiments influence future predictions
• Ultimately, statistics and probability are closely related
  - They can be viewed as two sides of the same coin

Relationship of Probability and Statistics
[Figure: diagram relating probability and observed data. Source: All of Statistics: A Concise Course in Statistical Inference, Larry Wasserman, 2003]
• Probability: given a data generating process
  - What are the properties of the outcomes?
• Statistics: given the outcomes
  - What can we say about the process that generated the data?

Descriptive Statistics
• Descriptive statistics answer questions about data distribution
  - What is the range of values (distance from min to max)?
  - Where are values concentrated within this range?
• Understanding data distribution is a preliminary step
  - Helps to expose dirty data that you should correct or remove
  - Assists you in finding interesting patterns within the data
• Distribution may affect how further analysis is performed
• Visualization is an essential tool for analyzing distributions
  - Scatterplots and histograms are particularly helpful

Exposing Data Errors with Scatterplots
• Data may contain bugs just as software does
  - These can be difficult to find and may lead to invalid conclusions
  - Looking at a set of numbers seldom uncovers the problem
6.112   5.871
6.917   5.892
6.547   6.020
6.823   5.418
5.879   6.574
6.554   6.877
5.741   6.297
5.447   4.179
5.847   41.17
5.974   4.713
6.247   7.474
6.551   6.874
5.441   7.347
6.774   7.514
6.018   7.142

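A numeric check can complement a visual one. The sketch below flags values that lie far from the median, measured in units of the median absolute deviation (MAD). This technique is not part of the course material; it is shown only as one possible automated check, and the threshold of 5 is an arbitrary assumption.

```python
from statistics import median

# The 15 value pairs listed on the slide
pairs = [
    (6.112, 5.871), (6.917, 5.892), (6.547, 6.020), (6.823, 5.418),
    (5.879, 6.574), (6.554, 6.877), (5.741, 6.297), (5.447, 4.179),
    (5.847, 41.17), (5.974, 4.713), (6.247, 7.474), (6.551, 6.874),
    (5.441, 7.347), (6.774, 7.514), (6.018, 7.142),
]

def flag_outliers(values, threshold=5.0):
    """Return values whose distance from the median exceeds threshold * MAD."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    return [v for v in values if abs(v - center) > threshold * mad]

xs = [x for x, _ in pairs]
ys = [y for _, y in pairs]

print(flag_outliers(xs))
print(flag_outliers(ys))
```

Only 41.17 in the second column is flagged, which is likely a misplaced decimal point; whether to correct or remove such a value still requires judgment.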
Exposing Data Errors with Scatterplots (cont'd)
• Scatterplots help to expose potentially invalid data
  - The latent problem in this data set is obvious when visualized
[Figure: scatterplot of the 15 value pairs from the previous slide; x-axis from 0 to 8, y-axis from 0 to 45. The point (5.847, 41.17) lies far above all the others.]

Finding Interesting Patterns with Scatterplots
• Scatterplots also help point out relationships in data
[Figure: Session Duration vs. Annual Income. Scatterplot of session duration (hours, 0.00 to 6.00) against annual income (USD, 0 to 120,000).]

Histograms
• A histogram illustrates the distribution of data
  - This helps you to compare relative frequency
[Figure: Movies Viewed by Day of Week (in millions). One bar per day, Monday through Sunday; y-axis marked from 1 to 7.]

Normal Distribution
• The normal distribution of data is often called the "bell curve"
  - Distribution is symmetrical about the mean (average)
  - More than two-thirds lies within one standard deviation
  - Standard deviation is a measure of dispersion from the mean
[Figure: bell curve with the x-axis marked from -3 to +3 standard deviations around the mean. Share of values in each band: 0.1% below -3, 2.1% from -3 to -2, 13.6% from -2 to -1, 34.1% from -1 to the mean, 34.1% from the mean to +1, 13.6% from +1 to +2, 2.1% from +2 to +3, 0.1% above +3.]

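The share of a normal distribution that lies within a given number of standard deviations can be computed directly. This sketch uses NormalDist from Python's standard statistics module (Python 3.8 or later); the helper name is hypothetical.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
standard = NormalDist(mu=0.0, sigma=1.0)

def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard.cdf(k) - standard.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k) * 100, 1))
```

The results are roughly 68.3%, 95.4% and 99.7%, which is the basis for the statement that more than two-thirds of values lie within one standard deviation.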
Skewed Distributions
• Data often deviates from the normal distribution
  - Skewed data is asymmetrically concentrated away from mean
• Skew is natural, but you should understand the reason for it
[Figure: skewed distribution curve with the x-axis marked from -3 to +3 standard deviations around the mean and labels at 15, 75, 140 and 200 minutes; band shares shown are 1.3%, 7.9%, 33.6%, 28.1%, 12.1%, 13.8% and 0.4%.]

Skewed Distributions (cont'd)
• Our set of movie ratings is also skewed
  - The mean (average) value should be 3, but is actually 3.53
  - This is because the mode (most common value) is 4
• What's the cause of the inflated ratings?
[Figure: Distribution of Movie Ratings. Histogram of rating counts; y-axis marked from 5,000 to 40,000.]

The Effects of Extremal Values
• Extremal values are common in real-world data sets
  - These are very large, very small, or very rare values
• These values make averages (mean) misleading
  - In such cases, the median is a better measure of what's typical
• Example: Cloudera Movies customers' annual income
  - Mean household income of our customers is $127,396
  - The mean is skewed by a few very affluent customers
  - Our customers' median household income is only $47,835

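The effect of a single extreme value on the mean can be reproduced with a few lines of Python. This is a sketch only: the income figures below are hypothetical and are not the Cloudera Movies data.

```python
import statistics

# Hypothetical household incomes (USD); one very affluent household at the end
incomes = [28_000, 35_000, 41_000, 47_000, 52_000, 63_000, 1_500_000]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

print(f"mean:   ${mean_income:,.0f}")
print(f"median: ${median_income:,.0f}")
```

The mean lands far above what any typical household in the list earns, while the median stays at the middle value.
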
Using Percentiles to Detect Extremal Values
• Percentiles can help to identify extremal values
  - They represent the point below which a percentage of values fall
  - The typical customer lies between the 25th and 75th percentiles
[Figure: "Annual Income of Cloudera Movies Customers", by percentile]
  99th: $561,598
  75th: $89,747
  50th: $47,835
  25th: $31,376
  1st: $9,362

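A percentile lookup can be sketched as follows. The `percentile` helper uses the nearest-rank definition, which is one of several common conventions and an assumption here; the income values are hypothetical.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical annual incomes (USD); not the Cloudera Movies data
incomes = [9_000, 22_000, 31_000, 38_000, 45_000,
           50_000, 62_000, 75_000, 90_000, 560_000]

low, high = percentile(incomes, 25), percentile(incomes, 75)
typical = [x for x in incomes if low <= x <= high]               # the "typical" middle band
extreme = [x for x in incomes if x >= percentile(incomes, 99)]   # candidates for extremal values

print(low, high, extreme)
```

Values at or above the 99th percentile are only candidates for review; whether they are errors or genuine outliers still needs a human decision.
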
Chapter Topics
Data Analysis and Statistical Methods
• Relationship between statistics and probability
• Descriptive statistics
• Inferential statistics (current section)
• Review questions
• Essential points
• Conclusion

Inferential Statistics
• Inferential statistics attempts to draw conclusions based on data
  - You're estimating the parameters that led to the observed output
  - This will help you to determine what to do next in your analysis
• Consider the results of this Cloudera Movies customer survey
  - Inferential statistics can help us evaluate why
[Figure: survey results; x-axis: dollar amounts from $7.00 to $12.00; y-axis: number of customers responding]

Dependent and Independent Variables
• There are two main variables to consider
  - Dependent
      The output result we're interested in measuring
  - Independent
      Input parameter(s) we're testing for their effect on the output
• You must account for covariates
  - These are input parameters that might also affect the result
  - They aren't tested, but must be controlled to avoid interference

Regression Analysis
• A statistical technique for analyzing the relationship between these variables
  - How much does the dependent variable change, given a corresponding change in only one of the independent variables?
• For example, how does income affect the fee a customer is willing to pay?
[Figure: scatterplot "Max Acceptable Monthly Fee vs. Customer Income"; x-axis: annual income (USD), 0 to 100,000; y-axis: max acceptable monthly fee (USD), 5.00 to 13.00]

Regression Analysis and Variable Types
• Several factors might affect what a customer is willing to pay
  - Which regression analysis technique is appropriate depends on the type of dependent variable you need to analyze
• Continuous variables have an unbounded range of values
  - Customer's annual income
  - Customer's age
• Categorical variables have a finite set of values
  - Movie rating on a scale of 1 to 5
  - Customer's state of residence
• Binary variables are a subset of categorical variables with only two values
  - Gender (male or female)
  - Do you subscribe to cable television (yes or no)

Linear Regression
• Appropriate for continuous dependent variables
  - Such as income, height, duration, age, speed, or temperature
• Basic formula for linear regression: Y = βX + ε
  - Y is the dependent variable
  - X is an independent variable
  - Beta (β) is a coefficient that shows the change in Y per unit change in X
  - Epsilon (ε) represents a random distribution of error
• Observations for X must be independent of one another
  - Counterexample: the number of DVDs in our catalog one year influences the next year's number
  - Counterexample: multiple ratings from the same user are correlated

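The linear regression formula can be illustrated with a least-squares estimate of the coefficient. This is a minimal sketch under stated assumptions: the observations are hypothetical, and the model has no intercept term because the formula on the slide has none (practical models usually add one).

```python
# Hypothetical observations: xs is the independent variable, ys the dependent one
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# Least-squares estimate of beta for the model Y = beta * X + epsilon
beta = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Residuals are the part of each Y that the model attributes to epsilon (error)
residuals = [y - beta * x for x, y in zip(xs, ys)]

print(f"beta = {beta:.2f}")
```

Here beta comes out close to 2, meaning Y rises by about two units for every one-unit increase in X.
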
Logistic Regression
• Appropriate for binary dependent variables
  - Such as "is or is not spam" or "did or did not click on ad"
  - Can be used to model the likelihood of a boolean action occurring
[Figure: chart with x-axis from $9.00 to $14.00]

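A logistic model turns a linear combination of inputs into a probability. The sketch below is illustrative only: the data is hypothetical, the function names are assumptions, and the fitting loop is plain gradient ascent on the log-likelihood rather than the method any particular library uses.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical data: monthly fee offered (USD) and whether the customer accepted (1) or not (0)
fees = [9.0, 10.0, 11.0, 12.0, 13.0, 14.0]
accepted = [1, 1, 1, 0, 0, 0]

# Center the fees so the gradient steps stay well-behaved
center = sum(fees) / len(fees)
xs = [fee - center for fee in fees]

w, b = 0.0, 0.0  # coefficient and intercept
learning_rate = 0.1
for _ in range(1000):
    # Difference between the observed outcome and the predicted probability
    errors = [y - sigmoid(w * x + b) for x, y in zip(xs, accepted)]
    w += learning_rate * sum(e * x for e, x in zip(errors, xs))
    b += learning_rate * sum(errors)

def probability_of_acceptance(fee):
    """Predicted likelihood that a customer accepts the given monthly fee."""
    return sigmoid(w * (fee - center) + b)

print(probability_of_acceptance(9.0), probability_of_acceptance(14.0))
```

The fitted coefficient is negative, so the predicted likelihood of acceptance falls as the fee rises.
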
Chapter Topics (Data Analysis and Statistical Methods): Review questions

Review Questions
• What might you do to identify extreme data for testing?
• Consider the following hypothesis: "A customer is more likely to assign a higher rating to a given movie when other customers in the same ZIP code gave the same movie a higher rating than it received elsewhere."
  - What is the dependent variable?
  - What is the independent variable?
  - What are the covariates?
• What variable type (continuous or binary) are the following?
  - Whether or not customer subscribes to cable TV
  - Length of time at job
  - Value of customer's residence

Chapter Topics (Data Analysis and Statistical Methods): Essential points

Essential Points
• Scatterplots show both numeric errors and interesting patterns
• Data distribution is a key first step in analysis
• Skewed distributions and extreme values can mislead you
• Testing edge cases is important and biased sampling can help
• Correlation does not imply causation

Chapter Topics (Data Analysis and Statistical Methods): Conclusion

Data Analysis and Statistical Methods
In this chapter you have learned
• How statistics and probability are related
• How scatterplots can help you identify numeric errors
• How to evaluate data distribution
• How extreme values can mislead you
• How regression analysis can help to predict missing values
• Which types of variables are important in regression analysis

Bibliography
The following offer more information on topics discussed in this chapter
• Causal inference in statistics: An overview (Judea Pearl)
  - http://tiny.cloudera.com/dscc08a
• Head First Data Analysis
  - http://tiny.cloudera.com/dscc08b
• The Art of R Programming
  - http://tiny.cloudera.com/dscc08c
• The Future of Data Analysis
  - http://tiny.cloudera.com/dscc08d
• Regression Modeling Strategies
  - http://tiny.cloudera.com/dscc08e

Fundamentals of Machine Learning
Chapter 9

Fundamentals of Machine Learning
In this chapter you will learn
• What machine learning is
• What three common machine learning techniques are
• How organizations are applying these techniques
• What the relationship between algorithms and data volume is
• How the Naïve Bayes classification algorithm uses probabilities

Chapter Topics
Fundamentals of Machine Learning
• Overview (current section)
• The three Cs of machine learning
• Importance of data and algorithms
• Spotlight: Naïve Bayes classifiers
• Review questions
• Hands-On Exercise: Analysis of social media
• Essential points
• Conclusion

Fundamentals of Computer Programming
• Let's first consider how a typical program works
  - Hardcoded conditional logic
  - Predefined reactions when those conditions are met

$ cat spam-filter.py
#!/usr/bin/env python
import sys

# Check every input line against a fixed set of hardcoded phrases
for line in sys.stdin:
    if "Make MONEY Fa$t At Home!!!" in line:
        print("This message is likely spam")
    if "Happy Birthday from Aunt Betty" in line:
        print("This message is probably OK")

• The programmer must consider all possibilities at design time
• An alternative technique is to have computers learn what to do

What is Machine Learning
• Machine learning is a field within artificial intelligence (AI)
  - AI: "the science and engineering of making intelligent machines"
• Machine learning focuses on automated knowledge acquisition
  - Primarily through the design and implementation of algorithms
  - These algorithms require empirical data as input
• Machine learning algorithms learn based on input provided
  - Amount of data is often more important than the algorithm itself

What is Machine Learning (cont'd)
• The output produced varies by application
  - Product recommendations
  - Items grouped based on similarity
  - Possible diagnosis of a disease
• These are examples of "The Three Cs" of machine learning

Chapter Topics (Fundamentals of Machine Learning): The three Cs of machine learning

The Three Cs
• Three established categories of machine learning techniques:
  - Collaborative filtering (recommendations)
  - Clustering
  - Classification
• This course will focus on collaborative filtering
  - Though we'll also cover a simple classification algorithm

Collaborative Filtering
• Collaborative filtering is a technique for recommendations
  - It's one primary type of recommender system
  - We'll cover it in detail during the next several chapters
• Helps users find items of relevance
  - Among a potentially vast number of choices
  - Based on comparison of preferences between users
[Illustration: "Hi, Bob. We saw how much you liked The Godfather. We think you'll also enjoy Goodfellas and Casino."]

Applications Involving Collaborative Filtering
• Collaborative filtering is domain agnostic
• Can use the same algorithm to recommend practically anything
  - Movies (Cloudera Movies, oh, and Netflix too)
  - Television (TiVo Suggestions)
  - Music (several popular music download and streaming services)
• Amazon uses CF to recommend a variety of products

Clustering
• Clustering algorithms discover structure in collections of data
  - Where no formal structure previously existed
• They discover what clusters (groupings) naturally occur in data
  - By examining various properties of the input data
• Clustering is often used for exploratory analysis
  - Divide a huge amount of data into smaller groups
  - Can then tune analysis for each group
[Figure: clusters plotted by price, brand status, and store vs. online]

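To make the idea of discovering groupings concrete, here is a minimal sketch of k-means, one widely used clustering algorithm. The slides do not name a specific algorithm, so this choice is an assumption, as are the function name `kmeans`, the deterministic starting centroids, and the sample points.

```python
def kmeans(points, k, iterations=10):
    """Group 2-D points into k clusters; returns (labels, centroids)."""
    centroids = list(points[:k])  # simple deterministic start: the first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        for i, (x, y) in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2,
            )
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return labels, centroids

# Hypothetical customers described by two numeric properties
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

No labels are supplied up front; the two groups emerge purely from the distances between points.
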
Applications Involving Clustering
• Market segmentation
  - Group similar customers in order to target them effectively
• Finding related news articles
  - Google News
• Epidemiological studies
  - For example, identifying a cancer cluster and finding the root cause
• Computer vision (groups of pixels that cohere into objects)
  - Related pixels clustered to recognize faces or license plates

Classification
• The previous two techniques are unsupervised learning
  - The algorithm discovers recommendations or groups
• Classification is a form of supervised learning
  - Requires training with data that has known labels
      "These are healthy cells, those are cancerous"
  - Learns how to label new records based on that information
[Figure: animals plotted by speed (slow to fast) and size (small to large), including a cheetah and a turtle]

Applications Involving Classification
• Spam filtering
  - Train using a set of spam and non-spam messages
  - System will eventually learn to detect unwanted e-mail
• Oncology
  - Train using images of benign and malignant tumors
  - System will eventually learn to identify cancer
• Risk analysis
  - Train using financial records of customers who do/don't default
  - System will eventually learn to identify risky customers

Chapter Topics (Fundamentals of Machine Learning): Importance of data and algorithms

Relationship of Algorithms and Data Volume
• There are many algorithms for each type of machine learning
  - There's no overall "best" algorithm
  - Each algorithm has advantages and limitations
• Algorithm choice is often related to data volume
  - Some scale better than others
• Most algorithms offer better results as volume increases
  - Best approach = simple algorithm + lots of data

Relationship of Algorithms and Data Volume (cont'd)
"It's not who has the best algorithms that wins. It's who has the most data." [Banko and Brill, 2001]
[Figure: test accuracy (0.70 to 1.00) vs. millions of words (0.1 to 1000) for four learners: Memory-Based, Winnow, Perceptron, and Naive Bayes]

Chapter Topics (Fundamentals of Machine Learning): Spotlight: Naïve Bayes classifiers

Naïve Bayes Classifiers
• Naïve Bayes is a simple, but effective, classification algorithm
• Based on the concept of conditional probability
  - How likely is outcome Z, given conditions X and Y
  - Each condition is evaluated independently
• Each condition must satisfy two constraints
  - It must be independent of every other condition
  - All of the independent variables must be binary; no continuous variables allowed
• Spam filtering is a classic example of naïve Bayes classification

Naïve Bayes Classifiers (cont'd)
• We've analyzed our inbox and found the following
  - 87.1% of messages from unknown senders are spam
      From known sender? Y: 12.9% likelihood message is spam
      From known sender? N: 87.1% likelihood message is spam
  - 94.7% of messages mentioning "Rolex" are spam
      Mentions "Rolex"? Y: 94.7% likelihood message is spam
      Mentions "Rolex"? N: 5.3% likelihood message is spam

Naïve Bayes Classifiers (cont'd)
• Applying the result of multiple tests improves overall detection
  - We can predict how likely a message from an unknown sender that mentions "Rolex" is to be spam
[Figure: decision tree combining both tests; combined likelihood that the message is spam]
  Known sender? Y (12.9%), mentions "Rolex"? Y (94.7%): 72.6%
  Known sender? Y (12.9%), mentions "Rolex"? N (5.3%): 0.8%
  Known sender? N (87.1%), mentions "Rolex"? Y (94.7%): 99.2%
  Known sender? N (87.1%), mentions "Rolex"? N (5.3%): 27.4%

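The combined likelihoods can be reproduced with a short calculation. This is a sketch under one stated assumption: spam and non-spam are treated as equally likely before any test is applied. The slides do not say how the combined figures were computed, but this formula yields the same percentages. The function name `combine` is introduced here for illustration.

```python
def combine(p_a, p_b):
    """Combine two independent spam likelihoods, assuming equal priors."""
    spam = p_a * p_b
    not_spam = (1 - p_a) * (1 - p_b)
    return spam / (spam + not_spam)

# Likelihoods that a message is spam, taken from the slides
unknown_sender = 0.871
known_sender = 0.129
mentions_rolex = 0.947
no_rolex = 0.053

print(f"unknown sender, mentions Rolex: {combine(unknown_sender, mentions_rolex):.1%}")
print(f"unknown sender, no Rolex:       {combine(unknown_sender, no_rolex):.1%}")
print(f"known sender, mentions Rolex:   {combine(known_sender, mentions_rolex):.1%}")
print(f"known sender, no Rolex:         {combine(known_sender, no_rolex):.1%}")
```

Each test is evaluated independently and the results are multiplied, which is the "naïve" independence assumption at work.
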
Applications of Bayesian Classifiers
• Bayesian classification can do a lot more than filter spam
  - Medical diagnosis
  - Root cause analysis
  - Prediction of loan default

Chapter Topics (Fundamentals of Machine Learning): Review questions

Review Questions
• What are the three Cs of machine learning?
• Classification algorithms like Naïve Bayes are used in many areas of everyday life. What are some ways they might help Cloudera Movies?

Chapter Topics (Fundamentals of Machine Learning): Hands-On Exercise: Analysis of social media

Hands-On Exercise: Analysis of Social Media
• In this Hands-On Exercise, you will gain practice performing statistical analysis on the Cloudera Movies customer data
  - You will use R and Python to analyze the social media data collected about our customers. This information provides you insight into which movies they prefer and will be used to improve our recommender in a subsequent hands-on exercise.
• Please refer to the Hands-On Exercise Manual for instructions on exercise #4

Chapter Topics (Fundamentals of Machine Learning): Essential points

Essential Points
• Machine learning algorithms learn based on data provided
• Collaborative filtering recommends items
• Clustering discovers how to group a set of items into subsets
• Classification is supervised learning that can identify item types
• More data is usually preferable to a better algorithm

Chapter Topics (Fundamentals of Machine Learning): Conclusion

Fundamentals of Machine Learning
In this chapter you have learned
• What machine learning is
• What three common machine learning techniques are
• How organizations are applying these techniques
• What the relationship between algorithms and data volume is
• How the Naïve Bayes classification algorithm uses probabilities

Bibliography
The following offer more information on topics discussed in this chapter
• Programming Collective Intelligence
  - http://tiny.cloudera.com/dscc09a
• Andrew Ng's Online Course on Machine Learning at Coursera
  - http://tiny.cloudera.com/dscc09b
• Learning With Large Datasets
  - http://tiny.cloudera.com/dscc09c
• The Elements of Statistical Learning
  - http://tiny.cloudera.com/dscc09d
• Machine Learning: A Probabilistic Perspective
  - http://tiny.cloudera.com/dscc09e

Recommender Overview
Chapter 10

Recommender Overview
In this chapter you will learn
• What the difference is between content-based and collaborative filtering recommender systems
• Which limitations recommender systems frequently encounter
• How collaborative filtering can identify similar users and items
• How Tanimoto and Euclidean distance similarity metrics work

Chapter Topics
Recommender Overview
• What is a recommender system? (current section)
• Types of collaborative filtering
• Limitations of recommender systems
• Fundamental concepts
• Review questions
• Hands-On Exercise: Implementing a Basic Recommender
• Essential points
• Conclusion

What is a Recommender System?
• Recommenders are a type of filter
• They help users find relevant items within a huge selection
  - How do you find an interesting movie among 95,000 choices?
  - They help you find things you didn't know to look for
• Recommenders use preferences to predict preferences
  - Input is feedback about likes and/or dislikes
  - Output is a list of suggested items based on feedback received
• Two main types of recommenders
  - Content-based
  - Collaborative filtering

Content-Based Recommenders
• Content-based recommenders consider an item's attributes
  - These attributes describe the item
• Examples of item attributes
  - Movies: actor, director, screenwriter, producer, and location
  - Music: songwriter, style, musicians, vocalist, meter, and tempo
  - Books: author, publisher, subject, illustrations, and page count
• A user's taste defines values and weights for each attribute
  - These are supplied as input to the recommender

Content-Based Recommenders (cont'd)
• Content-based recommenders are domain-specific
  - Because attributes don't transcend item types
• Examples of content-based recommendations
  - You like 1980s action films starring Chuck Norris; try Delta Force
  - You like abstract rock from the 1970s; try Dark Side of the Moon

Collaborative Filtering
• Collaborative filtering is an inherently social system
  - It recommends items based on preferences of similar users
• It's similar to how you get recommendations from friends
  - Query those people who share your interests
  - They'll know movies you haven't seen and would probably like
  - And you'll be able to recommend some to them
• This approach is not domain-specific
  - System doesn't know anything about the items it recommends
  - The same algorithm can be used to recommend any type of product
• We'll discuss collaborative filtering in detail during this chapter

Hybrid Recommenders
• Content-based and collaborative filtering are two approaches
• Each has advantages and limitations
  - We'll discuss these in a moment
• It's also possible to combine these approaches
  - For example, predict rating using content-based approach
  - Then predict rating using collaborative filtering
  - Finally, average these values to create a hybrid prediction
• Research demonstrates that this can offer better results than using either system on its own
  - Netflix and other companies use hybrid recommenders

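The averaging step described for hybrid recommenders is simple enough to show directly. This is a minimal sketch; the function name and the two input predictions are hypothetical, and real systems often weight the two sources rather than averaging them equally.

```python
def hybrid_prediction(content_based_rating, collaborative_rating):
    """Average two predicted ratings into a single hybrid prediction."""
    return (content_based_rating + collaborative_rating) / 2

# Hypothetical predictions for one user and one movie on a 1-to-5 scale
print(hybrid_prediction(4.0, 3.0))
```
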
Chapter Topics (Recommender Overview): Types of collaborative filtering

Types of Collaborative Filtering
• Collaborative filtering can be subdivided into two main types
• User-based: What do users similar to you like?
  - For a given user, find other people who have similar tastes
  - Then, recommend items based on past behavior of those users
• Item-based: What is similar to other items you like?
  - Given items that a user likes, determine which items are similar
  - Make recommendations to the user based on those items

User-Based Collaborative Filtering
• User-based collaborative filtering is social
  - It takes a "people first" approach, based on common interests
• In this example, Alice and Donna have similar tastes
  - Each is likely to enjoy a movie that the other rated highly
[Figure: users Alice, Bob, Chuck, Donna, Eddie, and Frank positioned by their ratings (1 to 5) of Titanic and Tombstone]

Item-Based Collaborative Filtering
• After examining more of these ratings, patterns emerge
  - Strong correlations between movies suggest they're similar
[Figure: ratings (1 to 5) given by Alice, Bob, Chuck, Donna, and Eddie to Fletch, Stripes, Twilight, and Prom]

Item-Based Collaborative Filtering (cont'd)
• The item-based approach was popularized by Amazon
  - Given previous purchases, what would you be likely to buy?
• Cloudera Movies could also use item-based filtering
  - Suggest Stripes after customer adds Fletch to the queue
• Item-based CF usually scales better than user-based
  - Successful companies have more users than products

Chapter Topics (Recommender Overview): Limitations of recommender systems

Limitations
• The "cold start" problem is a limitation of collaborative filtering
  - CF finds recommendations based on actions of similar users
  - So what do you do for a startup?
  - A new service has no users, similar or otherwise!
  - One workaround is to use content-based filtering at first
  - Eventually you'll have enough data for collaborative filtering
  - You can transition via a hybrid approach as you add users
• Performance of sparse matrix operations
  - Cloudera Movies has 14 million customers and 100,000 movies
  - A matrix representation will have 1.4 trillion elements
  - Even active customers have only seen a few hundred movies
  - And they haven't rated all of these

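The size of the full matrix, and the reason a sparse representation helps, can be shown with simple arithmetic. In this sketch the customer and movie counts come from the slide; the nested dictionary, the sample ratings, and the figure of 200 rated movies per active customer are assumptions for illustration.

```python
customers = 14_000_000
movies = 100_000
dense_cells = customers * movies  # one cell per (customer, movie) pair

# Sparse alternative: store only the ratings that actually exist (hypothetical sample)
ratings = {
    "alice": {"Airplane": 1, "Bambi": 4},
    "bob": {"Airplane": 4},
}
stored = sum(len(user_ratings) for user_ratings in ratings.values())

# Assumed for illustration: an active customer rates about 200 movies
filled_fraction = 200 / movies

print(f"dense matrix cells: {dense_cells:,}")
print(f"ratings stored in the sparse sample: {stored}")
print(f"fraction of a row filled for an active customer: {filled_fraction:.1%}")
```

Because almost every cell is empty, storing only the observed ratings keeps memory use proportional to the number of ratings rather than to customers times movies.
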
Limitations (cont'd)
• People aren't very good at rating things
  - You may need to identify and correct for individual biases
  - Observe user behavior instead of asking for ratings
• Individual tastes aren't always predictable
  - One person may love Halloween, Friday the 13th, and Saw
  - Unlike similar users, this person may also love Mary Poppins
  - As always, using more input data will likely produce better results
• A single account may correspond to multiple users
  - Does the account holder like Bambi? Or is it her daughter?

Limitations (cont'd)
• Item-based CF may predict previously-satisfied needs
  - The goal of item-based CF is to identify similar products
  - More helpful with pre-purchase suggestions than post-purchase
  - If I bought a toaster, ads for other toasters aren't helpful
  - But ads for bagels and jam might be helpful
  - Not an issue for some products (like movies or music)

Chapter Topics (Recommender Overview): Fundamental concepts

Input Data
• The recommender accepts preference data as input
  - These preferences represent what users like and dislike
  - Content-based recommenders also use attributes about an item
• Input preferences can be collected in two ways
  - Explicit: we ask users to rate items that they like or dislike
      Netflix star ratings
      TiVo thumbs up ratings
      "How would you rank these items?"
  - Implicit: we observe user behavior to determine their preferences
      Which movies does a customer watch?
      Does the customer move a movie up or down in the queue?
      Does the customer finish the movie?

Evaluating Input
- How does collaborative filtering work?
  - Create a matrix of users and items, populated with preferences
  - For a given user, identify other users with similar tastes
  - Find items new to this user, but rated highly by similar users
[Table: user-by-movie rating matrix (ratings 1 to 5, blank where unrated). Users: Alice, Bob, Chuck, Donna, Eddie, Frank, Gina. Movies: Airplane, Bambi, Caddyshack, Dracula, Eat Pray Love, Friday, Gunsmoke, Hang 'Em High, Iron Man, Jane Eyre, The Karate Kid.]

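The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ratings dictionary is made up (it is not the matrix from the slide), the helper names `similarity` and `recommend` are hypothetical, and similarity is taken to be an inverted Euclidean distance over co-rated movies, which is one of several possible choices.

```python
from math import sqrt

# Hypothetical ratings (user -> {movie: stars}); not the values from the slide.
ratings = {
    "Alice": {"Airplane": 1, "Bambi": 4, "Dracula": 2},
    "Donna": {"Airplane": 1, "Bambi": 5, "Dracula": 2, "Eat Pray Love": 5},
    "Bob":   {"Airplane": 5, "Bambi": 1, "Iron Man": 5},
}

def similarity(a, b):
    """Inverted Euclidean distance over movies both users rated (0.0 if none)."""
    shared = set(ratings[a]) & set(ratings[b])
    if not shared:
        return 0.0
    dist = sqrt(sum((ratings[a][m] - ratings[b][m]) ** 2 for m in shared))
    return 1.0 / (1.0 + dist)

def recommend(user, min_rating=4):
    """Suggest unseen movies that the most similar other user rated highly."""
    others = [u for u in ratings if u != user]
    nearest = max(others, key=lambda u: similarity(user, u))
    return sorted(m for m, r in ratings[nearest].items()
                  if m not in ratings[user] and r >= min_rating)

print(recommend("Alice"))
```

A real recommender would usually weigh several similar users rather than only the single nearest one; the sketch keeps one neighbor to stay short.
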
Evaluating Input (cont'd)
- Donna has preferences similar to Alice
[Table: the same user-by-movie rating matrix as on the previous slide]

Evaluating Input (cont'd)
- Based on this, we could recommend Eat Pray Love to Alice
[Table: the same user-by-movie rating matrix as on the previous slide]

Evaluating Input (cont'd)
- Similarly, we could recommend Jane Eyre to Donna
[Table: the same user-by-movie rating matrix as on the previous slide]

Evaluating Input (cont'd)
- More users mean stronger signals and better recommendations
  - Whose preferences are similar to Bob's?
[Table: the same user-by-movie rating matrix as on the previous slide]

Evaluating Input (cont'd)
- Both Eddie's and Gina's preferences are similar to Bob's
  - Ratings they share produce better recommendations for Bob
[Table: the same user-by-movie rating matrix as on the previous slide]

Evaluating Input (cont'd)
- We could recommend Gunsmoke, Karate Kid, or Iron Man to Bob
  - Highest confidence about Iron Man, based on stronger signal
[Table: the same user-by-movie rating matrix as on the previous slide]

Basic Similarity Metrics
- It's easy for humans to see similarities between users
  - But how can a computer find these similarities?
  - More importantly, how can we measure them?
- There are many similarity metrics
  - We'll briefly cover two now, and discuss several in depth later
- Choosing one involves several factors, including
  - The type of preference data available
  - Performance at scale
- They work by comparing vectors of data
  - The elements could be users or items
  - You need to calculate metrics for every pair

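Computing a metric "for every pair" can be sketched as below. The helper `pairwise`, the placeholder metric `overlap`, and the sample data are all hypothetical; any similarity function that takes two vectors could be passed in instead. Note that n users produce n(n-1)/2 pairs, which is why performance at scale matters.

```python
from itertools import combinations

def pairwise(vectors, metric):
    """Apply metric to every unordered pair of named vectors."""
    return {(a, b): metric(vectors[a], vectors[b])
            for a, b in combinations(sorted(vectors), 2)}

# Hypothetical binary data: the set of movies each user watched.
watched = {
    "Alice": {"Bambi", "Eat Pray Love"},
    "Frank": {"Airplane", "Bambi"},
    "Gina": {"Airplane"},
}

# Placeholder metric for this sketch: number of movies in common.
def overlap(set_a, set_b):
    return len(set_a & set_b)

scores = pairwise(watched, overlap)
print(scores)
```
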
Tanimoto Coefficient
- Tanimoto coefficient is applicable when you have binary (boolean) data
  - Did the customer watch a given movie or not?
  - Did the customer finish this movie or not?
- Also known as the Jaccard coefficient, Tanimoto compares two sets
  - Based on the ratio of the intersection (common items) to the union (all items)
  - Tanimoto(A, B) = |A ∩ B| / |A ∪ B|

Tanimoto Coefficient (cont'd)
- The Tanimoto coefficient is easy to compute in Python

    def tanimoto(set_a, set_b):
        # Size of the intersection (items both sets have in common)
        len_i = len(set_a.intersection(set_b))
        # Size of the union, by inclusion-exclusion
        len_u = len(set_a) + len(set_b) - len_i
        # Two empty sets give no basis for comparison; treat as 0.0
        if len_u == 0:
            return 0.0
        return float(len_i) / len_u

- The value ranges between 0.0 and 1.0
  - A value of 1.0 indicates both sets exactly match one another
  - Value moves towards 0.0 as number of common items decreases

Tanimoto Coefficient (cont'd)
- Consider the following input
  - An X in the matrix below indicates the customer watched the movie
[Table: viewing matrix (X = watched) for Alice, Frank, and Gina across Airplane, Bambi, Caddyshack, Eat Pray Love, Gunsmoke, and Hang 'Em High]
- Frank and Gina share similar taste (value = 0.8)
- But Alice and Gina don't (value = 0.0)

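The two values can be checked with a short script. The sets below are an assumption: the slide's matrix does not survive in this text, so this is one assignment of movies to users that is consistent with the stated results (Frank and Gina share four of five movies; Alice and Gina share none).

```python
def tanimoto(set_a, set_b):
    """Size of the intersection divided by size of the union."""
    len_i = len(set_a & set_b)
    len_u = len(set_a | set_b)
    return float(len_i) / len_u if len_u else 0.0

# Assumed viewing sets, chosen to match the values quoted on the slide.
alice = {"Bambi", "Eat Pray Love"}
frank = {"Airplane", "Bambi", "Caddyshack", "Gunsmoke", "Hang 'Em High"}
gina = {"Airplane", "Caddyshack", "Gunsmoke", "Hang 'Em High"}

print(tanimoto(frank, gina))  # 4 shared out of 5 total
print(tanimoto(alice, gina))  # nothing shared
```
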
Euclidean Distance
- Euclidean distance is a measure of similarity for numeric data
  - How many stars did the customer give this movie?
  - How many times did the customer watch this movie?
- Effectively the same as plotting it and measuring with a ruler
[Figure: Alice and Bob plotted as points by their ratings of Robocop and Pulp Fiction, on axes from 1 to 5]

Euclidean Distance (cont'd)
- Euclidean distance is also easy to calculate in Python
  - Simple calculation based on parallel elements from each list

    from math import sqrt

    def euclidean(list_a, list_b):
        dist = 0.0
        # Walk both rating lists in parallel, summing squared differences
        for i in range(len(list_a)):
            rate_a = list_a[i]
            rate_b = list_b[i]
            dist = dist + pow((rate_a - rate_b), 2)
        return sqrt(dist)

- A lower number indicates a stronger similarity
  - Though this is often inverted to provide a value in the 0.0 to 1.0 range

Euclidean Distance (cont'd)
- Consider the following input
  - Each element in the matrix below is the user's rating of a movie

                   Alice  Frank  Gina
    Airplane         1      4      5
    Bambi            4      2      1
    Caddyshack       2      4      5
    Eat Pray Love    5      1      1
    Gunsmoke         1      5      5
    Hang 'Em High    1      4      5

- Frank and Gina's preferences are close (distance of 2.0)
  - Alice and Gina's preferences aren't (distance of 9.05)

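The distances quoted above can be reproduced from the rating matrix. The `to_similarity` helper is an assumption added for illustration: 1 / (1 + distance) is one common way to invert a distance into the 0.0 to 1.0 range, not necessarily the one used in the course exercise.

```python
from math import sqrt

def euclidean(list_a, list_b):
    """Square root of the summed squared differences of parallel elements."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(list_a, list_b)))

def to_similarity(distance):
    """Assumed inversion: 1.0 at distance 0, approaching 0.0 as distance grows."""
    return 1.0 / (1.0 + distance)

# Ratings in row order: Airplane, Bambi, Caddyshack, Eat Pray Love,
# Gunsmoke, Hang 'Em High
alice = [1, 4, 2, 5, 1, 1]
frank = [4, 2, 4, 1, 5, 4]
gina = [5, 1, 5, 1, 5, 5]

print(euclidean(frank, gina))                 # small distance: similar taste
print(round(euclidean(alice, gina), 3))       # large distance: different taste
print(round(to_similarity(euclidean(frank, gina)), 3))
```
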
Recommender Output
- Quick recap of how a user-based recommender works
  - Takes preference data as input
  - Finds similar users based on similarity metrics
- What does a recommender produce as output?
  - A list of items along with the predicted ratings for each
- What do we do with this output?
  - Remove items known to be of little value
  - Sort remaining items in descending order of predicted rating
  - Present this to the user in the application
[Figure: example message shown to the user: "Hi, Bob. We saw how much you liked The Godfather. We think you'll also enjoy Goodfellas and Casino."]

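The post-processing steps (remove, sort, present) might look like the following sketch. The function name `present`, the predicted ratings, and the exclusion list are all hypothetical and only illustrate the order of operations.

```python
def present(predictions, excluded, top_n=3):
    """Drop excluded items, then return the top_n items by predicted rating."""
    kept = [(item, score) for item, score in predictions.items()
            if item not in excluded]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in kept[:top_n]]

# Hypothetical predicted ratings for Bob, and items we choose not to show.
predicted = {"Goodfellas": 4.8, "Casino": 4.5,
             "Out-of-stock title": 4.9, "Bambi": 2.1}
excluded = {"Out-of-stock title"}

print(present(predicted, excluded, top_n=2))
```
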
Review Questions
- What are some ways in which Cloudera Movies might gather implicit preferences from our customers?
- How might Cloudera Movies learn which movie attributes are important to a customer?

Hands-On Exercise: Basic Recommender
- In this Hands-On Exercise, you will build a simple recommender system in Python using the techniques you've just learned
- Please refer to the Hands-On Exercise Manual for instructions on exercise #5

Essential Points
- Recommenders are filtering systems
- Content-based recommenders consider item attributes
- Collaborative filters consider actions of other users
- Preferences can be collected implicitly or explicitly
- Similarity metrics are chosen, in part, based on data type

Recommender Overview
In this chapter you have learned
- What the difference is between content-based and collaborative filtering recommender systems
- Which limitations recommender systems frequently encounter
- How collaborative filtering can identify similar users and items
- How the Tanimoto and Euclidean distance similarity metrics work

Bibliography
The following offer more information on topics discussed in this chapter
- Recommender Systems: An Introduction
  http://tiny.cloudera.com/dscc10a
- Amazon's Original Item-Item Recommendations Paper
  http://tiny.cloudera.com/dscc10b

Introduction to Apache Mahout
Chapter 11

Introduction to Apache Mahout
In this chapter you will learn
- What Apache Mahout is
- How it was developed
- What's required to run Apache Mahout
- How you can use Mahout for collaborative filtering

Chapter Topics
Introduction to Apache Mahout
- What Apache Mahout is (and is not)
- Brief history
- Availability and installation
- Review questions
- Optional Demo: Using Mahout's Item-Based Recommender
- Essential points
- Conclusion

What is Apache Mahout?
- Apache Mahout is an open source machine learning library
  - Focuses on real-world use cases, not academic ones
  - Many of its algorithms can use Hadoop for improved scalability
- "Mahout" is derived from the Hindi word for "elephant driver"
  - There's some debate about pronunciation
- Mahout has extensive support for collaborative filtering
  - Handles both user-based and item-based approaches
  - Includes Tanimoto, Euclidean, and many more similarity metrics

What is Apache Mahout? (cont'd)
- Mahout is a collection of algorithms
  - Mainly focused on "The Three Cs" of machine learning
- It also provides some helpful utility classes
  - Format conversion and basic visualization tools
- Mahout is implemented in Java
  - You can extend it by writing Java code that uses its API
  - But you don't have to be a Java programmer to use it
  - You can invoke it from the command line (we'll see how)
  - Could also make it accessible via a Web service

What Apache Mahout Is Not
- A turnkey solution for content-based recommenders
  - Though you can use Mahout to help you build one
- A cohesive system
  - Mahout is an umbrella project with many subcomponents
  - These were contributed by different people over time
  - Developers work independently on their preferred approaches
  - Documentation isn't always consistent
- Always user-friendly
  - It's meant for developers, not end users
  - Can be finicky (leftover temp files may cause jobs to fail)

Brief History of Apache Mahout
- Sean Owen started the Taste CF project in 2005
- Separately, Lucene developers were inspired by a 2006 paper
  - "Map-Reduce for Machine Learning on Multicore"
- Mahout created as a sub-project of Lucene in January 2008
  - Taste and Mahout merged in April 2008
  - Mahout became a top-level Apache project in 2010
[Figure: timeline from 2005 to 2010: Taste project created (2005), Multicore ML paper (2006), Lucene subproject and Taste donated (2008), Mahout graduates to top-level project (2010)]

Prerequisites
- Like Hadoop itself, Mahout minimally requires two things
  - A Unix-like operating system (Linux is most common)
  - Java virtual machine, version 1.6
- You almost certainly want to use Hadoop as well
  - Strictly speaking, it's not a requirement
  - Many algorithms can run in parallel with MapReduce
  - Version of Hadoop needed varies by Mahout version

Availability
- Mahout is available via its Web site: http://mahout.apache.org/
  - Need Maven and Ant to build from source
- It's also part of Cloudera's Distribution including Apache Hadoop (CDH)
  - CDH is both free and open source
  - We fix production issues and contribute our patches to Apache

Installation
- To install the binary version downloaded from Apache's site
  - Verify checksum
  - Unpack archive
  - Move newly-created directory to desired location
  - Add its bin subdirectory to the PATH environment variable
  - Make sure Hadoop's bin subdirectory is also in your $PATH
- Make sure the JAVA_HOME environment variable is set
- Your virtual machine already contains Mahout as part of CDH
  - Installation took only one command

    $ sudo yum install mahout

Local Mode
- Mahout will attempt to distribute its workload with Hadoop, if possible
- For this to work, two things must be true
  - Hadoop must be installed and properly configured
  - Hadoop's bin subdirectory must be in your $PATH
- If Hadoop is not available, Mahout will run in local mode
- You can also use local mode even when Hadoop is available
  - Set the MAHOUT_LOCAL environment variable to any value

    $ export MAHOUT_LOCAL=true
    $ mahout myoptions ... # run Mahout as desired

Review Questions
- Which of "The Three Cs" of machine learning does Mahout support?
- What does Mahout's early history have in common with Hadoop's?

Optional Demo: Overview
- Input is a comma-delimited list of user IDs, movie IDs, and ratings
- How to run the item-based recommender from the command line

    $ mahout recommenditembased \
        --input /clouderamovies/demoratings.csv \
        --tempDir /tmp/mahoutdemo \
        --similarityClassname SIMILARITY_EUCLIDEAN_DISTANCE \
        --output /clouderamovies/demoresults

- Outputs a list of movies Mahout recommends for each user
  - Format: user_id [movieid:predicted_rating, ...]
- Time permitting, your instructor will now demonstrate this

Essential Points
- Apache Mahout is an open source machine learning library
  - It's really a collection of various implementations and utilities
- Mahout can use Hadoop for better performance and scalability
  - Though not all of Mahout's algorithms currently support this
- Mahout supports collaborative filtering through Taste
- Using Mahout doesn't necessarily require writing any code

Introduction to Apache Mahout
In this chapter you have learned
- What Apache Mahout is
- How it was developed
- What's required to run Apache Mahout
- How you can use Mahout for collaborative filtering

Bibliography
The following offer more information on topics discussed in this chapter
- The Apache Mahout Web site
  http://tiny.cloudera.com/dscc11a
- Mahout in Action
  http://tiny.cloudera.com/dscc11b

Implementing Recommenders with Apache Mahout
Chapter 12

Implementing Recommenders with Apache Mahout
In this chapter you will learn
- How several common similarity metrics work
- How data type, magnitude, and rating bias can influence your choice of a similarity metric
- What scoring is in the context of recommender systems
- How modifying your scoring factors can make your recommender more aligned with business interests

Chapter Topics
Implementing Recommenders with Apache Mahout
- Overview
- Similarity metrics for binary preferences
- Similarity metrics for numeric preferences
- Scoring
- Review questions
- Hands-On Exercise: Comparison of Similarity Metrics
- Essential points
- Conclusion

Generating Recommendations with Mahout
- Two main ways to generate recommendations with Mahout
  - Writing Java code that invokes Mahout's APIs
  - Using the command-line recommender
- We're going to focus on the latter approach
- The mahout command has an item-based recommender
  - The following is a basic example of how to invoke it
  - If Mahout is part of a Hadoop cluster, all file paths are in HDFS

    $ mahout recommenditembased \
        --input /clouderamovies/ratings.csv \
        --similarityClassname SIMILARITY_EUCLIDEAN_DISTANCE \
        --output /clouderamovies/myrecs

Mahout's Command-Line Recommender Input
- The input is a comma-delimited file of preference data
  - User ID, item ID, and (optionally) that user's rating of the item
  - Binary preferences will not have an associated rating value

    $ head -n5 ratings.csv
    1,168,5
    1,172,3
    1,165,1
    1,156,4
    1,196,2

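For local experiments it can help to load this format into memory. The sketch below is not part of Mahout; `load_preferences` is a hypothetical helper, and storing 1.0 for binary rows that have no rating is an assumption of this example.

```python
import csv
import io

def load_preferences(lines):
    """Build {user_id: {item_id: rating}}; rows without a rating get 1.0."""
    prefs = {}
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        user_id, item_id = int(row[0]), int(row[1])
        rating = float(row[2]) if len(row) > 2 else 1.0
        prefs.setdefault(user_id, {})[item_id] = rating
    return prefs

# The five sample rows shown above, supplied as an in-memory file.
sample = io.StringIO("1,168,5\n1,172,3\n1,165,1\n1,156,4\n1,196,2\n")
prefs = load_preferences(sample)
print(prefs[1][168])
```
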
Mahout's Command-Line Recommender Output
- The result of your Mahout job will be a text file
  - As with input, it will generally be in HDFS
- Each line in this file represents recommendations for a given user
  - First field is the user ID
  - Second field is an ordered list of recommended items
  - Each element is a tuple of item ID and predicted rating

    $ hadoop fs -getmerge /clouderamovies/myrecs results.txt
    $ head -n3 results.txt
    1    [399:5.0,251:4.8,159:4.3,217:4.2,356:4.1]
    2    [387:4.9,332:4.7,249:4.6,282:4.5,297:4.3]
    3    [345:5.0,236:4.9,253:4.7,394:4.4,427:4.3]

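A results line in this format can be parsed with a few lines of Python. The helper name `parse_recommendations` is hypothetical, and the sketch assumes the user ID and the bracketed list are separated by whitespace, as in the sample above.

```python
def parse_recommendations(line):
    """Parse 'user_id [item:rating,...]' into (user_id, [(item_id, rating), ...])."""
    user_field, items_field = line.split(None, 1)
    items = []
    for pair in items_field.strip().strip("[]").split(","):
        if not pair:
            continue  # an empty list "[]" yields no recommendations
        item_id, rating = pair.split(":")
        items.append((int(item_id), float(rating)))
    return int(user_field), items

user, recs = parse_recommendations("1\t[399:5.0,251:4.8,159:4.3,217:4.2,356:4.1]")
print(user, recs[0])
```
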
Limiting Results
- Collaborative filtering can involve processing a lot of data
  - Consequently, Mahout jobs can take a long time to finish
- What if you only want to consider certain users or items?
  - You could edit the input data to exclude everything else
- A better option is to use Mahout's filtering support
  - Can limit by user, by item, or both
  - File is simply a list of IDs, one per line

    $ mahout recommenditembased \
        --input /clouderamovies/ratings.csv \
        --usersFile /clouderamovies/userlist.txt \
        --itemsFile /clouderamovies/itemlist.txt \
        --similarityClassname SIMILARITY_EUCLIDEAN_DISTANCE \
        --output /clouderamovies/myrecs

Similarity Metrics on the Command Line
- Mahout supports a number of similarity metrics
  - Which one you should select depends on your data
  - An important choice that affects accuracy and performance
- Which of these to use is specified on the command line
  - Run the mahout recommenditembased command to see a list
  - We'll now discuss several of the most common ones

    $ mahout recommenditembased \
        --input /clouderamovies/ratings.csv \
        --usersFile /clouderamovies/userlist.txt \
        --itemsFile /clouderamovies/itemlist.txt \
        --similarityClassname SIMILARITY_EUCLIDEAN_DISTANCE \
        --output /clouderamovies/myrecs

Binary Preference Data
- The simplest type of preference data is binary
  - There are no ratings at all
- Implicit feedback often results in binary preference data
  - Whether or not a customer bought a product
  - Whether or not someone watched a movie
  - Whether or not a user has a given connection in a social network
  - Whether or not a customer shops at a particular store

Representing Binary Preference Data
- The input passed to Mahout is a CSV file of user and item IDs

    $ head -n5 binaryprefs.csv
    1,217
    1,318
    1,262
    2,347
    2,294

Tanimoto Coefficient
- The Tanimoto coefficient is based on the intersection of two sets
  - It measures the proportion of shared elements to total elements
  - A value of 0.0 indicates that no elements are shared
  - A value of 1.0 indicates that all elements are shared
- Example of how to run Mahout with Tanimoto similarity
  - Always use the booleanData flag with binary preferences

    $ mahout recommenditembased \
        --input /clouderamovies/binaryprefs.csv \
        --similarityClassname SIMILARITY_TANIMOTO_COEFFICIENT \
        --booleanData \
        --output /clouderamovies/myrecs

Limitations of Tanimoto Coefficient
- Tanimoto attempts to measure what users have in common
- For example, Alice and Bob have each seen five movies
  - Alice has seen three that Bob hasn't
  - Bob has seen three that Alice hasn't
  - They have two movies in common
- It seems Alice and Bob have slightly similar taste in movies

Limitations of Tanimoto Coefficient (cont'd)
- Let's look at Alice and Bob's movies more closely

    Alice Only      Shared                Bob Only
    Beaches         Forrest Gump          The Texas Chainsaw Massacre
    Ghost           Back to the Future    A Nightmare on Elm Street
    Pretty Woman                          Saw

- Do Alice and Bob really have much in common?
  - It's more likely that this overlap is coincidental
- How can we improve our ratings in situations like this?

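As a quick check of what Tanimoto reports for this pair, the snippet below builds both sets from the table above. It shows that the metric sees only the size of the overlap (2 of 8 distinct movies), not how popular or distinctive the shared titles are.

```python
# Movie sets taken from the Alice Only / Shared / Bob Only columns.
alice = {"Beaches", "Ghost", "Pretty Woman",
         "Forrest Gump", "Back to the Future"}
bob = {"The Texas Chainsaw Massacre", "A Nightmare on Elm Street", "Saw",
       "Forrest Gump", "Back to the Future"}

shared = alice & bob
# Tanimoto: shared movies divided by all distinct movies either has seen.
tanimoto = len(shared) / float(len(alice | bob))
print(sorted(shared), tanimoto)
```
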
Log Likelihood
- The log likelihood similarity metric handles this situation
  - It takes the statistical likelihood of coincidence into account
- Log likelihood analyzes four key values

                                 How many watched X?            How many did not watch X?
    How many watched Y?          # who watched both X and Y     # who watched Y but not X
    How many did not watch Y?    # who watched X but not Y      # who watched neither X nor Y

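One way to turn those four counts into a score is the log-likelihood ratio, sketched here in its entropy form. The function names and the sample counts are illustrative assumptions, and Mahout's own implementation may differ in details such as how the ratio is scaled into a similarity value.

```python
from math import log

def x_log_x(x):
    """x * ln(x), defined as 0.0 when x is 0."""
    return 0.0 if x == 0 else x * log(x)

def entropy(*counts):
    """Unnormalized entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def log_likelihood_ratio(both, y_only, x_only, neither):
    """Score a 2x2 table of counts; near 0.0 means X and Y look independent."""
    row = entropy(both + y_only, x_only + neither)
    col = entropy(both + x_only, y_only + neither)
    matrix = entropy(both, y_only, x_only, neither)
    return max(0.0, 2.0 * (row + col - matrix))

# Invented counts: an overlap no larger than chance, then a strong association.
print(log_likelihood_ratio(10, 10, 10, 10))
print(log_likelihood_ratio(10, 0, 0, 10))
```

A higher value means the co-occurrence is less likely to be a coincidence, which is the property the slide describes.
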
Log Likelihood (cont'd)
- Alice and Bob will have low similarity with log likelihood
  - The fact that they saw the same two movies is inconsequential
  - Those are popular movies: everyone else saw them too
[Table: viewing matrix (X = watched) for Alice, Bob, Chuck, Donna, and Eddie across Beaches, Ghost, Pretty Woman, Forrest Gump, Back to the Future, A Nightmare on Elm Street, Texas Chainsaw Massacre, and Saw. Forrest Gump and Back to the Future were watched by all five users.]

Log Likelihood (cont'd)
- On the other hand, Alice and Donna will have a high score
  - Their similarities are less common and more meaningful
[Table: the same viewing matrix as on the previous slide]

Log Likelihood (cont'd)
- Log likelihood is usually more accurate than Tanimoto, given the same data
  - Especially when there's a lot of input data to analyze
- Example of how to run Mahout with log likelihood similarity

    $ mahout recommenditembased \
        --input /clouderamovies/binaryprefs.csv \
        --similarityClassname SIMILARITY_LOGLIKELIHOOD \
        --booleanData \
        --output /clouderamovies/myrecs

Numeric Preference Data
- Numeric preferences convey signal strength
  - How much did the user like (or dislike) an item?
- This is typically based on explicit feedback
  - How did the customer rate the movie on a scale of 1 to 5?
- It can also be based on implicit feedback
  - How many times did the customer watch a given movie?
  - How much did the customer spend on related merchandise?

Representing Numeric Preference Data
- The input is a comma-delimited file of preference data
  - User ID, item ID, and that user's rating of the item

    $ head -n3 example01.csv
    1,168,5
    1,172,3
    1,165,1

- The format and scale of these rating values may vary
  - Ratings may be formatted as integer or decimal values
  - Need not be limited to the range of 1-5

    $ head -n3 example02.csv
    1,168,55.36
    1,172,37.41
    1,165,19.38

Euclidean Distance
- Euclidean distance is a simple metric for numeric data
  - The same as you'd measure with a ruler and chart
[Figure: Alice and Bob plotted as points by their ratings of Robocop and Pulp Fiction, on axes from 1 to 5]

Euclidean Distance (cont'd)
- Example of how to run Mahout with Euclidean distance similarity

    $ mahout recommenditembased \
        --input /clouderamovies/ratings.csv \
        --similarityClassname SIMILARITY_EUCLIDEAN_DISTANCE \
        --output /clouderamovies/myrecs

The"Importance"of"Magnitude"
!Lets$imagine$that$were$going$to$track$the$number$of$+mes$each$
customer$has$watched$a$movie$with$a$given$actor$
!Note$that$the$scale$is$dierent$in$the$second$plot$
Values"are"proporAonal,"but"Euclidean"distance"diers"greatly"
Euclidean distance = 3.0
JFK
5
Lincoln
50
Bob
40
30
20
Alice
Bob
Alice
10
Star Trek
Tron
10
20
30
40
50
"Copyright"2010/2012"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent."

Cosine-Based Similarity Metrics
• Cosine-based metrics are based on angles rather than distance
  - The points being measured form a triangle with the origin
  - This approach discounts magnitude when determining similarity
  - We'll discuss two types of cosine-based metrics
[Figure: two plots of Alice and Bob, one with axes scaled 1 to 5 and one scaled 10 to 50; labels include JFK, Lincoln, Star Trek, and Tron]

Cosine Similarity
• The more basic of these is called cosine similarity in Mahout
  - But would more accurately be called uncentered cosine
  - Points are always triangulated to the origin, as we just saw
• Example of how to run Mahout with cosine similarity
$ mahout recommenditembased \
    --input /clouderamovies/ratings.csv \
    --similarityClassname SIMILARITY_COSINE \
    --output /clouderamovies/myrecs
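
To make the point about magnitude concrete, the following sketch compares Euclidean distance with uncentered cosine similarity on the same data at two scales. The viewing counts and function names are hypothetical, and the code illustrates the arithmetic only; it is not how Mahout computes these values internally.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Uncentered cosine: cosine of the angle between a and b, measured at the origin."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical viewing counts for two movies, then the same counts multiplied by 10.
alice, bob = [1, 4], [4, 1]
alice_x10 = [v * 10 for v in alice]
bob_x10 = [v * 10 for v in bob]

# The distance grows tenfold with the scale; the cosine does not change.
print(euclidean_distance(alice, bob), euclidean_distance(alice_x10, bob_x10))
print(cosine_similarity(alice, bob), cosine_similarity(alice_x10, bob_x10))
```

Because the cosine depends only on the angle at the origin, proportional values produce the same similarity regardless of scale.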
"Copyright"2010/2012"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent."

Distribution of Ratings
• As we previously discussed, ratings are not evenly distributed
[Figure: histogram titled "Distribution of Movie Ratings"; vertical axis from 5,000 to 40,000]

Bias in Ratings
• Ratings are also not consistent across users
  - Alice says, "Jaws was a good movie so I will give it a 2"
  - She also says, "I didn't like Star Wars, so I'll give it a 1"
  - Donna assigns 5 to movies she likes and 4 to ones she doesn't
[Figure: chart of Alice's and Donna's ratings (1 to 5) for Jaws, Star Wars, Traffic, and The Abyss]

Bias in Ratings (cont'd)
• Bias can cause misleading comparisons
• We see that Bob and Alice (who tend to rate low) are similar
  - But we might miss the similarity between Alice and Donna
[Figure: plot of Alice, Bob, Chuck, and Donna by their ratings of Jaws and Star Wars]

Bias and Uncentered Cosine Similarity
• Uncentered cosine similarity does not handle bias well
  - Distance from the origin affects the angle
  - This makes accurate comparisons difficult
[Figure: plot of Alice, Bob, Chuck, and Donna by their ratings of Jaws and Star Wars]

Overcoming Bias in Ratings
• A centered cosine similarity metric can overcome this problem
• Instead of triangulating to the origin
  - Calculate the mean of the values being compared
  - Plot this mean value and then base the triangulation around it
[Figure: plot of Alice, Bob, Chuck, and Donna by their ratings of Jaws and Star Wars]

Overcoming Bias in Ratings (cont'd)
• This allows more accurate comparisons despite bias
  - We now see that Alice and Donna are similar
  - It's also clear that Bob and Chuck are similar
[Figure: two plots of Alice, Bob, Chuck, and Donna by their ratings of Jaws and Star Wars]

Pearson Correlation
• Another cosine-based similarity metric supported by Mahout
  - Pearson correlation centers about the mean to overcome bias
• Pearson correlation isn't ideal when many ratings overlap
  - It does not consider the number of shared items
  - There's no standard deviation when both ratings are equal
• Example of how to run Mahout with Pearson correlation similarity
$ mahout recommenditembased \
    --input /clouderamovies/ratings.csv \
    --similarityClassname SIMILARITY_PEARSON_CORRELATION \
    --output /clouderamovies/myrecs
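
As a rough illustration of centering, the sketch below subtracts each vector's mean before taking the cosine, which is one common way to write Pearson correlation. The ratings and the function name are hypothetical, and Mahout's own implementation may differ in detail.

```python
import math

def pearson_correlation(a, b):
    """Cosine of the angle between a and b after subtracting each vector's mean."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in da)) * math.sqrt(sum(y * y for y in db))
    if denom == 0:
        return None  # undefined: at least one vector has no variation
    return sum(x * y for x, y in zip(da, db)) / denom

# Hypothetical ratings for the same four movies.
alice = [2, 1, 2, 1]  # rates low: 2 for movies she likes, 1 for ones she doesn't
donna = [5, 4, 5, 4]  # rates high: 5 for movies she likes, 4 for ones she doesn't
print(pearson_correlation(alice, donna))         # 1.0
print(pearson_correlation(alice, [3, 3, 3, 3]))  # None
```

The second call shows the limitation noted above: when every rating in a vector is equal there is no deviation from the mean, so the correlation cannot be computed.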
"Copyright"2010/2012"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent."

Advice on Choosing a Similarity Metric
• Which similarity metric should you choose?
  - It largely depends on your preference data
• If you have binary data, log likelihood is typically the best choice
• If you have numeric data
  - And magnitude is important, consider Euclidean distance
  - And you need to overcome ratings bias, try Pearson correlation
  - And you have many overlapping values, use uncentered cosine
• It's usually best to experiment and compare several such metrics
  - We'll do exactly this during the hands-on exercise

What is Scoring?
• We have thus far focused on a single piece of explicit feedback
  - How a given user rated a given movie
• However, the term "rating" is not synonymous with "input"
  - Recommenders often combine many types of feedback
• Different types of feedback are signals of preference
  - For example, a user may rate a movie after watching it
  - She might instead post a message to a social media site about it
  - She might both rate the movie and post a message about it
• Scoring refers to the value we place on these signals
  - It's how important we consider them relative to each other

Considerations for Scoring
• Scoring is an essential part of the value a data scientist provides
  - It's often more important than algorithm or similarity metric
• Scoring is an ongoing consideration for recommender systems
  - It's the result of constant experimentation
  - Must react to changing conditions and new opportunities
• It almost always requires knowledge of the domain
  - We must know what's relevant to the business and the customer
• Scoring effectively can have a profound effect on the bottom line
  - It's often a balance between relevance and profits
  - Users want relevant suggestions
  - Business wants higher revenue and lower costs

Examples of Potential Scoring Criteria
• Scoring often makes extensive use of implicit feedback
• It may also consider revenue, costs, and other business factors
  - Selling price
  - Wholesale cost
  - Amount of item currently in inventory
• Cloudera Movies might score based on the following criteria
% of score   Description of criteria
42%          How customer rated similar movies
28%          Cost of royalty payment required to show this movie
17%          Whether the user searched for this movie by name
8%           How long this movie has been in the customer's queue
5%           Popularity of this movie among others in customer's region

Integrating Scores into Mahout
• Recall the format of the input we provide to Mahout
$ head -n5 ratings.csv
1,168,5
1,172,3
1,165,1
1,156,4
1,196,2
• How do we incorporate scoring into this data?
  - Simply write code that incorporates all your scoring logic
  - Your program will create output similar to what's seen above
  - Your program's output will be the input to Mahout
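
One possible shape for such a program is sketched below. The weights mirror the Cloudera Movies example percentages; everything else is an assumption made for illustration: the signal names, the idea that each signal has already been normalized to the range 0 to 1 with higher meaning more favorable (so a cheap royalty would score near 1), and the sample rows.

```python
import csv
import io

# Weights from the example criteria table; the signal names are hypothetical.
WEIGHTS = {
    "similar_rating": 0.42,
    "royalty_cost": 0.28,
    "searched_by_name": 0.17,
    "queue_time": 0.08,
    "regional_popularity": 0.05,
}

def score(signals):
    """Weighted sum of signals, each assumed to be pre-normalized to 0..1."""
    return sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)

def write_preferences(rows, out):
    """Write 'user ID,item ID,score' lines in the comma-delimited input format."""
    writer = csv.writer(out, lineterminator="\n")
    for user_id, item_id, signals in rows:
        writer.writerow([user_id, item_id, round(score(signals), 4)])

# Hypothetical signals for one user and two movies.
rows = [
    (1, 168, {"similar_rating": 1.0, "royalty_cost": 0.5, "searched_by_name": 1.0}),
    (1, 172, {"similar_rating": 0.5, "regional_popularity": 1.0}),
]
buf = io.StringIO()
write_preferences(rows, buf)
print(buf.getvalue(), end="")
# 1,168,0.73
# 1,172,0.26
```

In practice the output would be written to a file and passed to Mahout through `--input`. Changing the weights is then a matter of editing one table and regenerating the file.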
"Copyright"2010/2012"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent."

Review Questions
• Which metrics would you consider for numeric preferences?
  - Which of those we discussed accounts for ratings bias?
• What scoring criteria do you think would be important for a recommender used by a rental car agency to suggest vehicles? How about a recommender used by an online jeweler?

Hands-On Exercise: Similarity Metrics
• In this Hands-On Exercise, you will experiment with different similarity metrics and observe the results they produce
• Please refer to the Hands-On Exercise Manual for instructions on exercise #6

Essential Points
• Mahout supports several similarity metrics
• Each similarity metric has limitations. The values of your preference data generally dictate which metric is the best choice.
• We indicate the relative importance of various criteria of our recommender input through scoring
• Scoring is an important function which requires domain knowledge and can yield significant business value

Implementing Recommenders with Mahout
In this chapter you have learned
• How several common similarity metrics work
• How data type, magnitude, and rating bias can influence your choice of a similarity metric
• What scoring is in the context of recommender systems
• How modifying your scoring factors can make your recommender more aligned with business interests

Experimentation and Evaluation
Chapter 13

Experimentation and Evaluation
In this chapter you will learn
• How testing can lead to iterative improvements that continually benefit the organizations that conduct them
• Why user interface design is an important component of building and deploying a recommender system
• How to determine if your recommender is effective
• What considerations are involved in experiment design

Chapter Topics
Experimentation and Evaluation
• Measuring recommender effectiveness
• Designing effective experiments
• Conducting an effective experiment
• User interfaces for recommenders
• Review questions
• Hands-On Exercise: Improving Recommender Accuracy
• Essential points
• Conclusion

Defining Recommender Effectiveness
• You now know how to build a recommender system
  - But how do you know if the recommendations are effective?
• To answer this, you must first define "effective"
  - Subjective terms can be difficult to measure
  - We must first determine the accuracy of our recommendations
  - We will then assess their quality

Calculating the Error
• One way to measure accuracy is to compare our predicted ratings to the actual ratings given by the customer
  - The difference between these two values is the error
[Figure: rating scale from 1 to 5 marking the actual rating, the predicted rating, and the error between them]

Training Sets and Test Sets
• To calculate the error, we need actual and predicted ratings
  - How do we get a user's ratings for items he or she hasn't seen?
• We need to simulate this using a training set
  - Divide your recommender input (randomly) into two sets
  - The training set is used to generate recommendations
  - The test set is withheld from the recommender
[Figure: diagram in which All Available Ratings are divided into a Training Set and a Test Set; the Training Set feeds the Recommender]
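
A minimal sketch of the random split, assuming the ratings are already loaded as (user ID, item ID, rating) tuples. The function name, the 20% test fraction, and the fixed seed are illustrative choices, not values given in the course.

```python
import random

def split_ratings(ratings, test_fraction=0.2, seed=42):
    """Randomly split (user, item, rating) tuples into a training set and a test set."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

ratings = [(1, 168, 5), (1, 172, 3), (1, 165, 1), (1, 156, 4), (1, 196, 2)]
training_set, test_set = split_ratings(ratings)
print(len(training_set), len(test_set))  # 4 1
```

Only the training set would be given to the recommender; the test set is held back for comparison with its predictions.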
"Copyright"2010/2012"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent."

Prediction and Comparison
• We now ask our recommender to predict the rating for an item that we know exists only in the test set
  - We then simply compare this rating to the actual rating found in the test set to determine the error amount
• This process is repeated for each item in the test set

RMSE
• RMSE is a common evaluation metric based on this difference
  - It stands for Root Mean Squared Error
• The formula for RMSE is simple
  - For each item, find the error between predicted and actual ratings
  - Square each error
  - Calculate the mean (average) of all squared errors
  - The RMSE is the square root of this mean value
• Lower numbers indicate more accurate predictions
         Actual Ratings   Predicted Ratings   RMSE
Case A   [2, 4, 3, 1]     [1, 4, 5, 2]        1.225
Case B   [2, 4, 3, 1]     [5, 2, 3, 4]        2.345
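
The four steps above translate directly into code. The sketch below uses a hypothetical function name and reproduces the two cases from the table.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length lists of ratings."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    squared_errors = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

print(round(rmse([2, 4, 3, 1], [1, 4, 5, 2]), 3))  # Case A: 1.225
print(round(rmse([2, 4, 3, 1], [5, 2, 3, 4]), 3))  # Case B: 2.345
```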
"Copyright"2010/2012"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent."

Limitations of RMSE
• RMSE is widely used for evaluating recommender accuracy
  - It was used as the metric for the Netflix Prize
• However, it has some important limitations
  - It's not appropriate for binary (boolean) preference data
  - It doesn't consider the order of our recommendations
  - It's sensitive to outliers: a single recommendation that is off by two negatively affects RMSE more than two that are off by one
         Actual Ratings   Predicted Ratings   Sum of Errors   RMSE
Case A   [5, 4]           [3, 4]              2               1.414
Case B   [3, 4]           [2, 3]              2               1.000
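
The outlier effect in the table can be checked with a short sketch. The helper names are hypothetical; the two cases are the ones shown above.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length lists of ratings."""
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def total_absolute_error(actual, predicted):
    """Sum of the absolute differences between predicted and actual ratings."""
    return sum(abs(p - a) for a, p in zip(actual, predicted))

case_a = ([5, 4], [3, 4])  # one prediction off by two
case_b = ([3, 4], [2, 3])  # two predictions off by one each

for name, (actual, predicted) in (("A", case_a), ("B", case_b)):
    print(name, total_absolute_error(actual, predicted), round(rmse(actual, predicted), 3))
# A 2 1.414
# B 2 1.0
```

Both cases have the same total error, but squaring makes the single larger miss count for more.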
"Copyright"2010/2012"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent."

Cross Validation
• The K-fold cross validation approach can add more rigor
  - Can mitigate the risk of sampling error
• Instead of dividing the input data into training and test sets once
  - Divide the input data into K equal partitions
  - One partition will be the test set while the rest are used for training
  - Train the recommender using the training set
  - Evaluate the recommender using the test set
  - Repeat these steps for each partition
  - Then calculate the mean RMSE value from all tests
• This ultimately uses all data for both training and testing
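
The procedure above might be sketched as follows. All names are hypothetical, and `mean_rating_baseline` is only a stand-in for a real recommender: it predicts the training-set mean for every item so that the example runs on its own.

```python
import math
import random

def k_fold_partitions(data, k, seed=0):
    """Shuffle the data and deal it into k partitions of (nearly) equal size."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def cross_validate(data, k, train_and_evaluate):
    """Mean RMSE over k folds; train_and_evaluate(train, test) returns one RMSE."""
    folds = k_fold_partitions(data, k)
    scores = []
    for i, test_set in enumerate(folds):
        # Every partition except the current one is used for training.
        training_set = [row for j, fold in enumerate(folds) if j != i for row in fold]
        scores.append(train_and_evaluate(training_set, test_set))
    return sum(scores) / k

def mean_rating_baseline(training_set, test_set):
    """Stand-in recommender: predict the training-set mean rating for every item."""
    mean = sum(r for _, _, r in training_set) / len(training_set)
    return math.sqrt(sum((mean - r) ** 2 for _, _, r in test_set) / len(test_set))

data = [(1, 168, 5), (1, 172, 3), (1, 165, 1), (2, 168, 4), (2, 172, 2), (2, 165, 3)]
print(cross_validate(data, 3, mean_rating_baseline))
```

Each rating lands in the test set exactly once and in the training set K - 1 times, which is what the last bullet above describes.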
"Copyright"2010/2012"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent."

Precision and Recall
• Now that we've evaluated accuracy, let's assess quality
  - This is effectively a measure of the recommendations' relevance to the user
• Precision is the ratio of relevant recommendations among all those we offered (R / N)
• Recall is the ratio of relevant recommendations we made among all relevant recommendations that were possible (R / M)
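
Reading R as the number of relevant items we recommended, N as the number of items we recommended, and M as the number of relevant items that exist, the two ratios can be sketched as below. The function name and the item IDs are hypothetical.

```python
def precision_recall(recommended, relevant):
    """Return (precision, recall) for a list of recommended items."""
    recommended, relevant = set(recommended), set(relevant)
    r = len(recommended & relevant)                          # R: relevant items we offered
    precision = r / len(recommended) if recommended else 0.0  # R / N
    recall = r / len(relevant) if relevant else 0.0           # R / M
    return precision, recall

recommended = [168, 172, 165, 156]  # N = 4 items offered
relevant = [168, 156, 196]          # M = 3 items the user would find relevant
print(precision_recall(recommended, relevant))  # (0.5, 0.6666666666666666)
```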
"Copyright"2010/2012"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent."

Relationship Between Precision and Recall
• Precision and recall are most often inversely related
  - You could optimize for recall by presenting more suggestions
  - However, fewer of those suggestions would be relevant
[Figure: curve of precision (vertical axis, 0.2 to 1.0) against recall (horizontal axis, 0.2 to 1.0)]
• Therefore, precision at N is usually a more valuable metric

Limitations of Precision and Recall
• Although precision and recall are important metrics, they also have shortcomings
• Relevance and ratings are often at odds with one another
  - A user considers an item relevant if he or she is familiar with it
  - An unknown item is not relevant, even if it's a good suggestion
• Using precision and recall isn't scalable
  - Removing per-user biases makes computation expensive
  - Cannot realistically do this on more than a small subset of users

Design of Experiments
• An experiment can help you evaluate different solutions and choose which will be most worthwhile for production use
• Overview of major steps in experiment design
  1. Construct a hypothesis
  2. Determine how to manipulate the independent variable
  3. Consider methods to control other variables
  4. Decide what to measure and how to collect results

Constructing a Hypothesis
• The hypothesis is a statement of what you expect to observe
• Hypotheses are suggested by associations observed in the data
  - "Customers who watch action movies use our service more"
• May start out vague, but become precise with further analysis
  - A hypothesis must also be something you can test
  - "We expect that by suggesting 10% more action movies to all customers, those who accept the recommendations will spend an average of 15% more time using our service during the same period than customers who do not."

Null Hypothesis
• A hypothesis is a statement of correlation between two factors
  - Your experiment attempts to prove or disprove this relationship
• Possible explanations for behavior observed in an experiment
  - Your experiment was poorly designed or executed
  - Your hypothesis is correct
  - Random chance
• The null hypothesis relates to random chance
  - It asserts that no relationship exists between these factors
  - The null hypothesis is negative, and can never be proven

Confidence Intervals
• Confidence intervals help describe the role of random chance
  - In other words, how plausible it is that the null hypothesis explains the results
• A confidence interval conveys statistical significance
• A 95% confidence interval means that of 100 identical experiments
  - About 95 would produce an interval that contains the true value
  - The remaining five would miss it due to random chance
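
As one illustration, the sketch below computes an approximate 95% confidence interval for the difference in acceptance rates between two groups. It uses the normal (Wald) approximation, which is a technique chosen here for simplicity rather than one named in the course, and the counts are invented.

```python
import math

def diff_confidence_interval(success_a, n_a, success_b, n_b, z=1.96):
    """Approximate confidence interval for (rate B - rate A), normal approximation.

    z = 1.96 corresponds to a 95% interval.
    """
    p_a, p_b = success_a / n_a, success_b / n_b
    standard_error = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff - z * standard_error, diff + z * standard_error

# Hypothetical counts: recommendations accepted out of 1,000 users per group.
low, high = diff_confidence_interval(200, 1000, 260, 1000)
print(round(low, 3), round(high, 3))
```

If the interval excludes zero, as it does for these invented counts, random chance alone is an unlikely explanation for the difference between the groups.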
"Copyright"2010/2012"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent."

Variable Manipulation and Control
• The experiment must use the independent variable as input
  - In this case, the number of action movies a customer watches
• You must also control for other variables that may affect results
  - Preference for (or against) specific actors
  - Customer's age, income level, location, or gender
  - Day of week and time of day
  - Customer's tendency to accept our recommendations

Random Assignment and Stratification
• Random assignment to groups helps to control variables
  - Important to have a control group for comparison
  - Each group except the control group receives a treatment
• For Web sites, the typical random assignment is simple
  - Assign each user a unique identifier (e.g., in a browser cookie)
  - Convert this ID to an integer using a hash function
  - Calculate this integer (hash value) modulo the number of groups
• Stratification divides the population into homogeneous groups
  - This should be done before random assignment
  - Stratification helps ensure proper distribution during assignment
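
The hash-and-modulo assignment described above could look like the following sketch. The function name and the identifiers are hypothetical. SHA-256 from `hashlib` is used here because Python's built-in `hash()` is randomized between runs for strings, which would move users between groups.

```python
import hashlib

def assign_group(user_id, num_groups=2):
    """Map a user identifier to a group number in the range 0..num_groups-1."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_groups  # hash value modulo number of groups

# Hypothetical cookie identifiers.
for user_id in ("user-1001", "user-1002", "user-1003"):
    print(user_id, assign_group(user_id))
```

The same identifier always maps to the same group, so a returning user keeps seeing the same version for the duration of the test.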
"Copyright"2010/2012"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent."

Population Size and Test Duration
• You need a large enough population of test subjects
  - So that you decrease the likelihood of sampling error
  - Ensures that observations aren't the result of personal quirks
• Your test also needs to run over a long enough period
  - A test that runs for only two hours is sensitive to time of day
  - A one-day test is sensitive to day of week or month

Collecting Data
• Recommenders create data as well as consume it
  - What data will your experiment require?
  - What data will your experiment produce?
• How will you collect it?
  - Are you capturing everything of value?
  - Are you collecting it accurately?
  - Will your collection mechanism scale?

The A/B Test
• There are many different types of experiments
• We'll focus on one type: the A/B test
  - Divide population into two groups
  - Group A sees the original (unmodified) version
  - Group B sees the test (modified) version

Run an A/A Test First
• An A/A test still subdivides users into two groups
  - But both groups are shown the original version
• What benefit does an A/A test provide?
  - It verifies that your group assignment is correct
  - If it is, you should observe the same behavior from both groups

Using A/B Tests for Recommender Systems
• For recommender systems, A and B might be a difference in
  - Similarity metrics
  - Collaborative filtering algorithms
  - Scoring criteria
  - Presentation (user interface)

Using Test Results for Decision Support
• In many organizations, decisions are made based on HiPPO
  - Highest Paid Person's Opinion (i.e., what the boss says)
• Successful organizations listen to the customer instead
  - Not necessarily what the customer says, but what they do
  - They run a test and let the data speak for itself
• Our results ultimately influence a decision on ROI
  - Does this feature provide enough value to remain in production?
  - Does it warrant further investment?

Recommender System User Interfaces
• Recommenders are back-end systems
  - You must provide some way of displaying their output to users
• The user interface for presenting recommendations is important
  - It must be intuitive for the customer to use
• Many design factors can significantly influence acceptance rate
  - Size of product icons
  - Position of recommended items relative to one another
  - Color of various elements on the page
  - Text (font face, size, style)
• Determining the best design is often done through experiments

Presentation Example
[Figure: mock-up of a recommendation panel headed "Our top picks for you", showing Genre: Musical; Starring: Vanessa Hudgens and Zac Efron; Directed by: Kenny Ortega; and an "Our prediction:" field]

Explain Why You're Recommending an Item
• One of these things is not like the other

Allow Users to Correct Inaccurate Data
• Removing misleading signals improves recommendation quality

Improving UI Through A/B Tests
• The user interface for soliciting ratings is also important
  - You'll get more (and better) ratings if it's easy to use
• The A/B process can help you compare UI improvements
  - Let's examine one of Cloudera Movies' recent improvements
• We observed that 90% of customers over the age of 65 assigned a rating of 1 to every movie they watched

Improving UI Through A/B Tests (cont'd)
• We wanted to investigate what caused this
  - Were we making poor recommendations?
  - Were they just grumpy?
• We decided to conduct a brief online survey of these users
  - This quickly revealed that they had trouble with our ratings UI
• We originally used this simple drop-down menu on our Web site
[Figure: drop-down menu labeled "Your rating:" with the options 1, 2, 3, 4, 5]

Improving UI Through A/B Tests (cont'd)
• Our first A/B test showed the new design reduced that figure to 72%
  - Many users didn't understand what a rating of 1 meant
[Figure: previous design, a drop-down labeled "Your rating:" with the options 1, 2, 3, 4, 5; new design, a drop-down labeled "Your rating:" with the options "1 - Hated it", "2 - Not Good", "3 - OK", "4 - Liked it", "5 - Loved it"]

Improving UI Through A/B Tests (cont'd)
• We thought some users hadn't understood what we were asking
  - Our next test showed only negligible improvement
  - We subsequently reverted to the previous label
[Figure: previous design, a drop-down labeled "Your rating:" with the options "1 - Hated it" through "5 - Loved it"; new design, the same five options (its label was not captured)]

Improving UI Through A/B Tests (cont'd)
• Our next test showed a major improvement (reduced to 39%)
  - Many users didn't choose a rating; they just clicked submit
  - The new design requires an explicit selection
[Figure: previous design, a drop-down labeled "Your rating:" with the options "1 - Hated it" through "5 - Loved it"; new design, a "Your rating:" control (its contents were not captured)]

Improving UI Through A/B Tests (cont'd)
• Our current design showed a further 21% reduction
  - Many of our older users have poor vision and/or arthritis
  - This design is easier to see and requires less dexterity
[Figure: previous and new designs of the "Your rating:" control; visible text in the new design includes "I liked it" and "5 - Loved it"]

Review Questions
• What hypothesis can you make based on the data you've seen in your job?
• What is one limitation of using the RMSE metric for accuracy?

Hands-On Exercise: Improving Recommender Accuracy
• In this Hands-On Exercise, you will gain practice improving the accuracy of the recommender system using the techniques you've just learned
• Please refer to the Hands-On Exercise Manual for instructions on exercise #7

Essential Points
• An experiment can help you evaluate different solutions and choose which will be most worthwhile for production use
• The A/B test helps us compare the effectiveness of various treatments on a population of users
• Successful organizations value data more than opinions for making decisions
• The user interface is an essential part of a recommender system
• RMSE is a common way of measuring recommender accuracy
• Precision and recall can help us measure recommender quality

Experimentation and Evaluation
In this chapter you have learned
• How testing can lead to iterative improvements that continually benefit the organizations that conduct them
• Why user interface design is an important component of building and deploying a recommender system
• How to determine if your recommender is effective
• What considerations are involved in experiment design
13#48$
Production Deployment and Beyond
Chapter 14

Production Deployment and Beyond
In this chapter you will learn
- Which factors may impede recommender accuracy
- How to assess whether further improvements are worthwhile
- Which problems Mahout users often encounter
- Several techniques for improving your recommender system
- What a data scientist's role is in improving system performance

Chapter Topics
Production Deployment and Beyond
- Deploying to production
- Tips and techniques for working at scale
- Summarizing and visualizing results
- Considerations for improvement
- Next steps for recommenders
- Review questions
- Essential points
- Conclusion

Taking Recommenders to Production
- By taking recommenders to production we mean the process of moving from offline analysis of the data to making product recommendations to users on a live production system
- We should view this as a continuation of our experiments

Successful Deployment
- Initial experimentation answers one key question
  - Does a change in input lead to a corresponding change in output?
- Successful deployment cycles should answer several more
  - Does this negatively impact the production system?
  - How can we reduce the cost of testing so we can do more of it?
  - Have we tested the right thing?
- And most importantly of all
  - Does this create value for the organization?

Common Problems with Mahout
- Those who are new to Mahout often stumble over a few issues
- Configuration
  - Ensure that the JAVA_HOME environment variable is set
  - The hadoop executable must be in your PATH variable too
  - Performance settings must be appropriate for your Mahout job
    - More on this in a moment
- Input data
  - Ratings data for command-line jobs must be CSV
  - Users and items must be identified by integer values
  - Malformed ratings can skew predictions

Common Problems with Mahout (cont'd)
- Stray files and directories are also a problem
  - Apache Mahout doesn't fail fast by validating preconditions
- Here are a few things to check before running your job
  - Does the output directory exist?
    - Mahout (and Hadoop) won't overwrite existing output
  - Does the temp directory contain files from a previous job?
    - The temp directory must either be empty or not exist
  - Does your input contain a stray file?
    - The only files in the input directory should be ratings data
    - A stray file often results in an ArrayIndexOutOfBoundsException

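Because Mahout does not validate these preconditions itself, the checks above can be scripted and run before each job. The following is a minimal sketch, assuming the input, output, and temp directories are on the local filesystem (for data in HDFS you would need to substitute hadoop fs calls) and assuming each ratings line has exactly three fields: user ID, item ID, rating. The function names validate_rating_line and preflight are hypothetical.

```python
import os


def validate_rating_line(line):
    """Return True if line looks like 'userID,itemID,rating' with integer IDs."""
    fields = line.strip().split(",")
    if len(fields) != 3:  # assumption: exactly three fields per rating
        return False
    user, item, rating = fields
    try:
        int(user)
        int(item)
        float(rating)
    except ValueError:
        return False
    return True


def preflight(input_dir, output_dir, temp_dir):
    """Return a list of problems found; an empty list means no problem was detected."""
    problems = []
    if os.path.exists(output_dir):
        problems.append("output directory already exists: " + output_dir)
    if os.path.isdir(temp_dir) and os.listdir(temp_dir):
        problems.append("temp directory is not empty: " + temp_dir)
    for name in sorted(os.listdir(input_dir)):
        path = os.path.join(input_dir, name)
        if not os.path.isfile(path):
            problems.append("stray directory in input: " + name)
            continue
        with open(path) as handle:
            for number, line in enumerate(handle, start=1):
                if not validate_rating_line(line):
                    # Report only the first bad line of each file.
                    problems.append("%s line %d is malformed" % (name, number))
                    break
    return problems
```

A typical use would be to call preflight with the three directories and refuse to submit the job if the returned list is not empty.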
Ramping Up Your Deployment
- Almost any test has the potential for negative impact
  - The extent of which is initially unknown
- Consider deploying your change to a small population of users
  - This limits the potential damage
- Then ramp up in phases once any bugs are worked out
  - Phase 1: A = 99%, B = 1%
  - Phase 2: A = 85%, B = 15%
  - Phase 3: A = 50%, B = 50%
[Figure: for each phase, the split of users between the original version (A) and the test version (B)]

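One common way to implement a phased ramp-up is to hash each user ID into a fixed slot and compare the slot against the current percentage. The following is a minimal sketch under that assumption; the function name bucket and the experiment label "new-scoring" are hypothetical, and this is not presented as the method used in the course.

```python
import hashlib


def bucket(user_id, experiment, percent_b):
    """Assign a user to 'A' or 'B'; percent_b is the share (0-100) sent to B."""
    key = ("%s:%s" % (experiment, user_id)).encode("utf-8")
    # Map the hash onto 0..9999 so that 1% steps are representable.
    slot = int(hashlib.sha256(key).hexdigest(), 16) % 10000
    return "B" if slot < percent_b * 100 else "A"


# Ramp-up schedule from the slide: 1%, then 15%, then 50% of users see B.
for phase, percent_b in enumerate([1, 15, 50], start=1):
    users_in_b = sum(bucket(u, "new-scoring", percent_b) == "B"
                     for u in range(10000))
    print("Phase %d: %d of 10000 users in B" % (phase, users_in_b))
```

Because a user's slot never changes, anyone placed in B during Phase 1 stays in B in later phases, which keeps each user's experience consistent as the test population grows.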
Improving Performance
- Mahout (optionally) uses Hadoop
  - Advice about Hadoop performance tuning generally applies
- Mahout and Hadoop are implemented in Java
  - Likewise, advice about Java performance tuning helps too
- This is typically the domain of system administrators

Improving Performance (cont'd)
- Data scientists focus on performance through optimization
  - Not low-level configuration tuning like system administrators
- How might a data scientist improve performance?
  - Using a better similarity metric
  - Designing and implementing a better algorithm
  - Determining what input could be excluded without penalty

Acquiring Results from Production
- Following an experiment, you must analyze the data it produced
- You first need to retrieve that data from the production system
  - And the data science lifecycle repeats itself
[Figure: the data science project lifecycle, shown as a cycle: Define a Problem, Identify Desired Outcome, Evaluate Possible Solutions, Determine Which Data is Required, Make Improvements, Measure Effectiveness, Communicate Results]

Summarizing Data
- As with the original data that led to the experiment, one of our first steps should be to analyze the distribution of data

Summarizing Data (cont'd)
- Analyzing the distribution for subsets of that data is also useful
  - This can lead to additional experiments and improvements
[Figure: Distribution of Movie Ratings Among Users 65 - 75 of Age; y-axis from 5,000 to 40,000]

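A distribution for a subset of users can be computed with a simple filter and count. The following is a minimal sketch, assuming each record is a (user age, rating) pair on a 1 to 5 scale; the sample records and the function name rating_distribution are hypothetical.

```python
from collections import Counter

# Hypothetical records: (user_age, rating on a 1-5 scale)
ratings = [(68, 4), (71, 5), (34, 3), (66, 4), (72, 2), (25, 5), (70, 4)]


def rating_distribution(records, min_age, max_age):
    """Count how often each rating occurs among users aged min_age..max_age."""
    return Counter(r for age, r in records if min_age <= age <= max_age)


dist = rating_distribution(ratings, 65, 75)
for rating in sorted(dist):
    # Print a crude text histogram, one '#' per rating observed.
    print("%d stars: %s" % (rating, "#" * dist[rating]))
```

Comparing this subset's histogram with the histogram for all users is one way to notice a group whose behavior differs enough to justify a further experiment.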
Visualization
- Scatterplots and linear regression can help you spot trends
  - In this case, it may indicate that older people now find our recommendations more useful
[Figure: Minutes Watched Per Week, by Age; scatterplot with regression lines for the control group (A) and the test group (B)]

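The trend line in a scatterplot like this one is usually an ordinary least-squares fit. The following is a minimal sketch of that calculation; the ages and minutes are made-up illustrative values, not data from the course, and fit_line is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data: minutes watched per week at several ages.
ages = [20, 30, 40, 50, 60, 70]
control_minutes = [70, 68, 66, 64, 62, 60]   # group A
test_minutes = [70, 69, 68, 67, 66, 65]      # group B

slope_a, intercept_a = fit_line(ages, control_minutes)
slope_b, intercept_b = fit_line(ages, test_minutes)
print("A: minutes = %.2f * age + %.1f" % (slope_a, intercept_a))
print("B: minutes = %.2f * age + %.1f" % (slope_b, intercept_b))
```

In this made-up data, viewing time falls off with age more slowly in group B than in group A, which is the kind of difference the plot is meant to reveal.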
Visualization (cont'd)
- Time-series plots can also reveal interesting things to explore
  - These are sometimes problems with your implementation
[Figure: Number of Movies Watched, by Day of Experiment; lines for the control group (A) and the test group (B), with one sudden drop annotated "What caused this drop?"]

Visualization (cont'd)
- Visualization is used to clearly communicate results
  - Can be an effective way to demonstrate the value you've created
[Figure: Quarterly Revenue Gained by Implementing Each Experiment; y-axis from $2,000,000 to $3,000,000]

More Accuracy Can Be Difficult to Achieve
- Improving accuracy becomes more difficult over time
  - Only the hard problems remain after the initial low-hanging fruit
[Figure: Aggregate Improvement over Cinematch During First Eight Months of Netflix Prize, by Week; y-axis from 1.00% to 11.00%, x-axis in weeks]

Increasing Accuracy Might Not Pay
- Increased accuracy may be possible, but not be worth the effort
  - On the other hand, it very well could be
- Whether the effort is a good investment depends largely on scale
  - A tiny improvement for a small e-commerce site may net $5,000
  - The same improvement for a huge retailer could net $5,000,000
  - Whether it pays off depends on how much it costs to implement

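The trade-off comes down to simple arithmetic. The following is a minimal sketch using the two gains quoted above; the implementation cost of $50,000 is a hypothetical figure chosen only for illustration.

```python
def net_benefit(gain, cost):
    """Value left over after paying for the improvement."""
    return gain - cost


implementation_cost = 50000  # hypothetical cost, the same for both sites

for site, gain in [("small e-commerce site", 5000),
                   ("huge retailer", 5000000)]:
    print("%s: net benefit = $%d" % (site, net_benefit(gain, implementation_cost)))
```

Under this assumed cost the small site loses money on the improvement while the large retailer comes out far ahead, even though the improvement itself is identical.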
Perfect Accuracy Is Unobtainable
- Recommenders consume input provided by humans
  - We don't know how they feel, we only know what they tell us
  - And what they tell us is inconsistent
- Ratings are subjective and ephemeral
  - Mood, setting, and many other factors can affect ratings
  - A user may rate the same item differently under other conditions
  - These inconsistencies should be considered noise in the data
- An RMSE of 0.0 may be possible
  - But this doesn't mean the recommender is perfectly accurate
  - It really indicates overfitting (not properly accounting for noise)

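The point about an RMSE of 0.0 can be illustrated with a model that simply memorizes the ratings it was given. The following is a minimal sketch, assuming the same users rate the same movies on two occasions; the ratings, user IDs, and movie IDs are hypothetical.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / float(len(squared)))


# Hypothetical ratings: the same (user, movie) pairs rated on two occasions.
first_session = {("u1", "m1"): 4, ("u1", "m2"): 2, ("u2", "m1"): 5}
second_session = {("u1", "m1"): 3, ("u1", "m2"): 2, ("u2", "m1"): 4}

# A "model" that simply memorizes the first session.
memorized = dict(first_session)

keys = sorted(first_session)
train_error = rmse([memorized[k] for k in keys], [first_session[k] for k in keys])
repeat_error = rmse([memorized[k] for k in keys], [second_session[k] for k in keys])
print("RMSE on memorized ratings: %.3f" % train_error)
print("RMSE on repeated ratings:  %.3f" % repeat_error)
```

The memorizing model scores a perfect 0.0 against the data it has already seen, yet it is wrong as soon as users rate the same movies again, because it has fit the noise along with the signal.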
Reconsider Your Goals
- If you haven't achieved the desired result, take a moment to consider whether you're solving the right problem
  - Your goals need to be aligned with business interests
- Recommender accuracy is not our actual goal
  - Increasing customer satisfaction is more important
  - Reducing the customer cancellation rate is more important still
  - Creating a long-term increase in profits is most important of all

Reconsider Your Metrics
- Also reconsider how you measure progress towards this goal
- Metrics often focus on short-term observations for comparison
  - But long-term metrics are the true measure of success
- Take the example of an e-commerce site
  - Time spent on site is not a good short-term metric
  - Average order value is better
- A single number can be misleading
  - Consider using multiple metrics

The Performance Metric is Key
- Quality and accuracy are important considerations
  - But we must not lose sight of the original goals
- Another key measure of effectiveness is performance
  - How many of our recommendations did the user accept?
- Performance is essential to a recommender
  - It's a prerequisite for a meaningful measure of effectiveness
  - More importantly, an unused system has little value to the user

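Performance in this sense is an acceptance rate: the share of recommendations shown that the user acted on. The following is a minimal sketch, assuming we log which items were shown and which were accepted; the function name acceptance_rate and the item IDs are hypothetical.

```python
def acceptance_rate(shown, accepted):
    """Fraction of the recommendations shown that the user accepted."""
    shown = set(shown)
    if not shown:
        return 0.0
    return len(shown & set(accepted)) / float(len(shown))


# Hypothetical logs for one user under each treatment.
rate_a = acceptance_rate([1, 2, 3, 4, 5], [2])       # control group (A)
rate_b = acceptance_rate([1, 2, 3, 4, 5], [2, 4])    # test group (B)
print("A: %.0f%% accepted, B: %.0f%% accepted" % (rate_a * 100, rate_b * 100))
```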
How Can You Improve Your Recommender?
- If you've decided to continue your investment in the recommender, there are many improvements to consider
- We suggest pursuing these through project lifecycle steps
  - Define a problem
  - Identify desired outcome
  - Evaluate possible solutions
  - Determine which data is required to implement the solution
  - Make the improvement
  - Measure its effectiveness through experimentation
- We'll now look at some possible solutions to consider

Changing Relative Scores
- We can experiment with the relative weight of scoring factors
  - Some factors may be more important than first thought
  - Other factors may be less important
  - Still others may be unimportant or even detrimental

Previous %   New %   Description of Criteria
42%          38%     How customer rated similar movies
28%          32%     Cost of royalty payment required to show this movie
17%          16%     Whether the user searched for this movie by name
8%           0%      How long this movie has been in the customer's queue
5%           14%     Popularity of this movie among others in customer's region

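A weighted score of this kind is a sum of factor values multiplied by their weights. The following is a minimal sketch using the previous and new percentages from the table; the factor names, the example movie, and the assumption that every factor has been normalized to the range 0 to 1 (with higher meaning more favorable) are all hypothetical.

```python
previous_weights = {
    "similar_ratings": 0.42,
    "royalty_cost": 0.28,
    "searched_by_name": 0.17,
    "time_in_queue": 0.08,
    "regional_popularity": 0.05,
}
new_weights = {
    "similar_ratings": 0.38,
    "royalty_cost": 0.32,
    "searched_by_name": 0.16,
    "time_in_queue": 0.00,
    "regional_popularity": 0.14,
}


def score(factors, weights):
    """Weighted sum of factor values, each assumed to lie between 0 and 1."""
    return sum(weights[name] * factors[name] for name in weights)


# Hypothetical factor values for one candidate movie.
movie = {
    "similar_ratings": 0.9,
    "royalty_cost": 0.4,
    "searched_by_name": 0.0,
    "time_in_queue": 0.5,
    "regional_popularity": 0.8,
}
print("previous score: %.3f" % score(movie, previous_weights))
print("new score:      %.3f" % score(movie, new_weights))
```

Keeping the weights in a table like this, separate from the scoring function, makes it straightforward to run an A/B test in which the two groups differ only in the weights applied.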
Adding Additional Scoring Factors
- It's often worthwhile to add entirely new factors
  - Remember that more data typically yields better results
- Implicit feedback can be especially valuable
  - This can offer a more accurate view of what's important to customers

Score %   Description of Criteria
34%       How customer rated similar movies
29%       Cost of royalty payment required to show this movie
16%       Did user previously ignore this recommendation (new factor)
14%       Whether the user searched for this movie by name
5%        How long this movie has been in the customer's queue
2%        Popularity of this movie among others in customer's region

Experiment with Other Algorithms
- We may find that another algorithm works better for this case
  - But remember that more data is usually more important
[Figure: Test Accuracy (0.70 to 1.00) versus Millions of Words (0.1 to 1000) for four learners: Memory-Based, Winnow, Perceptron, and Naive Bayes]

Matrix Factorization Techniques
- Matrix factorization compares the intersection of two properties
  - Can be used to make more appropriate recommendations
[Figure: movies placed along two dimensions, Females to Males and Early Morning to Late Night: Old School, Sideways, Delta Force, Beaches, Pretty in Pink, Ferris Bueller's Day Off, Snow White, Shrek]

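To give a feel for how such hidden dimensions can be learned from ratings alone, the following is a minimal sketch of one common matrix factorization approach: stochastic gradient descent on a regularized squared error. It is a generic illustration, not Mahout's implementation, and the ratings, parameter values, and function names are hypothetical.

```python
import random


def factorize(ratings, n_users, n_items, k=2, steps=2000,
              rate=0.01, reg=0.02, seed=0):
    """Learn k latent factors per user (P) and per item (Q) from ratings."""
    rng = random.Random(seed)
    P = [[rng.uniform(0, 0.5) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(0, 0.5) for _ in range(k)] for _ in range(n_items)]
    for _ in range(steps):
        for u, i, r in ratings:
            err = r - sum(P[u][f] * Q[i][f] for f in range(k))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Move both factors a small step against the error gradient.
                P[u][f] += rate * (err * qi - reg * pu)
                Q[i][f] += rate * (err * pu - reg * qi)
    return P, Q


def predict(P, Q, u, i):
    """Predicted rating: dot product of the user and item factor vectors."""
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))


def squared_error(P, Q, ratings):
    return sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings)


# Hypothetical (user, item, rating) triples; some pairs are unrated.
ratings = [(0, 0, 5), (0, 1, 4), (0, 2, 1), (1, 0, 4), (1, 1, 5), (1, 3, 1),
           (2, 2, 5), (2, 3, 4), (3, 0, 1), (3, 2, 4), (3, 3, 5)]

P0, Q0 = factorize(ratings, 4, 4, steps=0)   # untrained starting point
P, Q = factorize(ratings, 4, 4)
print("squared error before training: %.3f" % squared_error(P0, Q0, ratings))
print("squared error after training:  %.3f" % squared_error(P, Q, ratings))
print("predicted rating for user 0, item 3: %.2f" % predict(P, Q, 0, 3))
```

The learned factors have no labels of their own; names such as those on the axes of the figure are interpretations applied afterwards.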
Hybrid Systems
- We may integrate aspects of a content-based recommender
- Content-based recommenders consider properties such as
  - Actors
  - Director
  - Screenwriter
  - Theme
  - Language
- May take into account customers' ratings of the following
  - Movies which feature the same actor(s) or director
  - Movies with similar plots
  - Movies filmed in the same location

Hybrid Systems (cont'd)
- Current time of year and a movie's theme can influence choices
  - Sleepless in Seattle might be good near Valentine's Day
  - Halloween is more appropriate for October
- We might consider the customer's age and the movie's era
  - Recommend Grease to someone who is 70
  - Recommend The Breakfast Club to someone who is 40

User Interface Improvements
- The UI is an important part of the recommender system
  - Simple enhancements can often yield significant gains
- Possible user interface improvements
  - How we collect movie ratings from customers
  - How customers browse for movies by category
  - How customers search for specific movies

Review Questions
- Why might a small e-commerce site not benefit from improving its recommender while a much larger site would?
- Can you name three ways that Cloudera Movies might be able to improve its recommendations?

Essential Points
- Taking a recommender to production is often just the next step in a continued cycle of experimentation
- It's a good idea to validate input and output before submitting a job to Apache Mahout
- General advice for Hadoop and Java can be useful in tuning Mahout's performance
- There are many techniques that can improve a recommender
  - But it's not always worthwhile or even possible to do so

Production Deployment and Beyond
In this chapter you have learned
- Which factors may impede recommender accuracy
- How to assess whether further improvements are worthwhile
- Which problems Mahout users often encounter
- Several techniques for improving your recommender system
- What a data scientist's role is in improving system performance

Bibliography
The following offer more information on topics discussed in this chapter
- Incorporating Contextual Information into Recommender Systems via a Multidimensional Approach
  http://tiny.cloudera.com/dscc14a
- Towards the Next Generation of Recommender Systems
  http://tiny.cloudera.com/dscc14b
- Matrix Factorization Techniques for Recommender Systems
  http://tiny.cloudera.com/dscc14c

Conclusion
Chapter 15

Chapter Topics
Conclusion
- Essential points
- Next steps

Essential Points
- Data science is the combination of several skills, including mathematics, software engineering, communications, and domain experience
- Data products are built from data and produce new data as they're used, allowing them to be further improved
- Key themes of data science use cases include
  - Access to large amounts of data
  - Acquisition of several types of data from different sources
  - The ability to analyze this data at scale to find interesting patterns
  - Problem solving focused on a specific and actionable result
- The lifecycle of a data science project is iterative
  - Stating a problem and constantly measuring results are key

Essential Points (cont'd)
- Data can be found both inside and outside your organization
  - It's not always provided in the most convenient formats
  - Issues with data quality are inevitable at scale
  - Filtering, normalization, and transformation are often required
- Summarizing and visualizing data is an important first step
  - These techniques can reveal errors or other interesting patterns
  - Skewed distributions can be misleading

Essential Points (cont'd)
- More data is usually preferable to a better algorithm, at scale
- Machine learning is often described in terms of the Three Cs
  - Collaborative Filtering (recommendations)
  - Clustering (grouping items into subsets)
  - Classification (identifying items by type)
- Two main approaches to recommenders
  - Content-based recommenders consider an item's attributes
  - Collaborative filtering considers the actions of other users
  - A hybrid of these two approaches is also possible
- Collaborative filtering can be user-based or item-based
  - The item-based approach tends to scale better

Essential Points (cont'd)
- Preferences can be captured explicitly or implicitly
- Similarity metrics are chosen based on preference data type
  - Binary data
    - Tanimoto coefficient
    - Log likelihood
  - Numeric data
    - Euclidean distance
    - Uncentered cosine
    - Pearson correlation
- There's no best similarity metric
  - Each one is better in some cases than others
  - Magnitude and ratings bias are also important considerations

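Most of these metrics are short formulas. The following is a minimal sketch of the standard textbook definitions of four of them (log likelihood is omitted for brevity); it illustrates the mathematics and is not Mahout's implementation, and the function names are hypothetical.

```python
import math


def tanimoto(a, b):
    """Binary data: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / float(len(union)) if union else 0.0


def euclidean_distance(x, y):
    """Numeric data: straight-line distance (smaller means more similar)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))


def cosine(x, y):
    """Numeric data: uncentered cosine of the angle between two vectors."""
    dot = sum(p * q for p, q in zip(x, y))
    norm = math.sqrt(sum(p * p for p in x)) * math.sqrt(sum(q * q for q in y))
    return dot / norm


def pearson(x, y):
    """Numeric data: cosine of the mean-centered vectors."""
    mean_x = sum(x) / float(len(x))
    mean_y = sum(y) / float(len(y))
    return cosine([p - mean_x for p in x], [q - mean_y for q in y])


# Two users who agree on the ordering of three movies, but one rates
# everything two points higher (ratings bias).
harsh, generous = [1, 2, 3], [3, 4, 5]
print("cosine:  %.3f" % cosine(harsh, generous))
print("pearson: %.3f" % pearson(harsh, generous))
```

Pearson correlation treats the two users above as identical in taste because it subtracts each user's mean rating, while uncentered cosine does not; this is one way ratings bias affects the choice of metric.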
Essential Points (cont'd)
- Apache Mahout is an open source machine learning library
  - It's really a collection of various implementations and utilities
  - Many (though not all) of Mahout's algorithms can use Hadoop for better performance and scalability
- Scoring is a key aspect of building recommender systems
  - Indicates relative importance of various criteria
  - Often requires domain knowledge
  - Can yield significant value for the organization
- The user interface is an essential part of recommender systems

Essential Points (cont'd)
- An experiment can help you evaluate different solutions
  - And select the ones that will provide the best benefit
- Successful organizations constantly experiment
  - They often employ A/B tests as a way of comparing treatments
  - Data matters more than opinions when making decisions
- Measuring recommender effectiveness frequently involves
  - RMSE (accuracy)
  - Precision and recall (quality)
  - Performance (acceptance rate)

Essential Points (cont'd)
- Production deployment is often just the start of a new experiment
- There are many techniques for making further improvements
  - Not all of these will prove worthwhile to implement
  - Always consider your costs and potential gains

Which Course to Take Next?
Cloudera offers a range of training courses
- For developers
  - Developer Training for Apache Hadoop
  - Cloudera Training for Apache HBase
- For analysts and DBAs
  - Cloudera Training for Apache Hive and Pig
- For architects, managers, CIOs and CTOs
  - Essentials for Apache Hadoop
- For system administrators
  - Administrator Training for Apache Hadoop

Certification Overview
- Cloudera offers a range of industry-recognized certifications
  - Cloudera Certified Developer for Apache Hadoop
  - Cloudera Certified Administrator for Apache Hadoop
  - Cloudera Certified Specialist in Apache HBase
- Cloudera Certified Professional (CCP) Data Scientist coming in 2013
  - Our Data Scientist certification will involve a two-step process

Certification Overview (cont'd)
- Step One: CCP Data Scientist Written Examination
  - A two-hour written qualification exam covering concepts and skills from a broad range of data science topics
- Step Two: CCP Data Scientist Lab Examination
  - This hands-on exam measures both your technical ability and your capacity to develop creative approaches to building data products
  - You must pass the written exam before you're eligible to schedule the lab exam
- You'll receive a free invitation to participate in the beta and final versions of the written exam when they're available
- For more information about Cloudera certification, refer to
  http://university.cloudera.com/certification.html

Thanks!
- Thank you for attending the course!
- If you have any questions or comments, please contact us via
  http://www.cloudera.com/

Hadoop Overview
Appendix A

What is Apache Hadoop?
- It's a scalable data storage and processing system
  - Open source Apache project
  - Harnesses the power of industry-standard hardware
  - Distributed and fault-tolerant
  - Mostly written in Java
- Core Hadoop consists of two main parts
  - Storage: Hadoop Distributed File System (HDFS)
  - Processing: MapReduce
- Many other tools use Hadoop or are built on top of Hadoop
  - This includes Hive, Sqoop, Flume, and Mahout
  - These are collectively known as the Hadoop ecosystem

HDFS: Hadoop Distributed File System
- Based on Google's GFS (Google File System)
  - Highly optimized for processing data with MapReduce
- Provides storage for massive amounts of data
  - Using inexpensive commodity hardware
  - Cost per GB is typically about 1/10th that of enterprise storage
- At load time, data is distributed across all nodes
  - This improves reliability and performance

Comparing HDFS to Other Filesystems
- HDFS is not built into the operating system
  - It must be accessed via the hadoop fs command
  - Or via the REST or Java APIs
- In some ways, HDFS is similar to a UNIX filesystem
  - Hierarchical
  - UNIX-style paths (e.g. /foo/bar/myfile.txt)
  - File ownership and permissions
- There are also some major deviations from UNIX
  - No concept of a current directory
  - Cannot modify files once written

Using Hadoop Commands to Access HDFS
- The hadoop fs utility has several subcommands
  - These are very similar to standard UNIX commands
- The examples below show a few common uses
  1. List the contents of the /clouderamovies directory
     $ hadoop fs -ls /clouderamovies
  2. Display the contents of a file stored in HDFS
     $ hadoop fs -cat /clouderamovies/remotefile.txt
  3. Copy a local file to the /clouderamovies directory in HDFS
     $ hadoop fs -put localfile.txt /clouderamovies
  4. Copy a file from HDFS to the local filesystem
     $ hadoop fs -get /clouderamovies/remotefile.txt

MapReduce Introduction
- MapReduce is not a language; it's a programming model
  - A style of processing data you could implement in any language
- MapReduce has its roots in functional programming
  - Many languages have functions named map and reduce
  - These functions have largely the same purpose in Hadoop
- Popularized for large-scale data processing by Google

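Python is one of the languages with functions named map and reduce. The following is a minimal sketch of the functional style on a small in-memory list; the event names are hypothetical, and this runs in a single process rather than on Hadoop.

```python
from functools import reduce

events = ["info", "warn", "info", "error"]

# map applies a function to every element independently.
upper = list(map(str.upper, events))
pairs = list(map(lambda event: (event, 1), upper))

# reduce folds a sequence of values into a single summary value.
total = reduce(lambda acc, pair: acc + pair[1], pairs, 0)

print(upper)
print(total)
```

Because each call made by map looks at one element only, the work can be split across many machines, which is the property Hadoop exploits.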
MapReduce Benefits
- Scalable because you process one record at a time
  - The input data and processing tasks can be distributed across (potentially thousands of) machines for faster processing
- Complex details are abstracted away from the developer
  - MapReduce jobs don't require writing code to handle
    - Reading input from files
    - Writing output to files
    - Networking among nodes in a cluster
    - Synchronization between processes

Understanding Map and Reduce
- MapReduce consists of two functions: Map and Reduce
  - The output from Map becomes the input to Reduce
  - Hadoop automatically sorts and groups data between these functions
- The Map function always runs first
  - Typically used to filter, transform, or parse data
- The Reduce function is optional
  - Normally used to summarize data from the Map function
  - Since this isn't always needed, you can run Map-only jobs
- Each piece is simple, but can be powerful when combined

MapReduce Example in Python
- MapReduce code for Hadoop is typically written in Java
  - But it is possible to use nearly any language with Hadoop Streaming
- The following example will use MapReduce in Python
  - It processes log files in order to summarize events by type
- The example will illustrate both the data flow and the code

Job Input
- Our job will count the event type across all log files supplied as input to the job
[Sample input: seven log records timestamped 2012-09-06 22:16:49.391 CDT through 2012-09-06 22:16:49.399 CDT, each containing an event type such as INFO, WARN, or ERROR]
- Each mapper gets a chunk of the entire job's input data to process
  - This chunk is called an InputSplit

Python Code for Map Function
- Our map function will parse the event type
  - And then output that event (key) and a literal 1 (value)

#!/usr/bin/env python
import sys

# Define list of known log events
levels = ['TRACE', 'DEBUG', 'INFO',
          'WARN', 'ERROR', 'FATAL']

# Split every line (record) we receive on standard input
# into fields, normalized by case
for line in sys.stdin:
    fields = line.upper().split()
    for field in fields:
        # If this field matches a log level, print it,
        # a tab separator, and the literal value 1
        if field in levels:
            print('%s\t1' % field)

Output of Map Function
- The map function produces key/value pairs as output

INFO    1
INFO    1
WARN    1
INFO    1
WARN    1
INFO    1
ERROR   1

Input to Reduce Function
- The Reducer receives a key and all values for that key
  - Keys are always passed to reducers in sorted order
  - Although not obvious here, values are unordered

    ERROR   1
    INFO    1
    INFO    1
    INFO    1
    INFO    1
    WARN    1
    WARN    1

Python Code for Reduce Function
- The Reducer first extracts the key and value it was passed

    #!/usr/bin/env python
    import sys

    # Initialize loop variables
    previous_key = ''
    sum = 0

    for line in sys.stdin:
        # Extract the key and value passed via standard input
        fields = line.split()
        key, value = fields
        value = int(value)
        # continued on next slide

Python Code for Reduce Function
- Then simply adds up the value for each key

    [Listing lines 14-23 are not preserved in this copy]

  Annotations on the code:
  - If key unchanged, increment the count
  - If key changed, print sum for previous key
  - Re-init loop variables
  - Print sum for final key

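The annotations on the two reducer slides can be combined into one sketch. The version below is an illustration under stated assumptions, not the course's listing: `reduce_lines` and `main` are hypothetical names, `total` replaces the slide's `sum` so the built-in is not shadowed, and `None` stands in for the empty-string initial key. It relies on the input arriving sorted by key, as Hadoop guarantees.

```python
import sys

def reduce_lines(lines):
    """Yield 'KEY<tab>SUM' for each run of identical keys in sorted input."""
    previous_key = None
    total = 0
    for line in lines:
        # Extract the key and value passed via standard input
        key, value = line.split()
        value = int(value)
        if key == previous_key:
            # Key unchanged: increment the count
            total += value
        else:
            # Key changed: emit the sum for the previous key, if any
            if previous_key is not None:
                yield '%s\t%d' % (previous_key, total)
            # Re-initialize loop variables for the new key
            previous_key = key
            total = value
    # Emit the sum for the final key
    if previous_key is not None:
        yield '%s\t%d' % (previous_key, total)

def main():
    for result in reduce_lines(sys.stdin):
        print(result)
```

As with the mapper, a Streaming script would end with a call to `main()`.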
Output of Reduce Function
- The output of this Reduce function is a sum for each level

    ERROR   1
    INFO    4
    WARN    2

Recap of Data Flow

  Map input:
    2012-09-06 22:16:49.391 CDT
    2012-09-06 22:16:49.392 CDT
    2012-09-06 22:16:49.394 CDT
    2012-09-06 22:16:49.395 CDT
    2012-09-06 22:16:49.397 CDT
    2012-09-06 22:16:49.398 CDT
    2012-09-06 22:16:49.399 CDT
    [Event type and remaining fields of each record are not preserved in this copy]

  Map output:
    INFO    1
    INFO    1
    WARN    1
    INFO    1
    WARN    1
    INFO    1
    ERROR   1

  Reduce input:
    ERROR   1
    INFO    1
    INFO    1
    INFO    1
    INFO    1
    WARN    1
    WARN    1

  Reduce output:
    ERROR   1
    INFO    4
    WARN    2

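The four stages recapped above can be imitated in a single process to check the expected counts. This is only a local simulation, with no Hadoop involved; `run_job` is a hypothetical helper, and the sample records are invented so that only their event types mirror the slide.

```python
from itertools import groupby

LEVELS = {'TRACE', 'DEBUG', 'INFO', 'WARN', 'ERROR', 'FATAL'}

def run_job(records):
    """Simulate map, shuffle/sort, and reduce for the event-count job."""
    # Map: emit (level, 1) for the first known level in each record
    mapped = []
    for record in records:
        for field in record.upper().split():
            if field in LEVELS:
                mapped.append((field, 1))
                break
    # Shuffle and sort: order the pairs by key
    mapped.sort(key=lambda pair: pair[0])
    # Reduce: sum the values for each key
    return [(key, sum(value for _, value in group))
            for key, group in groupby(mapped, key=lambda pair: pair[0])]

# Hypothetical records; only the event types follow the slide
sample = ['a INFO x', 'b INFO y', 'c WARN z', 'd INFO w',
          'e WARN v', 'f INFO u', 'g ERROR t']
print(run_job(sample))
```

The result has one (level, count) pair per key, in sorted key order, matching the Reduce output shown above.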
How to Run a Hadoop Streaming Job
- Jobs are submitted via the hadoop jar command
- The $STREAMJAR variable is defined in your virtual machine
  - It points to the Hadoop Streaming JAR file
  - Outside of your VM, you'll have to specify this file's path
  - Exact name and location vary based on Hadoop version
- The -input argument specifies the HDFS path for job input
- The -output argument specifies the HDFS path for job output
  - It must not already exist or the job will fail

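The command line itself is not preserved on this slide. As a rough illustration, the sketch below assembles the arguments described above in Python without running anything. The helper name `streaming_command`, the fallback JAR path, and the script names in the usage note are placeholders. The `-mapper`, `-reducer`, and `-file` options are standard Hadoop Streaming options that the slide does not mention, so treat their use here as an assumption about how the job was launched.

```python
import os

def streaming_command(input_path, output_path, mapper, reducer):
    """Build the argument list for a Hadoop Streaming job (not executed here)."""
    # $STREAMJAR is set inside the course VM; the fallback is a placeholder path
    streamjar = os.environ.get('STREAMJAR', '/path/to/hadoop-streaming.jar')
    return ['hadoop', 'jar', streamjar,
            '-input', input_path,     # HDFS path for job input
            '-output', output_path,   # HDFS path for job output; must not exist yet
            '-mapper', mapper,
            '-reducer', reducer,
            '-file', mapper,          # ship the scripts to the cluster
            '-file', reducer]
```

A call such as `streaming_command('logs', 'eventcounts', 'mapper.py', 'reducer.py')` returns a list that could be passed to `subprocess.call` on a machine where Hadoop is installed.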
Input Data Feeds the Map Tasks
- Input for the entire job is subdivided into InputSplits
  - Each of these serves as input to a single Map task

    Input for entire job (192 MB)
      64 MB -> Mapper #1
      64 MB -> Mapper #2
      64 MB -> Mapper #3

Mappers Feed the Shuffle and Sort
- Output of all Mappers is partitioned, merged, and sorted
  - No code required; Hadoop does this automatically

    Mapper #1 output:  INFO 1, WARN 1, INFO 1, INFO 1, ERROR 1
    Mapper #2 output:  WARN 1, INFO 1, INFO 1, INFO 1, ERROR 1
    Mapper #N output:  WARN 1, INFO 1, WARN 1, INFO 1, ERROR 1

    After the shuffle and sort:
      ERROR   1   (3 pairs)
      INFO    1   (8 pairs)
      WARN    1   (4 pairs)

Shuffle and Sort Feeds the Reducers
- All values for a given key are then collapsed into a list
  - The key and all its values are fed to reducers as input
  - Note that it is common for jobs to have only one reducer

    Sorted pairs             Key with its list of values
    ERROR   1   (3 pairs)    ERROR   1 1 1
    INFO    1   (8 pairs)    INFO    1 1 1 1 1 1 1 1
    WARN    1   (4 pairs)    WARN    1 1 1 1

    [Diagram: the keys are distributed between Reducer #1 and Reducer #2]

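To make the grouping step concrete, here is a small sketch of how pairs might be assigned to reducers and collapsed into lists. It is an approximation, not Hadoop's implementation: `shuffle` is a hypothetical helper, and a CRC32 checksum stands in for whatever hash function the framework's partitioner actually applies to the key.

```python
import zlib
from collections import defaultdict

def shuffle(pairs, num_reducers):
    """Group values by key and assign each key to exactly one reducer."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        # Deterministic hash, so the same key always reaches the same reducer
        index = zlib.crc32(key.encode('utf-8')) % num_reducers
        reducers[index][key].append(value)
    # Each reducer receives its keys in sorted order
    return [sorted(reducer.items()) for reducer in reducers]

pairs = [('INFO', 1), ('WARN', 1), ('ERROR', 1), ('INFO', 1)]
for number, reducer_input in enumerate(shuffle(pairs, 2), start=1):
    print('Reducer #%d' % number, reducer_input)
```

Which reducer a given key lands on depends on the hash, but every value for that key always ends up in the same list.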
Each Reducer Produces an Output File
- These are stored in HDFS below your output directory
  - Use hadoop fs -getmerge to combine them into a local copy

    [Diagram: Reducer #1 and Reducer #2 each write one output file; together the files hold ERROR 3, INFO 8, and WARN 4]

Mathematical Formulas
Appendix B

Tanimoto Coefficient
- The Tanimoto coefficient measures the items shared between two sets and can be expressed using the formula below

\[ T = \frac{N_c}{N_a + N_b - N_c} \]

- Where
  - T is the Tanimoto coefficient
  - N_a is the number of items in set A
  - N_b is the number of items in set B
  - N_c is the number of items in the intersection of sets A and B

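The formula translates directly into set operations. The function below is a minimal sketch; the name `tanimoto` and the convention of returning 0.0 for two empty sets are assumptions made for illustration.

```python
def tanimoto(set_a, set_b):
    """Tanimoto coefficient: N_c / (N_a + N_b - N_c)."""
    a, b = set(set_a), set(set_b)
    n_c = len(a & b)                      # items in the intersection
    denominator = len(a) + len(b) - n_c   # size of the union
    if denominator == 0:
        return 0.0                        # assumption: two empty sets score 0
    return float(n_c) / denominator

print(tanimoto({1, 2, 3}, {2, 3, 4}))
```

Here two of the four distinct items are shared, so the coefficient is 0.5.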
Log Likelihood
- Log likelihood is a similarity metric for binary data
- Computing the log likelihood is a multi-step process that involves first computing the entropy for the values in rows and columns for a contingency matrix like this

                                How many watched Y?          How many did not watch Y?
    How many watched X?         # who watched both X and Y   # who watched X but not Y
    How many did not watch X?   # who watched Y but not X    # who watched neither X nor Y

- The code that Mahout uses to calculate log likelihood can be found here
  http://tiny.cloudera.com/dsc_apc_01

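One common way to carry out the entropy steps described above is sketched below. It is an illustration of the general entropy-based log-likelihood ratio, not a copy of the Mahout source at the link. The function names are hypothetical, and the arguments `k11`, `k12`, `k21`, `k22` are assumed to be the four cells of the contingency matrix (both, X only, Y only, neither).

```python
import math

def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) is 0."""
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    """Unnormalized entropy: xlogx(total) minus the sum of xlogx(count)."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def log_likelihood_ratio(k11, k12, k21, k22):
    """k11 = both, k12 = X but not Y, k21 = Y but not X, k22 = neither."""
    row_entropy = entropy(k11 + k12, k21 + k22)
    column_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    # Clamp at zero to guard against tiny negative values from round-off
    return max(0.0, 2.0 * (row_entropy + column_entropy - matrix_entropy))
```

A table in which watching X says nothing about watching Y (for example, all four cells equal) scores 0, and the score grows as the two behaviours become more strongly associated.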
Euclidean Distance
- Euclidean distance measures the distance between two points and can be expressed using the formula below

\[ D = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2} \]

- Where
  - D is the Euclidean distance
  - q_i and p_i are the i-th coordinates of the two points q and p
  - n is the number of coordinates in each point

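As a quick check of the formula, the following sketch computes the distance for two points given as equal-length sequences of coordinates. The function name `euclidean_distance` is chosen for illustration.

```python
import math

def euclidean_distance(p, q):
    """Square root of the summed squared coordinate differences."""
    if len(p) != len(q):
        raise ValueError('points must have the same number of coordinates')
    return math.sqrt(sum((q_i - p_i) ** 2 for p_i, q_i in zip(p, q)))

print(euclidean_distance([0, 0], [3, 4]))
```

The two points differ by 3 and 4 along the two axes, so the distance is 5.0.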
Pearson Correlation
- The Pearson correlation is used to measure similarity between sets of values and can be expressed using the formula below

\[ P = \frac{\sum XY - \frac{\sum X \sum Y}{N}}{\sqrt{\left(\sum X^2 - \frac{(\sum X)^2}{N}\right)\left(\sum Y^2 - \frac{(\sum Y)^2}{N}\right)}} \]

- Where
  - P is the Pearson correlation
  - X is the first set of values
  - Y is the second set of values
  - N is the size of set X (which must also match the size of set Y)

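The sums in the formula map onto a few lines of Python. This is a minimal sketch; the name `pearson` and the choice to return 0.0 when either set has no variance are assumptions.

```python
import math

def pearson(xs, ys):
    """Pearson correlation computed from the sums used in the formula."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError('both sets must be non-empty and the same size')
    n = float(n)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = sum_xy - sum_x * sum_y / n
    denominator = math.sqrt((sum_x2 - sum_x ** 2 / n) *
                            (sum_y2 - sum_y ** 2 / n))
    if denominator == 0:
        return 0.0  # assumption: no variance means no measurable correlation
    return numerator / denominator
```

Values that rise together give a result close to 1, and values that move in opposite directions give a result close to -1.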
Root Mean Squared Error
- Root Mean Squared Error (RMSE) is a common metric for measuring the accuracy of predicted ratings and can be expressed using the formula below

\[ \mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - y_i)^2}{N}} \]

- Where
  - x_i is a value from the set of predicted ratings
  - y_i is a value from the set of actual ratings
  - N is the number of ratings compared (the n in the summation)

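A short sketch of the calculation follows, assuming the predicted and actual ratings are supplied as two equal-length sequences. The function name `rmse` is illustrative.

```python
import math

def rmse(predicted, actual):
    """Root of the mean squared difference between paired ratings."""
    n = len(predicted)
    if n == 0 or n != len(actual):
        raise ValueError('both sets must be non-empty and the same size')
    squared_error = sum((x - y) ** 2 for x, y in zip(predicted, actual))
    return math.sqrt(squared_error / float(n))

print(rmse([1, 2, 3], [2, 3, 4]))
```

Every prediction here is off by exactly one rating point, so the RMSE is 1.0.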
Language and Tool Reference
Appendix C

UNIX Command Line Tools: Finding Help
- The next few slides explain a few common UNIX commands
- You can find complete information by using the man command
  - For example, use this command to read about the sort utility

    $ man sort

- Many commands show a summary when called with --help

    $ sort --help

UNIX Command Line Tools: Viewing Files
- The cat command can display the entire content of a file

    $ cat example.txt
    This is the first line of
    the example.txt file. There
    are only four lines of text
    and this is the last line.

- Given several files, cat concatenates their contents; use the > redirection operator to save the result to a new file

    $ cat file01.txt file02.txt > combined.txt

UNIX Command Line Tools: head and tail
- The head command can display just the first few lines of a file
  - It shows ten lines by default
  - Use the -n option to see a specified number of lines

    $ head -n1 example.txt
    This is the first line of

- The tail command is similar, but shows the last few lines
  - Like the head command, it also supports a -n option

    $ tail -n2 example.txt
    are only four lines of text
    and this is the last line.

UNIX Command Line Tools: cut
- The cut command can display a particular field from a file
  - Based on the index specified in the -f option
  - This example also demonstrates how commands can be connected together via the pipe | operator

    $ head -n 3 tab-delimited-data.txt
    202     Washington D.C.
    212     New York City
    213     Los Angeles

- An alternate delimiter can be specified via the -d option

UNIX Command Line Tools: sort and uniq
- The sort command can be used to sort the contents of a file
  - This is done in lexicographic order by default, but you can sort numerically by specifying the -n option
  - The -r option will sort in reverse order

    $ sort -rn areacodes.txt | head -n 2
    989     Michigan
    985     Louisiana

- The uniq command will remove all duplicates from sorted input
  - The -c option will count the occurrences of the input data

    $ sort names.txt | uniq -c | sort -rn
         61 Smith
         31 Brown

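For comparison, the counting pipeline above has a close analogue in Python's standard library. The sketch below is an approximation rather than an exact equivalent: `count_names` is a hypothetical helper, and names with equal counts may come out in a different order than `sort -rn` would give.

```python
from collections import Counter

def count_names(lines):
    """Count occurrences of each non-empty line, most frequent first."""
    counts = Counter(line.strip() for line in lines if line.strip())
    return counts.most_common()

# Hypothetical input standing in for names.txt
print(count_names(['Smith', 'Brown', 'Smith']))
```

Each result is a (name, count) pair, with the most frequent name listed first.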
UNIX Command Line Tools: grep
- The grep command is used to filter input data
  - Case-sensitive by default; case-sensitivity can be disabled via the -i option

    $ grep -i smith addressbook.txt
    Smith, Jim      (314) 555-7234
    Smith, Linda    (415) 555-3678

  - The -v option will output only lines that do not match the pattern
- The egrep variant is similar, but supports extended regular expressions

    $ egrep -i 'sm[iy]the?' addressbook.txt
    Smith, Jim      (314) 555-7234
    Smith, Linda    (415) 555-3678
    Smithe, Joe     (213) 555-1395
    Smythe, Paula   (504) 555-6128

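The same pattern can be tried out with Python's `re` module, which helps when testing an expression before using it on the command line. The helper `grep` below is a hypothetical stand-in for the command, with `invert=True` playing the role of the -v option.

```python
import re

# Same expression as the egrep example, with -i expressed as a flag
PATTERN = re.compile(r'sm[iy]the?', re.IGNORECASE)

def grep(lines, pattern=PATTERN, invert=False):
    """Return lines that match the pattern, or those that do not if invert is set."""
    return [line for line in lines
            if bool(pattern.search(line)) != invert]

# Hypothetical address book entries
entries = ['Smith, Jim', 'Smythe, Paula', 'Jones, Al']
print(grep(entries))
print(grep(entries, invert=True))
```

The first call keeps the Smith and Smythe entries, and the inverted call keeps only the entry that does not match.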
Sqoop
- Sqoop exchanges data between a relational database and HDFS
  - Can import all tables, a single table, or a partial table into HDFS
  - Data can be imported in delimited text or Avro file format
  - Sqoop can also export data from HDFS to a database
- The following command imports a table from a MySQL database into HDFS as a tab-delimited file

    $ sqoop import \
        --connect jdbc:mysql://localhost/movielens \
        --username training --password training \
        --fields-terminated-by '\t' \
        --warehouse-dir /clouderamovies \
        --table movie

Hive
- Hive is an alternative to writing low-level MapReduce code
  - Users can analyze data stored in Hadoop via HiveQL
  - HiveQL is a declarative language very similar to SQL
- Hive does not turn your Hadoop cluster into a database
  - Instead, the Hive interpreter turns HiveQL into MapReduce jobs
  - Hive tables are simply directories of data stored in HDFS
  - The create table statement instructs Hive how to parse it
- Hive is especially useful for joining data

    SELECT customer.id, customer.name, sum(orders.cost)
    FROM customer INNER JOIN orders
      ON (customer.id = orders.customer_id)
    WHERE customer.zipcode = '63105'
    GROUP BY customer.id, customer.name;

Invoking Hive
- There are three main ways to execute HiveQL commands
  1. Directly, via the Hive shell
  2. By specifying a HiveQL command in a string
  3. By specifying a text file containing HiveQL code

    $ hive
    hive> select count(*) from mytable;
    912

    $ hive -e "select count(*) from mytable"
    912

    $ hive -f mycommands.hql
    912

NOTE: log and status messages normally seen when running Hive commands have been removed for brevity

Python Basics
- Python is a popular general-purpose programming language
- Similar to Java in some ways
  - High-level language
  - Supports object-oriented programming
  - Cross-platform (Linux, Macintosh, Windows, and other systems)
- Differs from Java in other ways
  - No compilation step required
  - Dynamically typed
    - This means you're not required to explicitly state what type of data a variable holds, as you do in Java
  - Uses whitespace (rather than braces) to delimit blocks of code

Accessing Python
- You can interact directly with the Python interpreter
  - The >>> represents the interpreter's prompt

    $ python
    >>> print "hello python"
    hello python

- However, it's more common to run source code from a file

    $ cat myprogram.py
    print "hello world"
    $ python myprogram.py
    hello world

Python: Strings
- A string can be defined inside double quotes (single quotes work as well)
  - In Python, strings can be treated as character arrays (see below)
  - Arrays in Python are indexed from zero, as in C or Java
- The len function returns the length of the string passed to it
- The strip() method removes leading and trailing whitespace

    >>> x = "example"
    >>> print x[0]
    e
    >>> print x[0:4]
    exam
    >>> x = " example "
    >>> print len(x)
    9
    >>> print len(x.strip())
    7

Python: Loops and Conditional Expressions
- Python uses whitespace to denote blocks of code
  - Be careful how you indent things!
- Loops and conditional expressions both contain a colon

    x = "this is a test"
    for word in x.split():
        if word == "test":
            print "It's a test!"
        else:
            print "Not a test"
