Spark for Metadata Discovery
Who We Are
Metadata Discovery and Challenges
Spark Solution
Summary
Scalable Machine Learning Pipeline for Metadata Discovery from eBay Listings
Qing Zhang, Rui Li
eBay
Spark Summit 2016, June 6-8, San Francisco

Table of Contents
1 Who We Are
2 Metadata Discovery and Challenges
3 Spark Solution
4 Summary

eBay Structured Data
Metadata discovery and management
Listing classification
Catalog and mapping listing to product
Inventory insights

Metadata Zoom In
Important name-value pairs, e.g. brand: Dell
Selling flow item specifics
Search navigation
Powers internal applications

Metadata Discovery
memory box weiss yamaha fisher-price generic
modway duluth trading mek usa dnm other sahara club
gokey longhorn outdoor gear trax wolverine
sk spiderman vintage mixed orchard corset
Relies heavily on manual review
Unfamiliar candidates
The same candidate appears in multiple categories

Challenges in Metadata Discovery
Term         Site  Categories
scott james  US    Men’s Clothing: Blazers & Sport Coats
                   Men’s Clothing: Pants
                   Men’s Clothing: Casual Shirts
                   Men’s Clothing: Dress Shirts
tiella       US    Chandeliers & Ceiling Fixtures
                   Lighting Parts & Accessories
turf         US    Sports Mem, Cards & Fan Shop: Cards: Football
turf         UK    Collectables: Cigarette/Tea/Gum Cards: Cigarette Cards: Other Cigarette Cards

Data Driven Approach for Brand Discovery
Use seller-entered item specifics
Use supply and demand signals from sellers and buyers
Training data is available from previously reviewed candidates

Supervised Machine Learning Approach
Data: 35,000 metadata candidates previously reviewed by humans
Features: supply and demand signals
Prototypes with Python Scikit
Logistic regression, gradient boosting trees, random forest, etc.
Random forest: F1 = 0.878
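
The F1 figure quoted for the prototype is the harmonic mean of precision and recall. As a minimal sketch in plain Python, the helper below derives F1 from raw counts; `f1_score` here is a local illustrative function (not the scikit-learn one), and the counts are made up for the example rather than taken from the eBay data.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return F1, the harmonic mean of precision and recall, from raw counts."""
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that were found
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
example = f1_score(tp=80, fp=10, fn=30)
print(f"F1 = {example:.3f}")
```

With these counts, precision is 80/90 and recall is 80/110, so F1 works out to 160/200 = 0.8.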

Production Development
In the past: train offline, then implement the prediction component in production
File transfers and configuration are time-consuming
[Diagram: Dev, Local, Production]

Spark and Spark MLlib
[Diagram labels: MLlib, GraphX]
Spark provides powerful data processing APIs
MLlib is a comprehensive machine learning package powered by Spark
Regression, classification, clustering, dimensionality reduction, etc.
Efficient development in local mode, and flexible file access

The Machine Learning System
[Diagram: Spark, Spark MLlib; stages: Feature, Training, Prediction]

Model Training with MLlib
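import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.CrossValidator

// labelIndexer, featureIndexer, rf, labelConverter and paramGrid are
// defined elsewhere; the slide does not show them.
val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, featureIndexer, rf, labelConverter))

// Scores predictions against the indexed label column (default metric: F1).
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("indexedLabel")
  .setPredictionCol("prediction")

// 5-fold cross-validation over the parameter grid.
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(5)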
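
To make the cost of cross-validating over a parameter grid concrete, here is a rough sketch in plain Python (the prototype's language) rather than against the Spark API. The grid values, the parameter names `num_trees` and `max_depth`, and the helpers `k_fold_indices` and `expand_grid` are all hypothetical; the slide does not show the actual `paramGrid`. MLlib assigns rows to folds randomly, so the contiguous folds below are only there to keep the sketch deterministic.

```python
from itertools import product


def k_fold_indices(n_rows, k):
    """Yield (train_idx, validation_idx) pairs for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, validation
        start += size


def expand_grid(grid):
    """Turn {'param': [values]} into one dict per parameter combination."""
    names = sorted(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]


# Hypothetical grid; the real paramGrid is not shown on the slide.
param_grid = {"num_trees": [50, 100], "max_depth": [5, 10, 15]}
candidates = expand_grid(param_grid)
folds = list(k_fold_indices(n_rows=10, k=5))

# Every candidate setting is trained once per fold.
total_fits = len(candidates) * len(folds)
print(f"{len(candidates)} settings x {len(folds)} folds = {total_fits} fits")
```

Under these assumptions, 6 parameter combinations and 5 folds mean 30 model fits, which is why training time grows quickly with the size of the grid.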

Evaluations
Model                                     F1
Python Scikit prototype                   0.878
MLlib local                               0.865
MLlib Hadoop (200 executors, production)  0.862
MLlib Hadoop (400 executors)              0.861
MLlib Hadoop (50 executors)               0.857
MLlib Hadoop (2 executors)                0.862
The performance variations among implementations are acceptable
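
One way to quantify that variation is to measure each MLlib setup against the Python prototype. The sketch below uses only the F1 values listed above; treating the prototype as the baseline is an assumption made for this illustration.

```python
# F1 values copied from the evaluation table.
f1_by_setup = {
    "Python Scikit prototype": 0.878,
    "MLlib local": 0.865,
    "MLlib Hadoop (200 executors, production)": 0.862,
    "MLlib Hadoop (400 executors)": 0.861,
    "MLlib Hadoop (50 executors)": 0.857,
    "MLlib Hadoop (2 executors)": 0.862,
}

# Assumption: the prototype is the reference point for the comparison.
baseline = f1_by_setup["Python Scikit prototype"]

# Gap to the baseline, rounded to the three decimals the table reports.
gaps = {name: round(baseline - f1, 3) for name, f1 in f1_by_setup.items()}
largest_gap_setup = max(gaps, key=gaps.get)

for name, gap in gaps.items():
    print(f"{name}: {gap:.3f} below baseline")
```

The largest gap is 0.021 (the 50-executor run), and the production configuration sits 0.016 below the prototype.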

Speed
Stage               Data Size     Time
Feature Generation  1.73 Billion  6 min
Train               33,000        8 min
Prediction Input    650,000       4 min
Capable of running the job daily
Speeds up the metadata discovery process from months to days
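
The timings above can be added up to see why a daily run is feasible. This is a back-of-the-envelope sketch: it assumes the three stages run one after another and that the 1.73 billion figure counts input records, neither of which the slide states explicitly.

```python
# Stage timings copied from the table above, in minutes.
stage_minutes = {"Feature Generation": 6, "Train": 8, "Prediction Input": 4}

# Assumption: stages run sequentially, so the times simply add up.
total_minutes = sum(stage_minutes.values())

# Feature generation input size from the table; the unit is assumed to be records.
feature_records = 1.73e9
records_per_second = feature_records / (stage_minutes["Feature Generation"] * 60)

print(f"End-to-end: {total_minutes} min")
print(f"Feature generation: about {records_per_second:,.0f} records/s")
```

Under those assumptions a full run takes 18 minutes, and feature generation processes roughly 4.8 million records per second.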

Newly Discovered Brand
Brand       Probability
Milkies     0.84
BEABA       0.85
OXO         0.83
Lorex       0.87
Plan Toys   0.85
Safety 1st  0.82
Blabla      0.81
Combi       0.88
Graco       0.88
TotsBots    0.85
Realtree    0.85

Summary
Spark enables fast iteration in ML application development
MLlib is comprehensive and well integrated with the Spark framework
Develop and test locally, with straightforward production deployment
Compact code: 600 lines
Need a better understanding of the ML algorithm implementations in MLlib

Acknowledgement
Thejas Durgam
Anu Mandalam
Meital Tahar Zahav & eBay SDO Team
Jean-David Ruvini

Thank You!
Qing Zhang, qzhang12@ebay.com
Rui Li, ruili1@ebay.com
