Advanced Analytics at Macys.com
Daqing Zhao, PhD
Director of Advanced Analytics, Macys.com
June 2, 2014
Agenda
• Big data analytics and traditional BI
• Challenges and solutions of big data predictive modeling
• Macy’s Advanced Analytics Team
• Our analytics projects
• Personalized site recommendations
• Response propensity models
• Best practices of analysts and modeling
Daqing Zhao, PhD
• There are two types of data scientists
– “DATA scientist” for Big Data infrastructure
– “data SCIENTIST” for Big Data domain problems
• A Big Data scientist with deep domain knowledge
• Academic training
– Analyzed molecular spectra on Cray supercomputers
– Determined, modeled, simulated molecular motions in 3D space
• Worked on computational Internet marketing since 1999
Big data, Big Opportunities
• Thanks to Moore’s law, on CPU, storage, network connections
• Too much data, too little knowledge
• Data and analytics have changed every field many times over
• From science and government to commerce
Traditional BI process
• Data can be accessed and analyzed only after ETL
• Schema definition may not be optimal
[Figure, credited to Baseline Consulting. Levels in the order shown: Knowledge Discovery; Segmentation and Predictive Modeling; Multidimensional Report; Standard Report; Schema definition, ETL into RDBMS. Annotation: "Most companies stay in this area"]
Modeling in the Big Data era
• Challenges:
– Modeling needs to scale
– Timeliness of models
– It takes time to integrate
– Test and Experimentation
• Solutions:
– Big data warehouse solutions
– Separation of concerns
– Scalable modeling tools
– Best practices in modeling
Modeling needs to scale
• Traditional predictive models take a long time to build
• Now data are cheap and models may degrade in weeks
– The dimension of the predictors is very large
– The number of categories is large
• Human interactive model building is not scalable
• Reasons for target events are complex
• Without detailed analysis, it is unclear what drives the event
• We need to rely on out-of-sample testing, time-shifted testing, and off-the-shelf modeling
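Time-shifted testing can be illustrated with a minimal sketch: train on events before a cutoff date and test on events on or after it, so the test period is strictly later than the training period. The row layout, the `time_shifted_split` helper, and the sample data are assumptions made for illustration, not part of the original deck.

```python
from datetime import date

def time_shifted_split(rows, cutoff):
    """Split (event_date, features, label) rows by date.

    Rows before the cutoff go to training; rows on or after it go to testing.
    """
    train = [r for r in rows if r[0] < cutoff]
    test = [r for r in rows if r[0] >= cutoff]
    return train, test

# Hypothetical event-level data: (date, features, label)
rows = [
    (date(2014, 1, 5), {"visits": 3}, 0),
    (date(2014, 2, 9), {"visits": 7}, 1),
    (date(2014, 3, 2), {"visits": 1}, 0),
    (date(2014, 4, 20), {"visits": 5}, 1),
]

train, test = time_shifted_split(rows, cutoff=date(2014, 3, 1))
print(len(train), "training rows,", len(test), "test rows")
```

Unlike a random holdout, this kind of split also exposes models whose performance degrades as time passes.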
Importance of timeliness of models
• Traditionally, models were built only if their predictive power persisted for a long time
• The performance of some models degrades quickly
• If we cannot build and update models in time
– We cannot benefit from many useful patterns
• If we don’t use models or use outdated models
– We would be driving blindfolded
It takes time to integrate
• Make sure the data are in place
• Measurement and attribution
• Start conversations about model based decisions
• Teams need to think in model metrics
• Organization needs to adapt
• Accumulate assets of creatives, best practices
Test and Experimentation
• Testing and experimentation are key to success
• Customer response behavior is complex
• Test new versions, new models, new messages
• Do split traffic tests for web or email
• Find the winners, and learn from the results
• Test design problems are common, and their implications need to be understood
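One common way to pick a winner in a split traffic test is a two-proportion z-test on conversion rates. The sketch below is a generic textbook version, not necessarily the method used by the team; the function name and the traffic numbers are hypothetical.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates B - A."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both versions convert equally
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical email test: version A vs. version B, 10,000 recipients each
z, p_value = two_proportion_z_test(conv_a=200, n_a=10_000, conv_b=260, n_b=10_000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

The test assumes traffic was split at random and that the sample size was fixed in advance; violating either is a typical test design problem.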
Big data warehouse
• Data size is larger than what databases can handle
• Scanning terabytes of data may take hours
• Solution requires a cloud of servers with local storage
– Read, process and write intermediate results in parallel
– Aggregate at the end
• Cloud computing can build models at scale
• Throughput often scales linearly with the number of servers
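The "process partitions in parallel, aggregate at the end" pattern can be sketched on a single machine. Here threads stand in for servers and small in-memory lists stand in for locally stored partitions; both are simplifying assumptions, and the function names and data are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_partition(events):
    """Per-partition step: count events by category within one partition."""
    return Counter(category for category, _ in events)

def aggregate(partials):
    """Final step: merge the intermediate counts from all partitions."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Hypothetical partitions of (category, quantity) events
partitions = [
    [("mens", 1), ("shoes", 1)],
    [("mens", 1), ("home", 1)],
    [("shoes", 1), ("mens", 1)],
]

# Each partition is processed independently; results are combined only at the end
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(count_partition, partitions))

totals = aggregate(partials)
print(dict(totals))
```

Because no partition depends on another until the final merge, adding workers reduces wall-clock time for the per-partition step, which is where the near-linear scaling comes from.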
Separation of concerns
• Solution complexity
• Data complexity
• Variability of requirements
• Standard data mining algorithms
• Availability
• Reliability, scalability, latency
• CPU, heap and disk IO issues
[Diagram labels: Platform Engineering; Data Scientist]
Scalable modeling tools
• Out of sample testing, cross validation
• Fast and scalable modeling algorithms
• Model comparisons and selections, model management tools
• Automated model optimization tools
• Penalize models that are unnecessarily large
• Ensemble models
• Robust models that handle missing variables and outliers
• Convenient model building environment
• Graphical tools
• Model deployment tools
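Cross validation, the first item on this list, can be sketched in a few lines. The example below uses contiguous folds and a deliberately trivial baseline model (predict the training mean); the helper names and data are illustrative assumptions, and in practice rows would usually be shuffled before folding.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds over range(n)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size

def cross_validated_mse(y, k):
    """Out-of-sample mean squared error of a model that predicts the training mean."""
    errors = []
    for train_idx, test_idx in k_fold_indices(len(y), k):
        mean = sum(y[i] for i in train_idx) / len(train_idx)
        errors.extend((y[i] - mean) ** 2 for i in test_idx)
    return sum(errors) / len(errors)

y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # hypothetical target values
folds = list(k_fold_indices(len(y), 3))
cv_mse = cross_validated_mse(y, 3)
print("cross-validated MSE:", cv_mse)
```

Every row is scored exactly once by a model that never saw it, which is what makes the error estimate usable for comparing and selecting models.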
Best practices in big data modeling
• Understand how the data are collected, what data
can and cannot be collected
• Balance cost of collecting data and optimize modeling
• Model performance depends on quality of data
• Use automated, robust model building solutions
• Use feedback loop to test hypotheses
• Do simulations to see if changes are reasonable
• Good ideas are not necessarily complicated
• Focus on domain knowledge,
not just data mining tools
Macys.com’s Advanced Analytics
• We are at the frontiers of Big Data science
• We have predictive modeling, experimental design and data
science teams
• Our team members have very strong backgrounds in quantitative fields: math, statistics, physics, bioinformatics, decision sciences, and computer science
• We collaborate with systems and IT teams internally as well as 3rd party vendors like WibiData, SAS Research, IBM Research…
• We use a wide range of tools: Hadoop, SAS, SAP/KXEN, R, Mahout, and others
• We are data scientists with a keen focus on domain problems
Customer acquisition and retention
• Targeting the right message to the right customer at the
right time
– Build predictive models of purchase behavior and identify
drivers
• Site recommendation algorithms
– Most work is in batch mode, expanding slowly into real time
• Rapid-prototyping and testing of algorithms and policies
• The output of the team’s work supports other marketing teams in identifying and reaching the best customers
Some other projects
• Data organization or data munging
– Data collections, individual and event level, 360 degrees, …
– Segmentation of customers
– Customer value
– Multiple channel attribution
• Experimentation platform
– Both for site layout as well as contents and recommendations
• Forecast and optimization
– Prediction, simulation, search, and optimization
• Big data refinement and scalability
– Find new data sources, more efficient ways of accessing data, and
organizing and processing data
Product recommendations
Macys.com’s site personalization
Customer segmentation
• Demographic
• Socio-economic
• Behavioral
• Values and styles
• Channels
• Modality
Product social network
• Demographic
• Style
• Size
• Brand
• Price range
• Season
Who gets which email?
Propensity Models
• We are building an expanding family of models, at Category/Brand/Outfit…
• Predictors, from up to two years of observation: whether the customer, by category, bought Online/Store, browsed, A2B, Email open/clicks, gender, age, store proximity, Personicx, recency/frequency/spent…
• Target: predict whether the customer will buy, for example, Men’s Clothes in the next 4 weeks
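The setup above (up to two years of observation, a 4-week target window) can be sketched as a function that turns a customer's purchase history into one training example. The `build_example` helper, the field names, and the sample purchases are hypothetical; only recency, frequency, and spend are computed, and the other predictors named on the slide are left out.

```python
from datetime import date, timedelta

OBSERVATION_DAYS = 730  # up to two years of observation
TARGET_DAYS = 28        # predict purchases in the next 4 weeks

def build_example(purchases, category, as_of):
    """Build (features, label) for one customer and one category.

    purchases: list of (purchase_date, category, amount) tuples.
    """
    obs_start = as_of - timedelta(days=OBSERVATION_DAYS)
    target_end = as_of + timedelta(days=TARGET_DAYS)
    # Only purchases inside the observation window may feed the features
    history = [p for p in purchases if obs_start <= p[0] < as_of and p[1] == category]
    features = {
        "frequency": len(history),
        "spent": sum(p[2] for p in history),
        "recency_days": (as_of - max(p[0] for p in history)).days if history else None,
    }
    # Label: did the customer buy in this category during the target window?
    label = int(any(as_of <= p[0] < target_end and p[1] == category for p in purchases))
    return features, label

purchases = [
    (date(2013, 11, 20), "mens_clothes", 80.0),
    (date(2014, 3, 15), "mens_clothes", 45.0),
    (date(2014, 5, 10), "shoes", 60.0),
    (date(2014, 6, 12), "mens_clothes", 30.0),
]
features, label = build_example(purchases, "mens_clothes", as_of=date(2014, 6, 2))
print(features, label)
```

Keeping the observation window strictly before `as_of` and the target window strictly after it prevents the label from leaking into the predictors.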
Data are most important
• In finding insights and building models, finding the key data is most important
– Identify the smoking gun
– Data definition and quality
• Data transformations
– PageRank is a game-changing data transformation
– The social graph is a key data transformation for credit card fraud detection
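To make the PageRank point concrete, here is a minimal power-iteration sketch that transforms a raw link graph into a single importance score per node. It is a textbook version with a conventional damping factor of 0.85, not a production implementation; the toy graph is hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank scores by power iteration.

    links: dict mapping each node to the list of nodes it links to.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the same base share from random jumps
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical link graph
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
for node, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(node, round(score, 4))
```

The resulting score can then be used as an ordinary predictor, which is the sense in which a graph computation acts as a data transformation.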
Every wrong data set is wrong in its own way
• Some data are not collected because they are considered “too big” or “useless”, as in flood control or purged log data
• Some data feeds to the warehouse are incomplete
• Multiple definitions, inconsistent business rules, and no documentation
• Some data are incomplete due to the nature of the business
• Some flaws are easy to catch, such as missing or constant values
• Some flaws are hard to find, such as partially missing or incorrect values
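The easy-to-catch flaws can be screened automatically. The sketch below profiles each column for its missing rate and for constant values; the `profile_columns` helper and the sample rows are hypothetical, and incorrect values that look plausible would not be caught by a check like this.

```python
def profile_columns(rows):
    """Report the missing rate and a constant-value flag for every column.

    rows: list of dicts; a value of None (or an absent key) counts as missing.
    """
    columns = sorted({key for row in rows for key in row})
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        report[col] = {
            "missing_rate": 1 - len(present) / len(values),
            # A fully missing column is also flagged as constant
            "constant": len(set(present)) <= 1,
        }
    return report

rows = [
    {"customer_id": 1, "channel": "web", "age": 34, "store_id": None},
    {"customer_id": 2, "channel": "web", "age": None, "store_id": None},
    {"customer_id": 3, "channel": "web", "age": 51, "store_id": None},
    {"customer_id": 4, "channel": "web", "age": 29, "store_id": None},
]
report = profile_columns(rows)
for col, stats in report.items():
    print(col, stats)
```

A partially missing column such as `age` here only shows up as a nonzero missing rate, so thresholds and follow-up analysis are still needed to decide whether it is a real flaw.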
More analysis leads to better quality
[Figure: stages Data Collection, Exploratory Analysis, Predictive Modeling, Decision Algorithms; label: Better data quality]
Concluding thoughts
• Big data presents big opportunity and big challenges
• Data science is not about data, but domain solutions
• Modeling in the Big Data era is different from traditional practices
• Organizations need to adapt to model based decisions
• Data are not clean until thoroughly analyzed
• Scalable and efficient modeling tools are essential


TDWI Solution Summit San Diego 2014 Advanced Analytics at Macys.com
