
23 Big Data and Data Wrangling


The Roots: mid-1990s

• Relational DBMS market was maturing


• The World Wide Web had just come on the scene
• Shared-Memory servers challenged by shared-nothing clusters
• DBMS market had been seeing this since the late 1980s
• General-purpose markets beginning to follow
• Berkeley NOW (Network of Workstations) project popularizing the idea
Search Engine Experience, Mid-1990s
• Eric Brewer and students at Berkeley exploring web search
• Tried using shared-nothing transactional databases
• Lesson #1: Transactions can hamper Availability at scale
• Node failure increasingly common as you scale up #nodes
• Distributed transactions (2PC) require all nodes to participate!
• Built a custom system for search: Inktomi
• And a company that at one time was market leader
• Lesson #2: It’s OK to be inconsistent in web search
• The story of the Berkeley->Foster City move
Google Infrastructure papers, Early 2000s
• Google, following Inktomi, built in a shared-nothing fashion
• Lots of custom code for large, data-centric tasks
• E.g. build an inverted index from a web crawl
• Need to store lots of really big files
• Google File System (GFS) 2003
• Need to write data-parallel programs on those files
• MapReduce 2004
Why Not an RDBMS? Business Reasons
• Can’t commit to an Internet service business powered by 3rd-party engines
• Relational database pricing model not suited
• No open-source parallel DBMS!
Why not an RDBMS? Technical Reasons
• No transactions, please!
• High availability at scale was not a viable option in RDBMSs of the time
• Scale: #Queries, size of data, #machines beyond RDBMS market
• 100 PB index, 3.5B searches per day, >2.5M servers (2016/17)
• Lots of non-relational data to manage
• Inverted indexes in RDBMS not that efficient (see earlier lectures)
• Cache of full-text html, images, video, …
• Programming model & Not-Invented-Here (NIH)
• They *could* have built around a custom SQL core
• Systems engineering culture tends to result in hand-coding in lower-level languages
Google File System
• Large, append-only files with sequential scan access
• Byte-stream based (like UNIX), not record based (like System R)
• Focus on:
• Scalability
• Reliability
• Availability
• No indexing
Block Placement
Objectives: load balancing, fast access, fault tolerance

[Diagram: blocks 1, 2, and 3 each replicated three times across Nodes 1–5]

• Default placement policy: (e.g. replication factor = 3)
• First copy is written to the node creating the file (write affinity)
• Second copy is written to a data node within the same rack
(to minimize cross-rack network traffic)
• Third copy is written to a data node in a different rack
(to tolerate switch failures)
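The three-step policy above can be sketched directly. This is an illustrative sketch, not GFS/HDFS code; the node and rack names are invented, and real placement also weighs disk usage and load:

```python
import random

def place_replicas(writer_node, nodes_by_rack, replication=3):
    """Sketch of the default placement policy: nodes_by_rack maps a rack
    name to the list of node names in that rack; writer_node is the node
    creating the file. Returns one node per replica."""
    # Locate the writer's rack.
    writer_rack = next(r for r, ns in nodes_by_rack.items() if writer_node in ns)

    # First copy: the node creating the file (write affinity).
    replicas = [writer_node]

    # Second copy: another node in the same rack (minimize cross-rack traffic).
    same_rack = [n for n in nodes_by_rack[writer_rack] if n != writer_node]
    replicas.append(random.choice(same_rack))

    # Third copy: a node in a different rack (tolerate switch failures).
    other_racks = [n for r, ns in nodes_by_rack.items() if r != writer_rack for n in ns]
    replicas.append(random.choice(other_racks))

    return replicas[:replication]
```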
GFS Architecture

[Diagram: a single NameNode (with a BackupNode standing by) coordinates many DataNodes: heartbeat, balancing, replication, etc.]

Failures, Failures, Failures
• GFS paper: “Component failures are the norm rather than the exception.”

Failure types:
• Disk errors and failures
• DataNode failures
• Switch/Rack failures
• NameNode failures
• Datacenter failures
MapReduce
• In many ways a reinvention of things we’ve seen
• Shared-nothing architecture
• Partition parallelism
• No pipeline parallelism (!!)
MapReduce Data and Programming Model
• Data model: (Key, Value) pairs
• Opaque (user-interpreted) data types

• Map function: (K_in, V_in) → list(K_inter, V_inter)
• Single-node, one input record at a time

• Reduce function: (K_inter, list(V_inter)) → list(K_out, V_out)

MapReduce Data Flow

[Diagram: word count. Three Map tasks read splits from the global file system ("the quick brown fox", "the fox ate the mouse", "how now brown cow") and emit (word, 1) pairs to local disks; two Reduce tasks fetch the pairs grouped by word and write totals such as (the, 3), (fox, 2), and (brown, 2) back to the global file system.]
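The word-count flow above can be sketched sequentially in plain Python. This is a single-process sketch: the shuffle is simulated by sorting intermediate pairs, where a real deployment would partition them across nodes:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(_, line):
    # Map: emit (word, 1) for each word in one input record.
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce: sum the counts grouped under one intermediate key.
    yield (word, sum(counts))

def mapreduce(records):
    # Map phase over all (key, value) input records.
    inter = [pair for k, v in records for pair in map_fn(k, v)]
    # "Shuffle": sort so equal keys are adjacent, then group by key.
    inter.sort(key=itemgetter(0))
    out = []
    for key, group in groupby(inter, key=itemgetter(0)):
        out.extend(reduce_fn(key, (c for _, c in group)))
    return dict(out)
```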


Basic MapReduce Control Flow

[Diagram: a Master coordinating Workers over a shared FileSystem]

1. Master identifies data needed from FS
2. Master creates execution plan assigning file splits & MR tasks to workers
3. Master submits tasks to workers
4. Each worker does the job and reports progress/status to Master
5. Master coordinates task phases: Map, Reduce, Map, Reduce, …
MapReduce Fault Recovery
• If a worker identifies a failed task:
• Re-fetch input and re-run
• Requires immutable input, deterministic functions, no side effects
• If the master identifies a failed node:
• Reassign its tasks to another node
• See above requirements
Common Problem: “Stragglers”
• Misconfigured/broken nodes lead to slow tasks
• Zipfian “tail latency”: high probability that a few tasks will take a very long time
• No pipelining, so no further work until all stragglers are done

[Histogram: # of tasks vs. latency, showing a long tail]
• Solution: just use the fault tolerance mechanism!
• As the MapReduce phase is close to finishing, simply reassign remaining in-progress tasks to additional nodes
• Whichever finishes first is used!
• 44% speedups reported in the MapReduce paper
• Spend resources to improve completion time
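A back-of-the-envelope model of the speculative-execution idea above. The launch threshold (90% of tasks done) and the time units are illustrative assumptions, not figures from the MapReduce paper:

```python
def finish_time(durations, backup_cost, threshold=0.9):
    """Toy model of speculative execution: once `threshold` of tasks have
    finished, each remaining (straggler) task is re-launched on another
    node; the straggler finishes at whichever copy completes first.
    Returns the time at which the whole phase completes."""
    done = sorted(durations)
    # Time at which the threshold fraction of tasks has completed,
    # i.e. when backup copies are launched for the rest.
    cutoff = done[int(len(done) * threshold) - 1]
    finishes = []
    for d in durations:
        if d <= cutoff:
            finishes.append(d)  # finished before backups were needed
        else:
            # Original copy vs. a backup started at `cutoff`: first one wins.
            finishes.append(min(d, cutoff + backup_cost))
    return max(finishes)
```

With nine 10-unit tasks and one 100-unit straggler, a 10-unit backup cuts phase completion from 100 units to 20: extra resources are spent to improve completion time.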
Hadoop
• Open source clone of Google infrastructure
• Started with Doug Cutting and Mike Cafarella as Nutch
• Scaled at Yahoo, released 2006
• Still evolving/maturing in open source
• HDFS a clone of GFS
• Hadoop MapReduce a clone of Google MapReduce
• Written in Java, rather than C++
• Permission granted for a generation of open source
backend software to be written in Java!
Big Data Mayhem Ensues in Research
• What else can we do with MapReduce?
• How can we tweak MapReduce implementations?
• MapReduce, MapReduce, MapReduce…

• My take at the time:


• “The best thing about MapReduce is that
everybody is interested in MapReduce”
• Finally, Computer Science has discovered data!
• They’ll figure out more about it in time…
DBMS Elders Less Patient
Big Data
• Phrase popularized by industry analysts and tech press
• Fostered discussion of new use cases as well as the new tech
• The 3 V’s of Big Data
• Volume
• Velocity
• Variety
Hadoop Market
• Open-source clone of GFS and MapReduce at Yahoo
• Written in Java
• Kickoff of the Big Data Industry
• Cloudera+Hortonworks
• > $1B in VC funding
• Both IPOed, recently merged (2018)
• MapR: the “other” Hadoop company
• Focus: a better filesystem
• Many more
• AWS, MS, Intel, IBM, EMC
Spark
• Began as a Berkeley research project
• Replacement for Hadoop MapReduce (but not HDFS)
• Originally a demo to show off another project (Mesos)
• Main innovations
• Cache in memory when possible, not on disk
• Scala API: more than just MapReduce
• Analogous to relational algebra, with a bit more
• Somewhat different FT mechanism
• Avoid checkpointing; recompute more aggressively
• Second wave of commercialization
• Databricks another ~$250M in VC
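The caching and recomputation ideas above can be illustrated with a toy, Spark-inspired class. This is not Spark's implementation; the method names (`map`, `filter`, `cache`, `collect`) merely echo its API:

```python
class Dataset:
    """Toy sketch of Spark's core ideas: transformations are recorded
    lazily as lineage (a recipe for recomputation), and results can be
    cached in memory instead of written to disk between stages."""

    def __init__(self, compute):
        self._compute = compute   # lineage: how to (re)build this dataset
        self._cache = None        # in-memory copy, if cache() was requested
        self._keep = False

    def map(self, f):
        # Lazy: nothing runs until collect() is called.
        return Dataset(lambda: [f(x) for x in self.collect()])

    def filter(self, p):
        return Dataset(lambda: [x for x in self.collect() if p(x)])

    def cache(self):
        self._keep = True
        return self

    def collect(self):
        if self._cache is not None:
            return self._cache      # reuse the in-memory result
        result = self._compute()    # recompute from lineage; this is also
        if self._keep:              # how lost results are recovered, rather
            self._cache = result    # than restoring from a checkpoint
        return result
```

For example, `squares = Dataset(lambda: list(range(10))).map(lambda x: x * x).cache()` computes once; later filters over `squares` reuse the cached list, which is what makes iterative workloads fast.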
As Time Has Passed
• SQL returned!
• Multiple implementations of SQL over MapReduce, Spark
• Hive, Presto, SparkSQL, …
• Still maturing, but the most common API for MapReduce/Spark
The DBMS, Decoupled!
A good thing?
A bad thing?

[Diagram: the classic DBMS stack split into independently developed components: API / Query Language (SQL), Query Optimizer (e.g. Greenplum ORCA), Dataflow Engine, Scheduler, Ingest, Workflow, Storage]
Big Data Usage: SQL, Transformation, and more
• Majority of workloads on MapReduce/Spark are written in SQL!
• Mostly Data Analytics: the kind of queries run on parallel DBMS
• But also in other APIs/languages – truly multi-model, multi-language
• E.g. basic machine learning over Spark is fairly common
• Large fraction of Big Data usage is for Data Transformation
• A.k.a. Data Wrangling
• Addresses the Variety aspect of Big Data
Data Wrangling
Enterprise Data Analysis and Visualization: An Interview Study

Interview study of 35 analysts at 25 companies

Sectors: Healthcare; Retail, Marketing; Social networking; Media; Finance, Insurance
Titles: Data analyst, Data scientist, Software engineer, Consultant, Chief technical officer

Kandel et al. “Enterprise Data Analysis and Visualization: An Interview Study.”
IEEE Visual Analytics Science & Technology (VAST), 2012
http://db.cs.berkeley.edu/papers/vast12-interview.pdf
“I spend more than half of my time integrating, cleansing and transforming
data without doing any actual analysis. Most of the time I’m lucky if I get to
do any ‘analysis’ at all…
… Most of the time once you transform the data ... the insights can be scarily
obvious.”
“Once you play with the data you realize you made an assumption that is
completely wrong. It’s really useful, it’s not just a waste of time, even though
you may be banging your head.”

“In practice it tends not to be just data prep, you are learning about the data at
the same time, you are learning about what assumptions you can make.”
The 80% Problem
It’s impossible to overstress this: 80% of the work
in any data project is in cleaning the data.

– DJ Patil, Data Jujitsu, O’Reilly Press 2012


So true it’s funny…
Data Wrangling
Aka Data Prep, Data Munging, Data Transformation
Assessing and transforming raw data to make it fit for use

Fit for what use? That depends!


Data Wrangling
Aka Data Prep, Data Munging, Data Transformation
Assessing and transforming raw data to make it fit for use

This is how you “get your head in the game”


• Understand what you have
• Assess strengths and weaknesses of your data
• Hypothesize about what to do with your data
• Get it ready

Nobody will know your data


as well as you do while wrangling
• Not even the “you” of a few days later
Stages of Wrangling
Raw: Data ingestion & discovery (“unboxing”)
• What: Exploratory ad hoc analysis
• Who: Data Analysts, Data Scientists

Refined: Curating data for reuse
• What: Data warehousing, canonical models
• Who: Data curators, stewards

Production: Ensuring feeds and workflows
• What: Recurrent, automated use cases:
• Traditional (e.g. reporting) + New (e.g. recommenders)
• Who: Data Engineers

Rattenbury et al. “Data Wrangling: Techniques and Concepts for Agile Analytics.” To appear, O’Reilly Media, 2017.
Today
We will focus on the “Raw➝Refined” stage
• Unboxing
• Transformation to analytics-ready structure
• Assessment/mitigation of quality issues

But first, some stage-setting from John Tukey


• Famous statistician
• Co-invented the FFT, coined the term “bit”
• Established the field of Exploratory Data Analysis (EDA)
• Father of modern Data Visualization
Data Analysis & Statistics, Tukey 1965
Some implications for effective data analysis are:
(1) that it is essential to have convenience of
interaction of people and intermediate results and
(2) that at all stages of data analysis, the nature
and detail of output, both actual and potential,
need to be matched to the capabilities of the
people who use it and want it.
Nothing - not the careful logic of mathematics, not
statistical models and theories, not the awesome
arithmetic power of modern computers - nothing
can substitute here for the flexibility of the
informed human mind.
Unboxing Data
• What do I have here?
• What do I want to do with it?

These questions rarely have pat answers.


• Typically contextual and user-driven
• Typically subject to iterative cycles of wrangling and analysis
Rough Guide to Wrangling Issues
• Structure: the “shape” of a data file
• Granularity: how fine/coarse is each datum
• Faithfulness: how well does the data capture “reality”
• Temporality: how is the data situated in time
• Scope: how (in)complete is the data

Many of these are subjective qualities! Depend on context.

(Data) Science is a human process.


Common Tools for Wrangling
• Open Source/Desktop
• UNIX commands
• Relational algebra and extensions for wrangling
• Python: the Pandas library
• R: the tidyr library
• Old-school Extract-Transform-Load systems (ETL)
• Box-and-Arrows programming, scalable backends
• Informatica, Talend (also Alteryx for desktop)
• Data Preparation tools
• Visualization + Transformation combined
• Trifacta, Google Dataprep, Paxata
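Whatever tool you pick, the transformations are the same in spirit. A minimal hand-coded sketch using only the Python standard library; the column names and cleaning rules here are invented for illustration:

```python
import csv
import io

def wrangle(raw_csv):
    """Toy wrangling pass over a CSV string: trim whitespace, drop rows
    whose (hypothetical) 'age' field is missing or unparseable, and cast
    it to int. Data-prep tools express this kind of pipeline visually."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    clean = []
    for row in reader:
        age = (row.get("age") or "").strip()
        if not age.isdigit():
            continue  # drop bad rows: one of many possible policies
        clean.append({"name": row["name"].strip(), "age": int(age)})
    return clean
```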
Wrangling with UNIX Command Line
• File metadata
• ls -lh
• file
• wc
• File (de)compression
• gunzip, zip, bzip, etc.
• stdout and the pipe
• File content:
• cat
• head
• tail
• less
• <ctrl>-C
• Learn this stuff!
Research Roots of Data Prep: Open Source Data Wrangler, 2011

+ Type Inference
+ Predictive Interaction
+ Immediate visual feedback
[Kandel, Heer & Hellerstein, CHI 11]
Why Are End Users Disconnected from Big Data? It’s an Interaction Problem
Hints of Intelligent Interfaces

Type-ahead uses context and data to predict search terms and preview results.
Predictive Interaction

[Diagram: a Guide/Decide loop coupling Visualization and Interaction with a Domain-Specific Language (DSL): the user interacts with a view, the system predicts and previews candidate transformations, the user picks a result, and code is generated and compiled over the data, replacing the write-code/compile/run cycle]

[Heer, Hellerstein & Kandel, CIDR 2015]
Demo in Google Cloud Dataprep
Data Wrangling Summary
• Make sure you learn basic UNIX tools
• If you must write wrangling scripts…
• Remember, they are code!
• Use version control, testing, other SW eng. methodologies!
• Consider upgrading to modern data prep tech
• You have more interesting code to write!
• Wrangling result can be a “view”, computed on the fly
• You can store it or not as a performance/freshness
decision
Big Data Summary
Big Data vs Parallel RDBMS?
• More similar than different
• Big Data analytics roughly a parallel RDBMS query engine over a big file system
• Big Data implementations break up a DBMS into components
• Increasingly common in new RDBMS implementations (e.g. in the cloud)
• Big Data offers more APIs than SQL
• Some similar efforts in RDBMS, but less popular/polished
• Fault tolerance
• RDBMS still tend to offer transactions, and rarely
handle mid-query failure
• Big Data focus on mid-query fault tolerance
“Big” Data System Design Questions
• Does your system need to be Google scale?
• Is there a tradeoff between scalability and other performance goals?
• Fault Tolerance: how often will one of your machines fail?
• Mid-task or between-task recovery?
• Ratio of task run-time to mean-time-to-failure
• Ratio of recovery time to mean-time-to-failure
• Do you need geo-distribution?
• Strong effect on latency!
• Answers to these questions affect many things:
• Data layout, transactions, query processing, recovery, scheduling …
“Big” Data Usage Questions
• Volume:
• Do you need more than a single computer?
• Probably you should use the cloud these days
• Choose between cloud database and Big Data infrastructure
• Today, can run both over raw files a la Big Data
• Indexing benefits etc. to loading into a cloud database
• Variety: When do you want to design schemas?
• Before loading data for a canonical view
• “Data Warehouse”, wrangle-to-load a la ETL
• After loading data, on a per-use-case basis
• “Data Lake”, wrangle per use case
• At time of query
• MapReduce, query via wrangling
More Big Data Usage Questions
• Velocity: Do you need fast updates?
• Probably use an “operational” DBMS
• Transactional DB is a good choice
• If you have very fast updates and no need for consistency, consider a NoSQL DB
• More on this next lecture!
• You can ETL the data into an analytic Data Warehouse in background
• See above
• Velocity: Do you have truly real-time streaming data?
• Are you sure? You are unusual!
• “Real time is for robots”!
• There are various open-source projects you can look at: Heron, Flink, Kafka, Spark Streaming, …
