23 Big Data and Data Wrangling
Data Wrangling
The Roots: mid-1990s
Stragglers
[Figure: histogram of # of tasks vs. task completion time — a long tail of straggler tasks]
• A few straggler tasks will take a very long time
• No pipelining, so no further work until all stragglers are done → latency
• Solution: just use the fault tolerance mechanism!
• As the MapReduce phase is close to finishing, simply reassign remaining in-progress tasks to additional nodes (sketched below)
• Whichever finishes first is used!
• 44% speedups reported in the MapReduce paper
• Spend resources to improve completion time
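A toy single-machine sketch of this backup-task idea (not Google's implementation; task names and timings are made up):

```python
import concurrent.futures
import random
import time

def run_task(task_id: str) -> str:
    """Simulate one MapReduce task; roughly one in four is a straggler."""
    time.sleep(random.choice([0.1, 0.1, 0.1, 3.0]))
    return f"result-of-{task_id}"

def run_with_backup(task_id: str) -> str:
    """Run two copies of the task; return whichever finishes first.
    Tasks are deterministic and side-effect free, so either result works."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)
    copies = [pool.submit(run_task, task_id) for _ in range(2)]
    done, _ = concurrent.futures.wait(
        copies, return_when=concurrent.futures.FIRST_COMPLETED)
    # Abandon the slower copy (in this toy it still runs to completion).
    pool.shutdown(wait=False)
    return next(iter(done)).result()

print(run_with_backup("map-17"))
```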
Hadoop
• Open source clone of Google infrastructure
• Started by Doug Cutting and Mike Cafarella as Nutch
• Scaled at Yahoo, released 2006
• Still evolving/maturing in open source
• HDFS a clone of GFS
• Hadoop MapReduce a clone of Google MapReduce
• Written in Java, rather than C++
• Permission granted for a generation of open source backend software to be written in Java!
Big Data Mayhem Ensues in Research
• What else can we do with MapReduce?
• How can we tweak MapReduce implementations?
• MapReduce, MapReduce, MapReduce…
Storage
Big Data Usage: SQL, Transformation, and more
• Majority of workloads on MapReduce/Spark are written in SQL! (see the sketch below)
• Mostly Data Analytics: the kind of queries run on parallel DBMS
• But also in other APIs/languages – truly multi-model, multi-language
• E.g. basic machine learning over Spark is fairly common
• Large fraction of Big Data usage is for Data Transformation
• A.k.a. Data Wrangling
• Addresses the Variety aspect of Big Data
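To make the SQL bullet concrete, a minimal PySpark sketch of SQL run directly over raw files; the path and column names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-over-files").getOrCreate()

# Read raw files directly -- no load step. Path/columns are hypothetical.
events = spark.read.json("hdfs:///logs/events/*.json")
events.createOrReplaceTempView("events")

# The analytics job itself is just SQL:
top_pages = spark.sql("""
    SELECT url, COUNT(*) AS hits
    FROM events
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10
""")
top_pages.show()
```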
Data Wrangling
Enterprise Data Analysis and Visualization: An Interview Study
“In practice it tends not to be just data prep, you are learning about the data at the same time, you are learning about what assumptions you can make.”
The 80% Problem
It’s impossible to overstress this: 80% of the work in any data project is in cleaning the data.
+ Type Inference (toy sketch below)
+ Predictive Interaction
+ Immediate visual feedback
[Kandel, Heer & Hellerstein, CHI 11]
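A toy illustration of type inference by pattern matching — far simpler than the Wrangler work cited above; the patterns and the 90% match threshold are assumptions:

```python
import re

# Hypothetical, simplified rules; real tools use richer patterns
# plus user feedback.
TYPE_PATTERNS = [
    ("int",   re.compile(r"^-?\d+$")),
    ("float", re.compile(r"^-?\d+\.\d+$")),
    ("date",  re.compile(r"^\d{4}-\d{2}-\d{2}$")),
]

def infer_type(values: list[str]) -> str:
    """Guess a column's type: first pattern matching >=90% of
    non-empty values wins; otherwise fall back to 'string'."""
    non_empty = [v for v in values if v.strip()]
    for name, pattern in TYPE_PATTERNS:
        hits = sum(bool(pattern.match(v)) for v in non_empty)
        if non_empty and hits / len(non_empty) >= 0.9:
            return name
    return "string"

print(infer_type(["2021-01-05", "2021-02-14", ""]))  # -> date
print(infer_type(["12", "7", "oops"]))               # -> string (2/3 match < 90%)
```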
WHY ARE END USERS DISCONNECTED FROM BIG DATA?
IT’S AN INTERACTION PROBLEM
HINTS OF INTELLIGENT INTERFACES
[Figure: the Predictive Interaction loop — GUIDE/DECIDE: the user interacts with a View, the system predicts a Response and previews it, and the user picks a Result]
Visualization and Interaction
[Heer, Hellerstein, Kandel, CIDR 15]
Demo in Google Cloud Dataprep
Data Wrangling Summary
• Make sure you learn basic UNIX tools
• If you must write wrangling scripts…
• Remember, they are code!
• Use version control, testing, other SW eng. methodologies!
• Consider upgrading to modern data prep tech
• You have more interesting code to write!
• Wrangling result can be a “view”, computed on the fly
• You can store it or not as a performance/freshness decision (see the sketch below)
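For example, a wrangling script treated as versioned, testable code, with its result either recomputed on the fly or materialized (a pandas sketch; file and column names are hypothetical):

```python
import pandas as pd

def wrangle(raw: pd.DataFrame) -> pd.DataFrame:
    """The wrangling script is code: version it, test it."""
    out = raw.rename(columns=str.lower)
    out["amount"] = pd.to_numeric(out["amount"], errors="coerce")
    return out.dropna(subset=["amount"])

# As a "view": recompute on the fly, always fresh.
clean = wrangle(pd.read_csv("sales_raw.csv"))

# Or materialize it, trading freshness for performance.
clean.to_parquet("sales_clean.parquet")
```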
Big Data Summary
Big Data vs Parallel RDBMS?
• More similar than different
• Big Data analytics roughly a parallel RDBMS query engine over a big file system
• Big Data implementations break up a DBMS into components
• Increasingly common in new RDBMS implementations (e.g. in the cloud)
• Big Data offers more APIs than SQL
• Some similar efforts in RDBMS, but less popular/polished
• Fault tolerance
• RDBMSs still tend to offer transactions, and rarely handle mid-query failure
• Big Data systems focus on mid-query fault tolerance
“Big” Data System Design Questions
• Does your system need to be Google scale?
• Is there a tradeoff between scalability and other performance goals?
• Fault Tolerance: how often will one of your machines fail?
• Mid-task or between-task recovery?
• Ratio of task run-time to mean-time-to-failure
• Ratio of recovery time to mean-time-to-failure (worked example below)
• Do you need geo-distribution?
• Strong effect on latency!
• Answers to these questions affect many things:
• Data layout, transactions, query processing, recovery, scheduling …
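A back-of-envelope worked example of the MTTF ratios, with assumed numbers:

```python
# Back-of-envelope: does mid-query fault tolerance matter for you?
# All numbers below are assumptions for illustration.
machines = 10_000
mttf_hours = 365 * 24            # one failure per machine-year

# Failures arrive at roughly machines/MTTF per hour across the cluster.
failures_per_hour = machines / mttf_hours
print(failures_per_hour)          # ~1.14: about one failure every hour

job_hours = 6
print(job_hours * failures_per_hour)   # ~6.8 expected failures per job
# A 6-hour job on 10,000 machines almost surely hits failures, so
# restarting the whole query is hopeless: you want mid-query recovery.
# The same job on 10 machines expects ~0.007 failures: restarting is fine.
```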
“Big” Data Usage Questions
• Volume:
• Do you need more than a single computer?
• Probably you should use the cloud these days
• Choose between cloud database and Big Data infrastructure
• Today, can run both over raw files a la Big Data
• Indexing benefits etc. to loading into a cloud database
• Variety: When do you want to design schemas? (sketch below)
• Before loading data for a canonical view
• “Data Warehouse”, wrangle-to-load a la ETL
• After loading data, on a per-use-case basis
• “Data Lake”, wrangle per use case
• At time of query
• MapReduce, query via wrangling
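A minimal sketch of the schema-on-read end of this spectrum (data lake / query-via-wrangling), with made-up records:

```python
import json

# Raw records land in the lake as-is (schema-on-read); records are made up.
raw_lines = [
    '{"user": "ada", "amount": "12.5"}',
    '{"user": "bob"}',                    # missing field: fine at load time
]

def as_sale(record: dict) -> tuple[str, float]:
    """This use case's schema: (user, amount), imposed at query time."""
    return (record.get("user", "unknown"), float(record.get("amount", 0.0)))

sales = [as_sale(json.loads(line)) for line in raw_lines]
print(sales)   # [('ada', 12.5), ('bob', 0.0)]
# A warehouse (wrangle-to-load, ETL) would instead validate and cast
# every field once, before the data is loaded.
```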
More Big Data Usage Questions
• Velocity: Do you need fast updates?
• Probably use an “operational” DBMS
• Transactional DB is a good choice
• If you have very fast updates and no need for consistency, consider a NoSQL DB
• More on this next lecture!
• You can ETL the data into an analytic Data Warehouse in background
• See above
• Velocity: Do you have truly real-time streaming data?
• Are you sure? You are unusual!
• “Real time is for robots”!
• There are various open-source projects you can look at: Heron, Flink, Kafka, Spark Streaming, … (a toy windowed-count sketch follows)
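To give a flavor of what those systems compute, a toy single-node tumbling-window count; real engines do this distributed and fault-tolerant, with out-of-order event handling:

```python
from collections import Counter
from typing import Iterable, Iterator

def windowed_counts(events: Iterable[tuple[float, str]],
                    window_secs: float) -> Iterator[tuple[float, Counter]]:
    """Tumbling-window counts over a (timestamp, key) stream,
    assuming events arrive in timestamp order."""
    window_start, counts = None, Counter()
    for ts, key in events:
        if window_start is None:
            window_start = ts
        while ts >= window_start + window_secs:
            yield window_start, counts     # emits empty windows too
            window_start, counts = window_start + window_secs, Counter()
        counts[key] += 1
    if counts:
        yield window_start, counts

stream = [(0.2, "a"), (0.7, "b"), (1.1, "a"), (2.5, "a")]
for start, counts in windowed_counts(stream, window_secs=1.0):
    print(start, dict(counts))
```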