A growing number of popular applications are becoming data-intensive. Over the past decade, the World Wide Web has been adopted as a platform for developing data-intensive applications, since the communication paradigm of the Web is sufficiently open and powerful. Data-intensive applications such as data mining and web indexing need to access ever-expanding data sets ranging from a few gigabytes to several terabytes or even petabytes. Google leverages the MapReduce model to process approximately twenty petabytes of data per day in parallel. In this talk, we introduce Google's MapReduce framework for processing huge datasets on large clusters. We first outline the motivations of the MapReduce framework. Then, we describe the dataflow of MapReduce. Next, we show a couple of example applications of MapReduce. Finally, we present our research project on the Hadoop Distributed File System.
The current Hadoop implementation assumes that computing nodes in a cluster are homogeneous in nature. Data locality has not been taken into
account for launching speculative map tasks, because it is
assumed that most maps are data-local. Unfortunately, both
the homogeneity and data locality assumptions are not satisfied
in virtualized data centers. We show that ignoring the data-locality issue in heterogeneous environments can noticeably
reduce the MapReduce performance. In this paper, we address
the problem of how to place data across nodes in a way that
each node has a balanced data processing load. Given a data-intensive application running on a Hadoop MapReduce cluster,
our data placement scheme adaptively balances the amount of
data stored in each node to achieve improved data-processing
performance. Experimental results on two real data-intensive
applications show that our data placement strategy can always
improve the MapReduce performance by rebalancing data
across nodes before performing a data-intensive application
in a heterogeneous Hadoop cluster.
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
1. HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters 02/28/11 Xiao Qin Department of Computer Science and Software Engineering Auburn University http://www.eng.auburn.edu/~xqin [email_address] Slides 2-20 are adapted from notes by Subbarao Kambhampati (ASU), Dan Weld (U. Washington), Jeff Dean, Sanjay Ghemawat, (Google, Inc.)
2. Motivation: Large-Scale Data Processing. Want to use 1000s of CPUs, but don't want the hassle of managing things. MapReduce provides: automatic parallelization & distribution; fault tolerance; I/O scheduling; monitoring & status updates.
3. Map/Reduce Map/Reduce Programming model from Lisp (and other functional languages) Many problems can be phrased this way Easy to distribute across nodes Nice retry/failure semantics
4. Distributed Grep [Diagram: very big data -> split data (x4) -> grep (x4) -> matches (x4) -> cat -> all matches]
5. Distributed Word Count [Diagram: very big data -> split data (x4) -> count (x4) -> per-split counts (x4) -> merge -> merged count]
6. Map Reduce. Map: accepts an input key/value pair, emits intermediate key/value pairs. Reduce: accepts an intermediate key/value* pair, emits output key/value pairs. [Diagram: very big data -> MAP -> partitioning function -> REDUCE -> result]
7. Map in Lisp (Scheme)
(map f list [list2 list3 ...])
(map square '(1 2 3 4)) => (1 4 9 16)
(reduce + '(1 4 9 16)) => (+ 16 (+ 9 (+ 4 1))) => 30
(reduce + (map square (map - l1 l2)))
Labels on the slide: unary operator (square), binary operator (+)
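For readers more familiar with Python than Scheme, the same expressions can be sketched as follows. This is only an illustrative translation: the helper name sum_squared_diff is hypothetical and does not appear on the slides. Python's functools.reduce folds from the left rather than in the nesting order shown on the slide, which gives the same result for addition.

```python
from functools import reduce
from operator import add, sub


def square(x):
    """Unary operator applied to every element by map."""
    return x * x


# (map square '(1 2 3 4)) => (1 4 9 16)
squares = list(map(square, [1, 2, 3, 4]))

# (reduce + '(1 4 9 16)) => 30
total = reduce(add, squares)


def sum_squared_diff(l1, l2):
    """(reduce + (map square (map - l1 l2))); hypothetical helper name."""
    return reduce(add, map(square, map(sub, l1, l2)), 0)
```

For example, sum_squared_diff([1, 2, 3], [3, 2, 1]) squares the element-wise differences -2, 0, 2 and sums them.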
8. Map/Reduce ala Google map(key, val) is run on each item in set emits new-key / new-val pairs reduce(key, vals) is run for each unique key emitted by map() emits final output
9. Count words in docs. Input consists of (url, contents) pairs. map(key=url, val=contents): for each word w in contents, emit (w, "1"). reduce(key=word, values=uniq_counts): sum all "1"s in the values list; emit result "(word, sum)".
10. Count, Illustrated. map(key=url, val=contents): for each word w in contents, emit (w, "1"). reduce(key=word, values=uniq_counts): sum all "1"s in the values list; emit result "(word, sum)". Example input: "see bob throw" and "see spot run". Intermediate pairs: see 1, bob 1, run 1, see 1, spot 1, throw 1. Output: bob 1, run 1, see 2, spot 1, throw 1.
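The word-count pseudocode can be tried out on a single machine with a small Python sketch. It is not Hadoop or Google code: the function names and the in-memory grouping step are assumptions made for illustration, and integer counts are used instead of the string "1" from the slide.

```python
from collections import defaultdict


def map_word_count(url, contents):
    """map(key=url, val=contents): emit (w, 1) for each word w."""
    for word in contents.split():
        yield (word, 1)


def reduce_word_count(word, counts):
    """reduce(key=word, values=counts): emit (word, sum)."""
    return (word, sum(counts))


def run_word_count(documents):
    """Single-process stand-in for the map, group-by-key, and reduce phases."""
    grouped = defaultdict(list)
    for url, contents in documents.items():
        for word, one in map_word_count(url, contents):
            grouped[word].append(one)
    return dict(reduce_word_count(w, c) for w, c in sorted(grouped.items()))


docs = {"doc1": "see bob throw", "doc2": "see spot run"}
result = run_word_count(docs)
```

With the two example documents, result holds the same counts as the illustrated output: bob 1, run 1, see 2, spot 1, throw 1.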
11. Grep. Input consists of (url+offset, single line) pairs. map(key=url+offset, val=line): if line matches regexp, emit (line, "1"). reduce(key=line, values=uniq_counts): don't do anything; just emit line.
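A minimal single-machine sketch of the grep job, assuming Python's re module for matching. The function names and the (file, offset) tuple keys are hypothetical. Because the matched line is the intermediate key, identical lines collapse into one output entry in this sketch.

```python
import re


def map_grep(key, line, pattern):
    """map(key=url+offset, val=line): emit (line, 1) if the line matches."""
    if pattern.search(line):
        yield (line, 1)


def reduce_grep(line, values):
    """reduce(key=line, values): identity step, just emit the line."""
    return line


def run_grep(lines_by_key, regexp):
    """Single-process stand-in for the distributed job."""
    pattern = re.compile(regexp)
    matches = {}
    for key, line in lines_by_key.items():
        for out_line, one in map_grep(key, line, pattern):
            matches.setdefault(out_line, []).append(one)
    return [reduce_grep(line, values) for line, values in matches.items()]


sample = {
    ("a.txt", 0): "hello world",
    ("a.txt", 12): "goodbye",
    ("b.txt", 0): "hello again",
}
matched = run_grep(sample, r"hello")
```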
12. Reverse Web-Link Graph Map For each URL linking to target, … Output <target, source> pairs Reduce Concatenate list of all source URLs Outputs: <target, list (source)> pairs
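The link-graph reversal follows the same pattern. The sketch below assumes the input is a dictionary from each source URL to the list of URLs it links to; that input shape and the function names are illustrative assumptions, not part of the slides.

```python
from collections import defaultdict


def map_links(source, targets):
    """Map: for each URL that source links to, emit <target, source>."""
    for target in targets:
        yield (target, source)


def reduce_links(target, sources):
    """Reduce: concatenate all source URLs for one target."""
    return (target, sorted(sources))


def reverse_link_graph(outlinks):
    """Single-process stand-in: returns {target: list(source)}."""
    grouped = defaultdict(list)
    for source, targets in outlinks.items():
        for target, src in map_links(source, targets):
            grouped[target].append(src)
    return dict(reduce_links(t, s) for t, s in grouped.items())


reversed_graph = reverse_link_graph({"a": ["b", "c"], "b": ["c"]})
```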
13. Model is Widely Applicable MapReduce Programs In Google Source Tree Example uses: distributed grep distributed sort web link-graph reversal term-vector / host web access log stats inverted index construction document clustering machine learning statistical machine translation ... ... ...
14. Implementation Overview Typical cluster: 100s/1000s of 2-CPU x86 machines, 2-4 GB of memory Limited bisection bandwidth Storage is on local IDE disks GFS: distributed file system manages data (SOSP'03) Job scheduling system: jobs made up of tasks, scheduler assigns tasks to machines Implementation is a C++ library linked into user programs
15. Execution How is this distributed? Partition input key/value pairs into chunks, run map() tasks in parallel After all map()s are complete, consolidate all emitted values for each unique emitted key Now partition space of output map keys, and run reduce() in parallel If map() or reduce() fails, reexecute!
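The execution flow described on this slide can be sketched in a few lines of Python. This is a toy, single-process model that uses threads in place of cluster machines; run_mapreduce and its parameters are hypothetical names, the partitioning function is assumed to be hash(key) mod the number of partitions, and re-execution of failed tasks is not modeled.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def run_mapreduce(records, map_fn, reduce_fn, num_chunks=3, num_partitions=2):
    """Toy model: chunk the input, map in parallel, shuffle, reduce in parallel."""
    records = list(records)
    # Partition the input key/value pairs into chunks.
    chunks = [records[i::num_chunks] for i in range(num_chunks)]

    def run_map_task(chunk):
        out = []
        for key, value in chunk:
            out.extend(map_fn(key, value))
        return out

    # Run the map tasks in parallel and wait for all of them to complete.
    with ThreadPoolExecutor() as pool:
        map_outputs = list(pool.map(run_map_task, chunks))

    # Consolidate emitted values per unique key, partitioning the key space.
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for output in map_outputs:
        for key, value in output:
            partitions[hash(key) % num_partitions][key].append(value)

    def run_reduce_task(partition):
        return {key: reduce_fn(key, values) for key, values in partition.items()}

    # Run one reduce task per partition in parallel.
    with ThreadPoolExecutor() as pool:
        reduce_outputs = list(pool.map(run_reduce_task, partitions))

    result = {}
    for part in reduce_outputs:
        result.update(part)
    return result


word_counts = run_mapreduce(
    [("u1", "a b a"), ("u2", "b c")],
    lambda url, text: [(w, 1) for w in text.split()],
    lambda word, ones: sum(ones),
)
```

Each key lands in exactly one partition, so the per-partition reduce outputs can be merged without conflicts.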
16. Job Processing [Diagram: JobTracker and TaskTracker 0 through TaskTracker 5]. Client submits "grep" job, indicating code and input files. JobTracker breaks the input file into k chunks (in this case 6) and assigns work to TaskTrackers. After map(), TaskTrackers exchange map output to build the reduce() keyspace. JobTracker breaks the reduce() keyspace into m chunks (in this case 6) and assigns work. reduce() output may go to NDFS.
19. Task Granularity & Pipelining Fine granularity tasks: map tasks >> machines Minimizes time for fault recovery Can pipeline shuffling with map execution Better dynamic load balancing Often use 200,000 map & 5000 reduce tasks Running on 2000 machines
20. MapReduce outside Google. Hadoop (Java) emulates MapReduce and GFS. The architecture of Hadoop MapReduce and DFS is master/slave.

           Master      Slave
MapReduce  jobtracker  tasktracker
DFS        namenode    datanode
21. Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters Download Software at: http://www.eng.auburn.edu/~xqin/software/hdfs-hc This HDFS-HC tool was described in our paper - Improving MapReduce Performance via Data Placement in Heterogeneous Hadoop Clusters - by J. Xie, S. Yin, X.-J. Ruan, Z.-Y. Ding, Y. Tian, J. Majors, and X. Qin, published in Proc. 19th Int'l Heterogeneity in Computing Workshop, Atlanta, Georgia, April 2010.
22. Hadoop Overview (J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. OSDI '04, pages 137–150, 2004)
23. One time setup set hadoop-site.xml and slaves Initiate namenode Run Hadoop MapReduce and DFS Upload your data to DFS Run your process… Download your data from DFS
29. Challenges Does computing ratio depend on the application? Initial data distribution Data skew problem New data arrival Data deletion New joining node Data updating
30. Measure Computing Ratios. Computing ratio: fast machines process larger data sets. [Diagram: time axis for Node A, Node B, and Node C; labels: 1 task/min, 2x slower, 3x slower]
31. Steps to Measure Computing Ratios
1. Run the application on each node with the same amount of data and record each node's response time.
2. Set the ratio of the node with the shortest response time to 1, and set the ratios of the other nodes relative to it.
3. Calculate the least common multiple of these ratios.
4. Compute each node's portion (number of file fragments) by dividing the least common multiple by the node's ratio.

Node    Response time (s)  Ratio  # of file fragments  Speed
Node A  10                 1      6                    Fastest
Node B  20                 2      3                    Average
Node C  30                 3      2                    Slowest
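The four steps can be checked with a short Python sketch. The function names are hypothetical, and response times are assumed to be whole seconds so that ratios can be kept as exact fractions; a non-integer ratio is handled by taking the least common multiple of the ratio numerators.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def computing_ratios(response_times):
    """Step 2: ratio of each node relative to the shortest response time."""
    fastest = min(response_times.values())
    return {node: Fraction(t, fastest) for node, t in response_times.items()}


def fragments_per_node(ratios):
    """Steps 3 and 4: LCM of the ratios, then each node's portion."""
    numerators = [r.numerator for r in ratios.values()]
    common = reduce(lambda a, b: a * b // gcd(a, b), numerators, 1)
    return {node: int(common / r) for node, r in ratios.items()}


times = {"Node A": 10, "Node B": 20, "Node C": 30}
ratios = computing_ratios(times)
fragments = fragments_per_node(ratios)
```

For the response times in the table (10 s, 20 s, 30 s), the ratios are 1, 2, and 3, their least common multiple is 6, and the fragment counts are 6, 3, and 2.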
32. Initial Data Distribution. Input files are split into 64 MB blocks. Round-robin data distribution algorithm. [Diagram: the namenode distributes the blocks of File1 (1-9, a-c) across datanodes A, B, and C; portion 3:2:1]
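One way to read "round-robin distribution with portion 3:2:1" is as a weighted round-robin, sketched below. This is an assumption about the algorithm, not the HDFS-HC implementation: the function name is hypothetical, and the assignment of shares 3, 2, and 1 to nodes A, B, and C is chosen for illustration.

```python
def distribute_blocks(blocks, portions):
    """Weighted round-robin: in each round a node receives `share` blocks."""
    placement = {node: [] for node in portions}
    # One full round of the schedule, e.g. A A A B B C for shares 3:2:1.
    schedule = [node for node, share in portions.items() for _ in range(share)]
    for i, block in enumerate(blocks):
        placement[schedule[i % len(schedule)]].append(block)
    return placement


# Twelve blocks labeled as in the diagram; shares per node are assumed.
initial = distribute_blocks(list("123456789abc"), {"A": 3, "B": 2, "C": 1})
```

With twelve blocks and shares 3:2:1, the nodes end up holding 6, 4, and 2 blocks respectively.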
33. Data Redistribution
1. Get the network topology, the ratio, and the utilization.
2. Build and sort two lists: the under-utilized node list L1 and the over-utilized node list L2.
3. Select the source and destination nodes from the lists.
4. Transfer data.
5. Repeat steps 3 and 4 until the list is empty.
[Diagram: the namenode moves blocks (1-9, a-c) among datanodes A, B, and C using lists L1 and L2; portion 3:2:1]
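The redistribution steps can be sketched in Python under simplifying assumptions: network topology is ignored, utilization is measured as the number of blocks a node holds relative to its target share, and the most over-utilized node always sends to the most under-utilized one. The function name and these selection rules are illustrative assumptions, not the HDFS-HC implementation.

```python
def redistribute(placement, portions):
    """Move blocks from over-utilized to under-utilized nodes until balanced."""
    total_blocks = sum(len(blocks) for blocks in placement.values())
    total_share = sum(portions.values())
    # Target block count per node, proportional to its share (rounded down).
    target = {n: total_blocks * portions[n] // total_share for n in placement}
    placement = {n: list(blocks) for n, blocks in placement.items()}
    moves = []
    while True:
        # Step 2: build and sort the over- and under-utilized node lists.
        over = sorted((n for n in placement if len(placement[n]) > target[n]),
                      key=lambda n: len(placement[n]) - target[n], reverse=True)
        under = sorted((n for n in placement if len(placement[n]) < target[n]),
                       key=lambda n: target[n] - len(placement[n]), reverse=True)
        # Step 5: stop once either list is empty.
        if not over or not under:
            break
        # Steps 3 and 4: pick source and destination, transfer one block.
        src, dst = over[0], under[0]
        block = placement[src].pop()
        placement[dst].append(block)
        moves.append((block, src, dst))
    return placement, moves


even = {"A": list("1234"), "B": list("5678"), "C": list("9abc")}
balanced, moves = redistribute(even, {"A": 3, "B": 2, "C": 1})
```

Starting from an even 4/4/4 layout with shares 3:2:1, two blocks move from C to A, leaving 6, 4, and 2 blocks on A, B, and C.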
34. Sharing Files among Multiple Applications The computing ratio depends on data-intensive applications. Redistribution Redundancy
35. Experimental Environment
Five nodes in a heterogeneous Hadoop cluster

Node    CPU model        CPU (Hz)  L1 cache (KB)
Node A  Intel Core 2 Duo  2*1G=2G   204
Node B  Intel Celeron     2.8G      256
Node C  Intel Pentium 3   1.2G      256
Node D  Intel Pentium 3   1.2G      256
Node E  Intel Pentium 3   1.2G      256
36. Grep and WordCount Grep is a tool searching for a regular expression in a text file WordCount is a program used to count words in a text file
37. Computing ratio for two applications
Computing ratios of the five nodes with respect to the Grep and WordCount applications

Computing node  Ratio for Grep  Ratio for WordCount
Node A          1               1
Node B          2               2
Node C          3.3             5
Node D          3.3             5
Node E          3.3             5
38. Response time of Grep and WordCount on each node. Application dependence. Data size independence.
42. Conclusion. Identified the performance degradation caused by heterogeneity. Designed and implemented a data placement mechanism in HDFS.
43. Future Work Data redundancy issue Dynamic data distribution mechanism Prefetching
44. Fellowship Program Samuel Ginn College of Engineering at Auburn University Dean's Fellowship: $32,000 per year plus tuition fellowship College Fellowship: $24,000 per year plus tuition fellowship Departmental Fellowship: $20,000 per year plus tuition fellowship. Tuition Fellowships: Tuition Fellowships provide a full tuition waiver for a student with a 25 percent or greater full-time-equivalent (FTE) assignment. Both graduate research assistants (GRAs) and graduate teaching assistants (GTAs) are eligible.
Cite here: What is Hadoop? http://www.cloudera.com/what-is-hadoop/ Hadoop is an open-source project administered by the Apache Software Foundation. Hadoop's contributors work for some of the world's biggest technology companies. That diverse, motivated community has produced a genuinely innovative platform for consolidating, combining and understanding data. Technically, Hadoop consists of two key services: reliable data storage using the Hadoop Distributed File System (HDFS) and high-performance parallel data processing using a technique called MapReduce. Hadoop runs on a collection of commodity, shared-nothing servers. You can add or remove servers in a Hadoop cluster at will; the system detects and compensates for hardware or system problems on any server. Hadoop, in other words, is self-healing. It can deliver data, and can run large-scale, high-performance processing jobs, in spite of system changes or failures.
Slow down, add animation of speculative task helping Note: Mention that we tested this on a heterogeneous Hadoop cluster
Copy page 7 to here Note: Please add legend on this diagram. E.g., black bars represent ******, red bars represent ***** Show data movement (migration)
Real results for the aforementioned motivational example.
Note: Explain what is the definition of computing ratio
Note: Name the over-utilized node list and the under-utilized node list on the diagram. a->A, b->B
The heterogeneity measurement of a cluster depends on the data-intensive application. If multiple MapReduce applications must process the same input file, the data placement mechanism may need to distribute the input file's fragments in several ways, one for each MapReduce application. When multiple applications are similar in terms of data processing speed, one data placement decision may fit the needs of all the applications.
Note: improve the resolution of the figures
Title: Application dependence; independence of data size. Note 1: improve the quality of the figures (large figures and large legends).