Parallel & Distributed Computing
Serial Computing
Instructions are executed one at a time on a single processor.
Parallel Computing
A problem is broken into parts that are executed simultaneously on multiple processors; all processors may have access to a shared memory to exchange information between processors.
Distributed Computing
Multiple autonomous computers communicate over a network; each processor has its own private memory, and information is exchanged by passing messages.
Computer Cluster
A set of loosely connected computers that work together so that, in many respects, they can be viewed as a single system.
MapReduce
What is MapReduce?
o Framework developed by Google for processing parallelizable problems across large data sets using a large cluster of computers (nodes)
o Programming paradigm: splits a task into smaller subtasks that can be executed in parallel and therefore run faster than execution on a single computer (see the sketch below)
o Simplified data processing on large clusters: a large server farm can use MapReduce to sort a petabyte of data in only a few hours
o Computational processing can occur on data stored either in a file system (unstructured) or in a database (structured)
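A minimal, single-process Java sketch of the map/shuffle/reduce idea, using word count; a real MapReduce run distributes the same three phases across many nodes, and the input lines here are illustrative only.

    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class WordCountSketch {
        public static void main(String[] args) {
            List<String> lines = List.of("the quick brown fox", "the lazy dog");

            Map<String, Long> counts = lines.parallelStream()            // "map": process records in parallel
                    .flatMap(line -> Arrays.stream(line.split("\\s+")))  // emit one token per word
                    .collect(Collectors.groupingBy(w -> w,               // "shuffle": group values by key
                             Collectors.counting()));                    // "reduce": total per key

            counts.forEach((word, n) -> System.out.println(word + "\t" + n));
        }
    }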
MapReduce Overview
MapReduce Architecture
Job Scheduling System
o Jobs are made up of tasks; the master scheduler assigns tasks to slave machines (nodes), making work easy to distribute across nodes
o Input and final output are stored on a distributed file system
o Master pings slaves periodically to detect failures
o Slaves send heartbeats back to the master periodically
o Master responds with a task if a slot is free, picking the task with data closest to the node (see the scheduling sketch after this list)
o Takes advantage of locality of data, processing data on or near the storage assets to decrease transmission of data
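A hypothetical sketch of the master's locality-preferring assignment rule described above; the Task type and its fields are invented for illustration.

    import java.util.List;
    import java.util.Optional;

    class SchedulerSketch {
        // Invented model of a pending task: its id plus the hosts that
        // store a replica of its input split.
        record Task(String id, List<String> hostsWithSplit) {}

        // When a slave's heartbeat reports a free slot, prefer a task whose
        // input data lives on that slave; otherwise hand out any pending task.
        Optional<Task> pickTaskFor(String slaveHost, List<Task> pending) {
            return pending.stream()
                    .filter(t -> t.hostsWithSplit().contains(slaveHost)) // data-local first
                    .findFirst()
                    .or(() -> pending.stream().findFirst());             // fallback: any task
        }
    }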
Fault Tolerance
o Parallelism offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled
MapReduce Dataflow
Consists of a single master JobTracker and one slave TaskTracker per cluster node.
Dataflow:
o an input reader
o a Map function
o a partition function (see the sketch after this list)
o a compare function
o a Reduce function
o an output writer
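The partition function decides which Reduce task receives a given intermediate key. A sketch of the common default rule (Hadoop's HashPartitioner works the same way):

    class PartitionSketch {
        // Hash the key and take it modulo the number of reducers; masking the
        // sign bit keeps the result non-negative for negative hash codes.
        static int partition(Object key, int numReducers) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
        }
    }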
The input reader reads data from stable storage (typically a distributed file system) and generates key/value pairs. A common example reads a directory full of text files and returns each line as a record.
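A simplified sketch of such a reader, covering a single file rather than a whole directory; Hadoop's TextInputFormat applies the same idea, keying each line by its byte offset.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.List;

    class LineReaderSketch {
        record Record(long offset, String line) {}   // key = byte offset, value = line text

        static List<Record> read(Path file) throws IOException {
            List<Record> records = new ArrayList<>();
            long offset = 0;
            try (BufferedReader r = Files.newBufferedReader(file)) {
                String line;
                while ((line = r.readLine()) != null) {
                    records.add(new Record(offset, line));
                    // +1 for '\n'; approximate for \r\n line endings
                    offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
                }
            }
            return records;
        }
    }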
Each Reduce call typically produces either one value or an empty return; the values produced across all keys are collected as the desired result list.
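For example, a hypothetical reducer that sums the counts for a word but returns empty below a threshold, so rare words simply drop out of the result list:

    import java.util.Iterator;
    import java.util.Optional;

    class FrequentWordReducer {
        // Sum the counts for one key; emit nothing if the total is too small.
        static Optional<Integer> reduce(String word, Iterator<Integer> counts, int minCount) {
            int sum = 0;
            while (counts.hasNext()) sum += counts.next();
            return sum >= minCount ? Optional.of(sum)    // one value: kept in the result list
                                   : Optional.empty();   // empty return: key dropped
        }
    }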
MapReduce Uses
o MapReduce aids organizations in processing and analyzing large volumes of multi-structured data, with analyses that are often difficult to implement using the standard SQL employed by relational DBMSs
o Uses include: distributed pattern-based searching, distributed sort, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, and statistical machine translation
o The MapReduce model has been adapted to several computing environments, such as multi-core and many-core systems, desktop grids, volunteer computing environments, dynamic cloud environments, and mobile environments
o At Google, MapReduce was used to completely regenerate Google's index of the World Wide Web, replacing the old ad hoc programs that updated the index and ran the various analyses
Hadoop
What is Hadoop?
o Open-source Java software framework provided by the Apache Software Foundation to support data-intensive distributed (file system) applications
o Solves the problem of a tremendous amount of data that needs to be analyzed and processed very quickly
o Allows for the distributed processing of large data sets across clusters of computers using simple programming models
o Delivers a highly available service on top of a cluster of computers
o Designed to scale up from single servers to thousands of machines and to detect and handle failures at the application layer
What is Hive?
o Hive has gained the most acceptance in the industry
o Main benefit: dramatically improves the simplicity and speed of MapReduce development
o Its SQL-like syntax makes it easy to use for non-programmers who are comfortable with SQL
o HiveQL statements are entered using a command-line or Web interface, or may be embedded in applications that use ODBC and JDBC interfaces to the Hive system (see the sketch after this list)
o The Hive Driver system converts the query statements into a series of MapReduce jobs
o Data files in Hive are seen in the form of tables (and views), but Hive does not support the concepts of primary or foreign keys or constraints of any type
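A sketch of the JDBC route, assuming a HiveServer2 instance at localhost:10000; the web_logs table and its columns are hypothetical. The Hive driver compiles the query into MapReduce jobs behind the scenes.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcSketch {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default");      // HiveServer2 JDBC URL
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                     "SELECT status, COUNT(*) FROM web_logs GROUP BY status")) {
                while (rs.next()) {                                // rows arrive once the jobs finish
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }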
Hadoop Architecture
o Runs best on Linux machines, working directly with the underlying hardware
o Utilizes rack servers (not blades) populated in racks, each connected to a top-of-rack switch
o The majority of the servers will be Slave nodes with lots of local disk storage and moderate amounts of CPU and DRAM
o Some machines will be Master nodes with a slightly different configuration favoring more DRAM and CPU and less local storage