Parallel & Distributed Computing
Serial Computing
Instructions are executed one at a time on a single processor.
Parallel Computing
A problem is broken into parts that are executed simultaneously on multiple processors; all processors may have access to a shared memory to exchange information between processors.
Distributed Computing
Multiple autonomous computers communicate over a network; each processor has its own private memory, and information is exchanged by passing messages.
Computer Cluster
A set of loosely connected computers that work together so that, in many respects, they can be viewed as a single system.
MapReduce
What is MapReduce?
o Framework developed by Google for processing parallelizable problems across large data sets using a large cluster of computers (nodes)
o Programming paradigm: splits a task into smaller subtasks that can be executed in parallel and therefore run faster than execution on a single computer (see the sketch below)
o Simplified data processing on large clusters: a large server farm can use MapReduce to sort a petabyte of data in only a few hours
o Computational processing can occur on data stored either in a file system (unstructured) or in a database (structured)
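A minimal, single-process Java sketch of the map/shuffle/reduce idea, using word count; a real MapReduce run distributes the same three phases across many nodes, and the input lines here are illustrative only.

    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class WordCountSketch {
        public static void main(String[] args) {
            List<String> lines = List.of("the quick brown fox", "the lazy dog");

            Map<String, Long> counts = lines.parallelStream()            // "map": process records in parallel
                    .flatMap(line -> Arrays.stream(line.split("\\s+")))  // emit one token per word
                    .collect(Collectors.groupingBy(w -> w,               // "shuffle": group values by key
                             Collectors.counting()));                    // "reduce": total per key

            counts.forEach((word, n) -> System.out.println(word + "\t" + n));
        }
    }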
MapReduce Overview
MapReduce Architecture
Job Scheduling System
o Jobs are made up of tasks; the master scheduler assigns tasks to slave machines (nodes), making work easy to distribute across nodes
o Input and final output are stored on a distributed file system
o Master pings slaves periodically to detect failures
o Slaves send heartbeats back to the master periodically
o Master responds with a task if a slot is free, picking the task with data closest to the node (see the scheduling sketch after this list)
o Takes advantage of locality of data, processing data on or near the storage assets to decrease transmission of data
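A hypothetical sketch of the master's locality-preferring assignment rule described above; the Task type and its fields are invented for illustration.

    import java.util.List;
    import java.util.Optional;

    class SchedulerSketch {
        // Invented model of a pending task: its id plus the hosts that
        // store a replica of its input split.
        record Task(String id, List<String> hostsWithSplit) {}

        // When a slave's heartbeat reports a free slot, prefer a task whose
        // input data lives on that slave; otherwise hand out any pending task.
        Optional<Task> pickTaskFor(String slaveHost, List<Task> pending) {
            return pending.stream()
                    .filter(t -> t.hostsWithSplit().contains(slaveHost)) // data-local first
                    .findFirst()
                    .or(() -> pending.stream().findFirst());             // fallback: any task
        }
    }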
Fault Tolerance
o Parallelism offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled
MapReduce Dataflow
Consists of a single master JobTracker and one slave TaskTracker per cluster node.
Dataflow:
o an input reader
o a Map function
o a partition function (see the sketch after this list)
o a compare function
o a Reduce function
o an output writer
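The partition function decides which Reduce task receives a given intermediate key. A sketch of the common default rule (Hadoop's HashPartitioner works the same way):

    class PartitionSketch {
        // Hash the key and take it modulo the number of reducers; masking the
        // sign bit keeps the result non-negative for negative hash codes.
        static int partition(Object key, int numReducers) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
        }
    }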
The input reader reads data from stable storage (typically a distributed file system) and generates key/value pairs. A common example reads a directory full of text files and returns each line as a record.
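A simplified sketch of such a reader, covering a single file rather than a whole directory; Hadoop's TextInputFormat applies the same idea, keying each line by its byte offset.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.List;

    class LineReaderSketch {
        record Record(long offset, String line) {}   // key = byte offset, value = line text

        static List<Record> read(Path file) throws IOException {
            List<Record> records = new ArrayList<>();
            long offset = 0;
            try (BufferedReader r = Files.newBufferedReader(file)) {
                String line;
                while ((line = r.readLine()) != null) {
                    records.add(new Record(offset, line));
                    // +1 for '\n'; approximate for \r\n line endings
                    offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
                }
            }
            return records;
        }
    }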
Each Reduce call typically produces either one value or an empty return; the values produced across all keys are collected as the desired result list.
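For example, a hypothetical reducer that sums the counts for a word but returns empty below a threshold, so rare words simply drop out of the result list:

    import java.util.Iterator;
    import java.util.Optional;

    class FrequentWordReducer {
        // Sum the counts for one key; emit nothing if the total is too small.
        static Optional<Integer> reduce(String word, Iterator<Integer> counts, int minCount) {
            int sum = 0;
            while (counts.hasNext()) sum += counts.next();
            return sum >= minCount ? Optional.of(sum)    // one value: kept in the result list
                                   : Optional.empty();   // empty return: key dropped
        }
    }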
MapReduce Uses
o MapReduce aids organizations in processing and analyzing large volumes of multi-structured data, with analyses that are often difficult to implement using the standard SQL employed by relational DBMSs
o Uses include: distributed pattern-based searching, distributed sort, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, and statistical machine translation
o The MapReduce model has been adapted to several computing environments, such as multi-core and many-core systems, desktop grids, volunteer computing environments, dynamic cloud environments, and mobile environments
o At Google, MapReduce was used to completely regenerate Google's index of the World Wide Web, replacing the old ad hoc programs that updated the index and ran the various analyses
Hadoop
What is Hadoop?
o Open-source Java software framework provided by the Apache Software Foundation to support data-intensive distributed (file system) applications
o Solves the problem of a tremendous amount of data that needs to be analyzed and processed very quickly
o Allows for the distributed processing of large data sets across clusters of computers using simple programming models
o Delivers a highly available service on top of a cluster of computers
o Designed to scale up from single servers to thousands of machines and to detect and handle failures at the application layer
What is Hive?
o Hive has gained the most acceptance in the industry
o Main benefit: dramatically improves the simplicity and speed of MapReduce development
o Its SQL-like syntax makes it easy to use for non-programmers who are comfortable with SQL
o HiveQL statements are entered using a command-line or Web interface, or may be embedded in applications that use ODBC and JDBC interfaces to the Hive system (see the sketch after this list)
o The Hive Driver system converts the query statements into a series of MapReduce jobs
o Data files in Hive are seen in the form of tables (and views), but Hive does not support the concepts of primary or foreign keys or constraints of any type
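A sketch of the JDBC route, assuming a HiveServer2 instance at localhost:10000; the web_logs table and its columns are hypothetical. The Hive driver compiles the query into MapReduce jobs behind the scenes.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcSketch {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default");      // HiveServer2 JDBC URL
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                     "SELECT status, COUNT(*) FROM web_logs GROUP BY status")) {
                while (rs.next()) {                                // rows arrive once the jobs finish
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }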
Hadoop Architecture
o Runs best on Linux machines, working directly with the underlying hardware
o Utilizes rack servers (not blades) populated in racks, each connected to a top-of-rack switch
o The majority of the servers will be Slave nodes with lots of local disk storage and moderate amounts of CPU and DRAM
o Some machines will be Master nodes with a slightly different configuration favoring more DRAM and CPU and less local storage