
DAG vs MapReduce

The new generation of Big Data tools focuses largely on improving support for real-time (or near-real-time) computation and interactive applications by reducing the latency involved in processing jobs. If you look at Storm, Spark, Tez, and other newer tools, you will frequently encounter the term “DAG,” or Directed Acyclic Graph. This article will explain why traditional MapReduce is subject to undesirable latencies, what a DAG is, and why these new systems use this approach.

Hadoop, which began life specifically as an implementation of the MapReduce paradigm, has traditionally relied on MapReduce as its primary programming model. Hadoop MapReduce jobs display high latencies as a result of the programming model of traditional MapReduce, in which jobs follow a stock structure of “map,” followed by “shuffle,” followed by “reduce” steps. Even single-step jobs under MapReduce tend to feature high latencies. This problem is exacerbated for more complex processing involving “chaining” successive MapReduce jobs. In multi-step jobs, each job is blocked from beginning until all of the preceding jobs have finished. As a result of this model, complex computations can require time on the order of minutes, hours, or longer, even with fairly small data volumes.
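To make the map, shuffle, reduce structure and the cost of chaining concrete, here is a minimal in-memory sketch in Python. It is a toy and not Hadoop's API: `run_job` and the mapper and reducer names are hypothetical, invented for illustration. The point is only that the second job takes the first job's complete output as its input, so it cannot begin until the first has finished.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Run one toy MapReduce job: map, then shuffle, then reduce."""
    # Map: every input record yields zero or more (key, value) pairs.
    mapped = [pair for record in records for pair in mapper(record)]

    # Shuffle: group values by key. This cannot finish until the map
    # phase has produced all of its output.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    # Reduce: collapse each key's values into one output record.
    return [reducer(key, values) for key, values in sorted(groups.items())]


# Job 1 (hypothetical): count how often each word occurs.
def split_words(line):
    return [(word, 1) for word in line.split()]


def sum_counts(word, ones):
    return (word, sum(ones))


# Job 2 (hypothetical): group words by how often they occurred.
def key_by_count(word_and_count):
    word, count = word_and_count
    return [(count, word)]


def collect_words(count, words):
    return (count, sorted(words))


lines = ["a b a", "b c"]
word_counts = run_job(lines, split_words, sum_counts)
# Job 2 consumes job 1's complete output, so it cannot start early.
words_by_count = run_job(word_counts, key_by_count, collect_words)

print(word_counts)      # [('a', 2), ('b', 2), ('c', 1)]
print(words_by_count)   # [(1, ['c']), (2, ['a', 'b'])]
```

In this toy the hand-off between jobs is a Python list held in memory. In a real cluster each job also pays its own startup cost, which is part of why chained jobs add up to the latencies described above.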
A Directed Acyclic Graph, in this context, refers to a model for scheduling work in which jobs are represented as vertices in a graph, where the order of execution is specified by the directionality of the edges in the graph. The “acyclic” part just means that there are no loops (“cycles”) in the graph. In a system which schedules jobs using a DAG, independent nodes (computational steps) in the graph can run in parallel, rather than sequentially. This approach makes it easier for programmers to build more complex multi-step computations, and avoids the scheduling overhead imposed by traditional MapReduce.

Of course, simply switching to a DAG for scheduling does not alleviate the high latencies associated with single-step Hadoop MapReduce jobs. This is why even workflows constructed as DAGs that link Hadoop MapReduce jobs still suffer in the latency area. An example of this problem would be using an external scheduler like Oozie to control a series of MapReduce jobs. Each workflow still has to pay the cost of high startup times and high latencies for individual jobs. So in order to achieve low overall latency, systems such as Spark, Storm, Samza, and others have also added other optimizations, primarily copying data into memory and performing substantially less disk I/O.

Aside from improving latency, DAG-based systems have other advantages. For example, it is simpler to implement a fault-tolerant approach using a DAG. In the event of a job failure, you can easily backtrack through the graph and re-execute any failed jobs, even at intermediate stages of a computation. The enforced order of the graph always allows you to walk through the graph from any node to the eventual end.

Finally, we would be remiss in not pointing out that Hadoop itself has moved beyond its historical reliance on simple MapReduce.
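The DAG scheduling and recovery ideas described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how Spark, Tez, or Storm implement scheduling: the step names and the helpers `parallel_waves` and `rerun_set` are hypothetical. It uses the standard-library `graphlib` module (Python 3.9+) to group steps into “waves” whose members have no unmet dependencies and so could run in parallel, and it computes which steps would need re-execution after a failure.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical workflow: each step maps to the set of steps it depends on.
WORKFLOW = {
    "load_users": set(),
    "load_orders": set(),
    "join": {"load_users", "load_orders"},
    "aggregate": {"join"},
    "report": {"aggregate"},
}


def parallel_waves(dependencies):
    """Group steps into waves; steps within a wave are independent."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises CycleError if the graph is not acyclic
    waves = []
    while sorter.is_active():
        # Everything returned here has all of its dependencies satisfied.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished to unlock the next one
    return waves


def rerun_set(dependencies, failed):
    """Return the failed step plus every step that depends on it."""
    affected = {failed}
    changed = True
    while changed:
        changed = False
        for step, preds in dependencies.items():
            if step not in affected and affected & set(preds):
                affected.add(step)
                changed = True
    return affected


if __name__ == "__main__":
    print(parallel_waves(WORKFLOW))
    print(sorted(rerun_set(WORKFLOW, "join")))
```

With this workflow, the two load steps land in the first wave together, while `join`, `aggregate`, and `report` each wait for the wave before them. If `join` fails, only `join` and the steps downstream of it need to run again; the load steps do not.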
The Hadoop 2.x series has refactored the resource allocation and scheduling components to support a much more flexible architecture, which allows the implementation of new, non-MapReduce programming models. With Hadoop 2, other processing engines can layer on top of YARN and provide low-latency, real-time processing, while living side-by-side with jobs written for MapReduce, MPI, BSP, or other execution models. Spark, in fact, can be deployed onto an existing Hadoop cluster and take advantage of YARN for scheduling and resource allocation.

As you can see, a Directed Acyclic Graph approach is a key element of most next-generation, real-time Big Data platforms. These tools, including Storm, Spark, Samza, and Tez, offer amazing new capabilities for building highly interactive, real-time computing systems to power your real-time BI, predictive analytics, real-time marketing, and other critical systems.

Are you looking to incorporate a new generation of Big Data tools to support real-time computation and interactive applications? Interested in Hadoop or expanding into the Hadoop ecosystem to give your organization the data-driven success stories it needs? Give us a call at 919.321.0119 or shoot us an email at info@mammothdata.com to get started.

Phil Rhodes, Senior Consultant

Copyright © 2015 Mammoth Data, Inc. All rights reserved.
