TheEdge10 : Big Data is Here - Hadoop to the Rescue

Big Data is Here – Hadoop to the Rescue!Shay Sofer,AlphaCSP

Today we will:Understand what is BigDataGet to know HadoopExperience some MapReduce magicPersist very large filesLearn some nifty tricksOn Today's Menu...

IDC : “Total data in the universe : 1.2 Zettabytes” (May, 2010)1ZB = 1 Trillion Gigabytes (or: 1,000,000,000,000,000,000,000 bytes = 1021)60% Growth from 2009By 2020 – we will reach 35 ZBFacts and NumbersData is Everywhere

Facts and NumbersData is EverywhereSource: www.idc.com

234M Web sites7M New sites in 2009New York Stock Exchange – 1 TB of data per dayWeb 2.0147M Blogs (and counting…)Twitter – ~12 TB of data per dayFacts and NumbersData is Everywhere

500M users40M photos per day More than 30billion pieces of content (web links, news stories, blog posts, notes, photo albums etc.) shared each monthFacts and Numbers - FacebookData is Everywhere

Big dataare datasets that grow so large that they become awkward to work with using on-hand database management toolsWhere and how do we store this information?How do we perform analyses on such large datasets?Why are you here?Data is Everywhere

Scale-up Vs. Scale-outData is Everywhere

Scale-up : Adding resources to a single node in a system, typically involving the addition of CPUs or memory to a single computerScale-out : Adding more nodes to a system. E.g. Adding a new computer with commodity hardware to a distributed software applicationScale-up Vs. Scale-outData is Everywhere

A framework for writing and running distributed applications that process large amount of data.Runs on large clusters of commodity hardwareA cluster with hundreds of machine is standardInspired by Google’s architecture : MapReduce and GFSWhat is Hadoop?Hadoop

Robust - Handles failures of individual nodesScales linearlyOpen source A top-level Apache projectWhy Hadoop?Hadoop

Facebook holds the largest known Hadoop storage cluster in the world2000 machines12 TB per machine (some has 24 TB)32 GB of RAM per machineTotal of more than 21 Petabytes (1 Petabyte = 1024 Terabytes) Facebook (Again…)Hadoop

HistoryHadoopApache Nutch – Open Source web search engine founded by Doug CuttingCutting joins Yahoo!, forms HadoopSorting 1 TB in 62 seconds200420062008200820022010Google’s GFS & MapReduce papers publishedCreating the longest Pi yetHadoop hits web scale, being used by Yahoo! for web indexing

A programming model for processing and generating large data setsIntroduced by Google Parallel processing of the map/reduce operationsDefinitionMapReduce

Sam believed “An apple a day keeps a doctor away”MapReduce – The Story of SamMotherSamAn AppleSource: Saliya Ekanayake, SALSA HPC Group at Community Grids Labs

Sam thought of “drinking” the appleMapReduce – The Story of SamHe used a to cut the and a to make juice. Source: Saliya Ekanayake, SALSA HPC Group at Community Grids Labs

Sam applied his invention to all the fruits he could find in the fruit basket(map ‘( )) MapReduce – The Story of SamA list of values mapped into another list of values, which gets reduced into a single value( ) (reduce ‘( )) Source: Saliya Ekanayake, SALSA HPC Group at Community Grids Labs

MapReduce – The Story of SamSam got his first job for his talent in making juiceFruitsNow, it’s not just one basket but a whole container of fruitsLargedata and list of values for outputAlso, they produce alist of juice types separately

But, Sam had just ONE and ONE Source: Saliya Ekanayake, SALSA HPC Group at Community Grids Labs

MapReduce – The Story of SamSam Implemented a parallelversion of his innovation Each map input: list of <key, value> pairsFruits(<a, > , <o, > , <p ,> , …)MapEach map output: list of <key, value> pairs(<a’ , > , <o’, v > , <p’ , > , …)Grouped by key (shuffle)Each reduce input: <key, value-list>e.g. <a’, ( …)>ReduceReduced into a list of valuesSource: Saliya Ekanayake, SALSA HPC Group at Community Grids Labs

Mapper- Takes a series of key/value pairs, processes each and generates output key/value pairs (k1, v1) list(k2, v2)Reducer- Iterates through the values that are associated with a specific key and generate output (k2, list (v2)) list(k3, v3)The Mapper takes the input data, filters and transforms into something The Reducercan aggregate overFirst Map, Then ReduceMapReduce

Hadoop comes with a number of predefined classesBooleanWritableByteWritableLongWritableText, etc…Supports pluggable serialization frameworksApache Avro Hadoop Data TypesMapReduce

TextInputFormat / TextOutputFormatKeyValueTextInputFormatSequenceFile - A Hadoopspeciﬁc compressed binary ﬁle format. Optimized for passing data between 2 MapReduce jobsInput / Output FormatsMapReduce

publicstaticclass MapClass extends MapReduceBaseprivateText word = new Text();publicvoid map(LongWritable key, Text value, OutputCollector<Text,IntWritable> output, …){String line = value.toString();StringTokenizer itr = new StringTokenizer(line);while(itr.hasMoreTokens()){word.set(itr.nextToken());output.collect(word,newIntWritable(1)); } } } Word Count – The Mapperimplements Mapper<LongWritable,Text,Text,IntWritable>< Hello, 1> < World, 1> < Bye, 1> < World, 1> <K1,Hello World Bye World>

publicstaticclassReduceClassextends MapReduceBasepublicvoidreduce(Text key, Iterator<IntWritable> values, OutputCollector<Text,IntWritable> output,…){intsum = 0;while(values.hasNext()){sum += values.next().get(); }output.collect(key, new IntWritable(sum));{{Word Count– The ReducerimplementsReducer<Text,IntWritable,Text,IntWritable>{< Hello, 1> < World, 2> < Bye, 1> < Hello, 1> < World, 1> < Bye, 1> < World, 1>

publicstaticvoid main(String[] args){JobConf job = newJobConf(WordCount.class);job.setOutputKeyClass(Text.class);job.setOutputValueClass(IntWritable.class);job.setMapperClass(MapClass.class);job.setReducerClass(ReduceClass.class);FileInputFormat.addInputFormat(job ,new Path(args[0]));FileOutputFormat.addOutputFormat(job ,newPath(args[1]));//job.setInputFormat(KeyValueTextInputFormat.class);JobClient.runJob(job);{Word Count – The Driver

Music discovery websiteScrobbling / Streaming VIA radio40M unique visitors per monthOver 40M scrobbles per dayEach scrobble creates a log lineHadoop @ Last.FMMapReduce

TheEdge10 : Big Data is Here - Hadoop to the Rescue

Goal : Create a “Unique listeners per track” chartSample listening dataMapReduce

publicvoid map(LongWritable position, Text rawLine, OutputCollector<IntWritable,IntWritable> output, Reporter reporter) throwsIOException { intscrobbles, radioListens; // assume they are initialized -IntWritabletrackId,userId; // for verbosity // if track somehow is marked with zero plays - ignoreif (scrobbles <= 0 && radioListens <= 0) {return; }// output user id against track idoutput.collect(trackId, userId); }Unique Listens - Mapper

publicvoid reduce(IntWritabletrackId, Iterator<IntWritable> values, OutputCollector<IntWritable, IntWritable> output, Reporter reporter) throwsIOException { Set<Integer> usersSet = newHashSet<Integer>();// add all userIds to the set, duplicates removedwhile (values.hasNext()) {IntWritableuserId = values.next();usersSet.add(userId.get()); }// output: trackId -> number of unique listeners per trackoutput.collect(trackId, newIntWritable(usersSet.size()));}Unique Listens - Reducer

Complex tasks will sometimes be needed to be broken down to subtasksOutput of the previous job goes as input to the next jobjob-a | job-b | job-cSimply launch the driver of the 2nd job after the 1stChainingMapReduce

Hadoop supports other languages via API called StreamingUse UNIX commands as mappers and reducersOr use any script that processes line-oriented data stream from STDIN and outputs to STDOUTPython, Perl etc.Hadoop StreamingMapReduce

$ hadoop jar hadoop-streaming.jar -input input/myFile.txt -output output.txt -mapper myMapper.py -reducer myReducer.pyHadoop StreamingMapReduce

HDFSHadoop Distributed File System

A large dataset can and will outgrow the storage capacity of a single physical machinePartition it across separate machines – Distributed FileSystemsNetwork based - complexWhat happens when a node fails?Distributed FileSystemHDFS

Designed for storing very large files running on clusters on commodity hardwareHighly fault-tolerant (via replication)A typical file is gigabytes to terabytes in sizeHigh throughputHDFS - Hadoop Distributed FileSystemHDFS

Running Hadoop = Running a set of daemons ondifferent servers in your networkNameNodeDataNodeSecondary NameNodeJobTrackerTaskTrackerHadoop’s Building BlocksHDFS

Topology of a Hadoop ClusterSecondary NameNodeNameNodeJobTrackerDataNodeTaskTrackerDataNodeTaskTrackerDataNodeTaskTrackerDataNodeTaskTracker

HDFS has a master/slave architecture ; The NameNode acts as the masterSingle NameNode per HDFSKeeps track of :How the files are broken into blocksWhich nodes store those blocksThe overall health of the filesystemMemory and I/O intensiveThe NameNodeHDFS

Each slave machine will host a DataNode daemonServes read/write/delete requests from the NameNodeManages the storage attached to the nodes Sends a periodic Heartbeat to the NameNodeThe DataNodeHDFS

Failure is the norm rather than exceptionDetection of faults and quick, automatic recoveryEach file is stored as a sequence of blocks (default: 64MB each)The blocks of a file are replicated for fault toleranceBlock size and replicas are configurable per fileFault Tolerance - ReplicationHDFS

Assistant daemon that should be on a dedicated nodeTakes snapshots of the HDFS metadataDoesn’t receive real time changesHelps minimizing downtime incase the NameNode crashesSecondary NameNodeHDFS

One per cluster - on the master nodeReceives job request submitted by the clientSchedules and monitors MapReduce jobs on TaskTrackersJobTrackerHDFS

Run map and reduce tasksSend progress reports to the JobTrackerTaskTrackerHDFS

VIA file commands$ hadoopfs -mkdir /user/chuck$ hadoopfs -put hugeFile.txt$ hadoopfs -get anotherHugeFile.txtProgrammatically (HDFS API)FileSystem hdfs = FileSystem.get(new Configuration());FSDataOutStream out = hdfs.create(filePath);while(...){ out.write(buffer,0,bytesRead);}Working with HDFSHDFS

Tip #1: Hadoop Configuration TypesTips & Tricks

Monitoring events in the cluster can prove to be a bit more difficultWeb interface for our clusterShows a summary of the clusterDetails about list of jobs there are currently running, completed and failedTip #2: JobTracker UI Tips & Tricks

Digging through logs or…. Running again the exact same scenario with the same input on the same node?IsolationRunner can rerun the failed task to reproduce the problemAttach a debugger Keep.failed.tasks.file= trueTip #3: IsolationRunner – Hadoop’s Time MachineTips & Tricks

Output of the map phase (which will be shuffled across the network) can be quite largeBuilt in support for compressionDifferent codecs : gzip, bzip2 etcTransparent to the developerconf.setCompressMapOutput(true);conf.setMapOutputCompressorClass(GzipCodec.class);Tip #4: CompressionTips & Tricks

A node can experience a slowdown, thus slowing down the entire jobIf a task is identified as “slow”, it will be scheduled to run in another node in parallelAs soon as one finishes successfully, the others will be killedAn optimization – not a featureTip #5: Speculative ExecutionTips & Tricks

Input can come from 2 (or more) different sourcesHadoop has a contrib package called datajoinGeneric framework for performing reduce-side joinTip #6: DataJoin PackageMapReduce

Hadoop in the CloudAmazon Web Services

Cloud computing - Shared resources and information are provided on demandRent a cluster rather than buy itThe best known infrastructure for cloud computing is Amazon Web Services (AWS)Launched at July 2002Cloud Computing and AWSHadoop in the Cloud

Elastic Compute Cloud (EC2)A large farm of VMs where a user can rent and use them to run a computer applicationWide range on instance types to choose from (price varies)Simple Storage Service (S3) – Online storage for persisting MapReduce data for future useHadoop comes with built in support for EC2 and S3$ hadoop-ec2 launch-cluster <cluster-name> <num-of-slaves> Hadoop in the Cloud – Core Services

EC2 Data FlowHDFSEC2MapReduce TasksOurData

EC2 & S3 Data FlowS3OurDataHDFSEC2MapReduce Tasks

Thinking in the level of Map, Reduce and job chaining instead of simple data flow operations is non-trivialPig simplifies Hadoop programmingProvides high-level data processing language : Pig LatinBeing used by Yahoo! (70% of production jobs), Twitter, LinkedIn, EBay etc..Problem: Users file & Pages file. Find top 5 most visited pages by users aged 18-25PigHadoop-Related Projects

Users = LOAD ‘users.csv’ AS (name, age);Fltrd = FILTER Users BYage >= 18 AND age <= 25;Pages = LOAD ‘pages.csv’ AS (user, url);Jnd = JOIN Fltrd BY name, Pages BY user;Grpd = GROUP Jnd BY url;Smmd = FOREACH Grpd GENERATEgroup, COUNT(Jnd) AS clicks;Srtd = ORDER Smmd BY clicks DESC;Top5 = LIMIT Srtd 5;STORE Top5 INTO ‘top5sites.csv’;Pig Latin – Data Flow Language

A data warehousing package built on top of HadoopSQL-like queries on large datasets HiveHadoop-Related Projects

Hadoop database for random read/write accessUses HDFS as the underlying file systemSupports billions of rows and millions of columnsFacebook chose HBase as a framework for their new version of “Messages”HBaseHadoop-Related Projects

A distribution of Hadoop that simplifies deployment by providing the most recent stable version of Apache Hadoop with and backportsClouderaHadoop-Related Projects

Machine learning algorithms for HadoopComing up next.. (-:MahoutHadoop-Related Projects

Big Data can and will cause serious scalability problems to your applicationMapReduce for analysis, Distributed filesystem for storageHadoop = MapReduce + HDFS and much moreAWS integration is easyLots of documentationLast wordsSummary

Hadoop in Action / Chuck LamHadoop: The Definitive Guide, 2nd Edition / Tom White (O’reilly)Apache Hadoop DocumentationHadoop @ Last.FM Presentation MapReduce in Simple Terms / SaliyaEkanayakeAmazon Web ServicesReferences

TheEdge10 : Big Data is Here - Hadoop to the Rescue

More Related Content

TheEdge10 : Big Data is Here - Hadoop to the Rescue

Editor's Notes