Big Data is Here – Hadoop to the Rescue!
Shay Sofer, AlphaCSP
On Today's Menu...
Today we will:
- Understand what Big Data is
- Get to know Hadoop
- Experience some MapReduce magic
- Persist very large files
- Learn some nifty tricks
Data is Everywhere
Facts and Numbers (Data is Everywhere)
- IDC: "Total data in the universe: 1.2 zettabytes" (May 2010)
- 1 ZB = 1 trillion gigabytes (1,000,000,000,000,000,000,000 bytes = 10^21)
- 60% growth from 2009
- By 2020 we will reach 35 ZB
Facts and Numbers (Data is Everywhere)
Source: www.idc.com
Facts and Numbers (Data is Everywhere)
- 234M web sites; 7M new sites in 2009
- New York Stock Exchange: 1 TB of data per day
- Web 2.0: 147M blogs (and counting…)
- Twitter: ~12 TB of data per day
Facts and Numbers – Facebook (Data is Everywhere)
- 500M users
- 40M photos per day
- More than 30 billion pieces of content (web links, news stories, blog posts, notes, photo albums, etc.) shared each month
Why are you here? (Data is Everywhere)
- Big data: datasets that grow so large that they become awkward to work with using on-hand database management tools
- Where and how do we store this information?
- How do we perform analyses on such large datasets?
Scale-up vs. Scale-out (Data is Everywhere)
Scale-up vs. Scale-out (Data is Everywhere)
- Scale-up: adding resources to a single node in a system, typically by adding CPUs or memory to a single computer
- Scale-out: adding more nodes to a system, e.g. adding a new computer with commodity hardware to a distributed software application
Introducing… Hadoop!
What is Hadoop? (Hadoop)
- A framework for writing and running distributed applications that process large amounts of data
- Runs on large clusters of commodity hardware; a cluster with hundreds of machines is standard
- Inspired by Google's architecture: MapReduce and GFS
Why Hadoop? (Hadoop)
- Robust: handles failures of individual nodes
- Scales linearly
- Open source; a top-level Apache project
Hadoop
Facebook (Again…) (Hadoop)
- Facebook holds the largest known Hadoop storage cluster in the world
- 2,000 machines
- 12 TB per machine (some have 24 TB)
- 32 GB of RAM per machine
- Total of more than 21 petabytes (1 petabyte = 1024 terabytes)
History (Hadoop)
- 2002: Apache Nutch – open source web search engine founded by Doug Cutting
- 2004: Google's GFS & MapReduce papers published
- 2006: Cutting joins Yahoo!, forms Hadoop
- 2008: Hadoop hits web scale, being used by Yahoo! for web indexing
- 2008: Sorting 1 TB in 62 seconds
- 2010: Creating the longest Pi yet
Hadoop
IDE Plugin (Hadoop)
Hadoop and MapReduce
Definition (MapReduce)
- A programming model for processing and generating large data sets
- Introduced by Google
- Parallel processing of the map/reduce operations
MapReduce – The Story of Sam
- Sam believed "An apple a day keeps a doctor away"
- [Image: Mother, Sam, an apple]
Source: Saliya Ekanayake, SALSA HPC Group at Community Grids Labs
MapReduce – The Story of Sam
- Sam thought of "drinking" the apple
- He used a knife to cut the apple and a juicer to make juice
Source: Saliya Ekanayake, SALSA HPC Group at Community Grids Labs
MapReduce – The Story of Sam
- Sam applied his invention to all the fruits he could find in the fruit basket
- (map …) then (reduce …): a list of values mapped into another list of values, which gets reduced into a single value
Source: Saliya Ekanayake, SALSA HPC Group at Community Grids Labs
MapReduce – The Story of Sam
- Sam got his first job for his talent in making juice
- Now, it's not just one basket but a whole container of fruits (large input data)
- Also, they produce a list of juice types separately (a list of values for output)
MapReduce – The Story of Sam
- But Sam had just ONE knife and ONE juicer
Source: Saliya Ekanayake, SALSA HPC Group at Community Grids Labs
MapReduce – The Story of Sam
- Sam implemented a parallel version of his innovation
- Each map input: a list of <key, value> pairs, e.g. (<apple, …>, <orange, …>, <pear, …>, …)
- Each map output: a list of <key, value> pairs, e.g. (<apple', …>, <orange', …>, <pear', …>, …)
- Grouped by key (shuffle)
- Each reduce input: <key, value-list>, e.g. <apple', (…)>
- Reduced into a list of values
Source: Saliya Ekanayake, SALSA HPC Group at Community Grids Labs
First Map, Then Reduce (MapReduce)
- Mapper: takes a series of key/value pairs, processes each, and generates output key/value pairs: (k1, v1) → list(k2, v2)
- Reducer: iterates through the values associated with a specific key and generates output: (k2, list(v2)) → list(k3, v3)
- The Mapper takes the input data, filters it, and transforms it into something the Reducer can aggregate over
[Diagram: Input → Map → Shuffle → Reduce]
Hadoop Data Types (MapReduce)
- Hadoop comes with a number of predefined classes: BooleanWritable, ByteWritable, LongWritable, Text, etc.
- Supports pluggable serialization frameworks, e.g. Apache Avro
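As a rough illustration of where these types plug in, a custom value type only needs to implement the Writable interface; the sketch below assumes the classic org.apache.hadoop.io API, and the TrackStats class with its two fields is purely hypothetical:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical custom value type; Hadoop serializes it via write()/readFields()
public class TrackStats implements Writable {
  private int scrobbles;
  private int radioListens;

  public void write(DataOutput out) throws IOException {
    out.writeInt(scrobbles);       // serialize the fields in a fixed order
    out.writeInt(radioListens);
  }

  public void readFields(DataInput in) throws IOException {
    scrobbles = in.readInt();      // deserialize in the same order
    radioListens = in.readInt();
  }
}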
Input / Output Formats (MapReduce)
- TextInputFormat / TextOutputFormat
- KeyValueTextInputFormat
- SequenceFile: a Hadoop-specific compressed binary file format, optimized for passing data between two MapReduce jobs (see the sketch below)
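As a small, hedged example of the SequenceFile format mentioned above (assuming the classic SequenceFile.createWriter(FileSystem, Configuration, Path, keyClass, valueClass) call; the output path is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/tmp/demo.seq");   // illustrative output path

    // write Text/IntWritable pairs into a binary SequenceFile
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, path, Text.class, IntWritable.class);
    try {
      writer.append(new Text("Hello"), new IntWritable(1));
      writer.append(new Text("World"), new IntWritable(2));
    } finally {
      writer.close();
    }
  }
}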
Word Count – The Mapper

public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, new IntWritable(1));
    }
  }
}

Input:  <k1, "Hello World Bye World">
Output: <Hello, 1> <World, 1> <Bye, 1> <World, 1>
Word Count – The Reducer

public static class ReduceClass extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Input:  <Hello, 1> <World, 1> <Bye, 1> <World, 1>
Output: <Hello, 1> <World, 2> <Bye, 1>
Word Count – The Driver

public static void main(String[] args) throws IOException {
  JobConf job = new JobConf(WordCount.class);

  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  job.setMapperClass(MapClass.class);
  job.setReducerClass(ReduceClass.class);

  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  // job.setInputFormat(KeyValueTextInputFormat.class);

  JobClient.runJob(job);
}
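To run the compiled job on a cluster, a usage example along these lines should work (hedged: the jar name and HDFS paths are illustrative):

$ hadoop jar wordcount.jar WordCount /user/chuck/input /user/chuck/output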
Hadoop @ Last.FM (MapReduce)
- Music discovery website
- Scrobbling / streaming via radio
- 40M unique visitors per month
- Over 40M scrobbles per day
- Each scrobble creates a log line
Sample Listening Data (MapReduce)
- Goal: create a "unique listeners per track" chart
- Sample listening data
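The mapper and reducer that follow read four fields per log line, which suggests a layout roughly like this (a hypothetical illustration of the format, not the deck's actual sample table):

userId    trackId    scrobbles    radioListens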
Unique Listens – The Mapper

public void map(LongWritable position, Text rawLine,
                OutputCollector<IntWritable, IntWritable> output,
                Reporter reporter) throws IOException {
  int scrobbles, radioListens;      // assume they are initialized (parsed from rawLine)
  IntWritable trackId, userId;      // likewise; declared here for verbosity

  // if the track somehow is marked with zero plays - ignore it
  if (scrobbles <= 0 && radioListens <= 0) {
    return;
  }

  // output user id against track id
  output.collect(trackId, userId);
}
Unique Listens – The Reducer

public void reduce(IntWritable trackId, Iterator<IntWritable> values,
                   OutputCollector<IntWritable, IntWritable> output,
                   Reporter reporter) throws IOException {
  Set<Integer> usersSet = new HashSet<Integer>();

  // add all userIds to the set, duplicates removed
  while (values.hasNext()) {
    IntWritable userId = values.next();
    usersSet.add(userId.get());
  }

  // output: trackId -> number of unique listeners per track
  output.collect(trackId, new IntWritable(usersSet.size()));
}
Chaining (MapReduce)
- Complex tasks will sometimes need to be broken down into subtasks
- Output of the previous job goes as input to the next job: job-a | job-b | job-c
- Simply launch the driver of the 2nd job after the 1st (a sketch follows)
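A minimal sketch of such chaining with the same old-style API used above: the first job's output directory becomes the second job's input, and since JobClient.runJob() blocks until completion, running the drivers one after the other is enough. FirstJob, SecondJob and the intermediate path are hypothetical names:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ChainDriver {
  public static void main(String[] args) throws IOException {
    Path input = new Path(args[0]);
    Path intermediate = new Path("/tmp/job-a-output");  // illustrative temp dir
    Path output = new Path(args[1]);

    JobConf jobA = new JobConf(FirstJob.class);         // hypothetical job A driver
    // ... set mapper/reducer and key/value classes for job A ...
    FileInputFormat.addInputPath(jobA, input);
    FileOutputFormat.setOutputPath(jobA, intermediate);
    JobClient.runJob(jobA);                             // blocks until job A finishes

    JobConf jobB = new JobConf(SecondJob.class);        // hypothetical job B driver
    // ... set mapper/reducer and key/value classes for job B ...
    FileInputFormat.addInputPath(jobB, intermediate);   // job A's output feeds job B
    FileOutputFormat.setOutputPath(jobB, output);
    JobClient.runJob(jobB);
  }
}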
Hadoop Streaming (MapReduce)
- Hadoop supports other languages via an API called Streaming
- Use UNIX commands as mappers and reducers
- Or use any script that processes a line-oriented data stream from STDIN and outputs to STDOUT: Python, Perl, etc.
Hadoop Streaming (MapReduce)

$ hadoop jar hadoop-streaming.jar \
    -input   input/myFile.txt \
    -output  output.txt \
    -mapper  myMapper.py \
    -reducer myReducer.py
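For instance, plain UNIX commands can serve as the mapper and reducer; the classic line-count example looks roughly like this (assuming the same streaming jar; cat passes each line through and wc counts them):

$ hadoop jar hadoop-streaming.jar \
    -input   input/myFile.txt \
    -output  lineCount \
    -mapper  /bin/cat \
    -reducer /usr/bin/wc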
HDFS – Hadoop Distributed File System
Distributed FileSystem (HDFS)
- A large dataset can and will outgrow the storage capacity of a single physical machine
- Partition it across separate machines: distributed filesystems
- Network based, therefore complex
- What happens when a node fails?
HDFS – Hadoop Distributed FileSystem (HDFS)
- Designed for storing very large files, running on clusters of commodity hardware
- Highly fault-tolerant (via replication)
- A typical file is gigabytes to terabytes in size
- High throughput
Hadoop's Building Blocks (HDFS)
Running Hadoop = running a set of daemons on different servers in your network:
- NameNode
- DataNode
- Secondary NameNode
- JobTracker
- TaskTracker
Topology of a Hadoop Cluster
[Diagram: master node running the NameNode, Secondary NameNode and JobTracker; each slave node running a DataNode and a TaskTracker]
The NameNode (HDFS)
- HDFS has a master/slave architecture; the NameNode acts as the master
- Single NameNode per HDFS
- Keeps track of: how the files are broken into blocks, which nodes store those blocks, and the overall health of the filesystem
- Memory and I/O intensive
The DataNode (HDFS)
- Each slave machine will host a DataNode daemon
- Serves read/write requests from clients; creates, deletes and replicates blocks on instruction from the NameNode
- Manages the storage attached to the node
- Sends a periodic Heartbeat to the NameNode
Fault Tolerance – Replication (HDFS)
- Failure is the norm rather than the exception
- Detection of faults and quick, automatic recovery
- Each file is stored as a sequence of blocks (default: 64 MB each)
- The blocks of a file are replicated for fault tolerance
- Block size and replication factor are configurable per file (see the sketch below)
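A minimal sketch of setting these per file through the FileSystem API (assuming the create(Path, overwrite, bufferSize, replication, blockSize) overload and setReplication(); the path and values are illustrative, and cluster-wide defaults normally come from the configuration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem hdfs = FileSystem.get(conf);
    Path file = new Path("/user/chuck/hugeFile.txt");   // illustrative path

    // create the file with a 4 KB buffer, 2 replicas and 128 MB blocks
    FSDataOutputStream out =
        hdfs.create(file, true, 4096, (short) 2, 128 * 1024 * 1024L);
    out.close();

    // the replication factor can also be changed after the fact
    hdfs.setReplication(file, (short) 3);
  }
}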
HDFS
Topology of a Hadoop Cluster
[Diagram repeated, highlighting the Secondary NameNode]
Secondary NameNode (HDFS)
- Assistant daemon that should run on a dedicated node
- Takes snapshots of the HDFS metadata
- Doesn't receive real-time changes
- Helps minimize downtime in case the NameNode crashes
Topology of a Hadoop Cluster
[Diagram repeated, highlighting the JobTracker and TaskTrackers]
JobTracker (HDFS)
- One per cluster, on the master node
- Receives job requests submitted by the client
- Schedules and monitors MapReduce jobs on TaskTrackers
TaskTracker (HDFS)
- Runs map and reduce tasks
- Sends progress reports to the JobTracker
Working with HDFS (HDFS)

Via file commands:
$ hadoop fs -mkdir /user/chuck
$ hadoop fs -put hugeFile.txt
$ hadoop fs -get anotherHugeFile.txt

Programmatically (HDFS API):
FileSystem hdfs = FileSystem.get(new Configuration());
FSDataOutputStream out = hdfs.create(filePath);
while (...) {
    out.write(buffer, 0, bytesRead);
}
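Reading a file back is symmetrical; a short sketch with the same API (FSDataInputStream comes from org.apache.hadoop.fs; the path is illustrative):

FileSystem hdfs = FileSystem.get(new Configuration());
FSDataInputStream in = hdfs.open(new Path("/user/chuck/hugeFile.txt"));
byte[] buffer = new byte[4096];
int bytesRead;
while ((bytesRead = in.read(buffer)) > 0) {
    // process buffer[0..bytesRead)
}
in.close();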
Tips & Tricks
Tip #1: Hadoop Configuration Types (Tips & Tricks)
- Local (standalone) mode: assists debugging and developing the job logic
- Pseudo-distributed mode: a "cluster of one"; like local mode, usually run on a subset of the data
- Fully distributed mode: the real cluster; switching between the modes is easy
Tip #2: JobTracker UI (Tips & Tricks)
- Monitoring events in the cluster can prove to be a bit difficult
- The JobTracker exposes a web interface for our cluster
- Shows a summary of the cluster
- Details about the jobs that are currently running, completed and failed
[Screenshot: JobTracker web UI] (Tips & Tricks)
Tip #3: IsolationRunner – Hadoop's Time Machine (Tips & Tricks)
- Digging through logs, or… rerunning the exact same scenario with the same input on the same node?
- IsolationRunner can rerun the failed task to reproduce the problem
- Attach a debugger
- Set keep.failed.task.files=true
Tip #4: Compression (Tips & Tricks)
- Output of the map phase (which will be shuffled across the network) can be quite large
- Built-in support for compression; different codecs: gzip, bzip2, etc.
- Transparent to the developer:

conf.setCompressMapOutput(true);
conf.setMapOutputCompressorClass(GzipCodec.class);
Tip #5: Speculative Execution (Tips & Tricks)
- A node can experience a slowdown, thus slowing down the entire job
- If a task is identified as "slow", it will be scheduled to run on another node in parallel
- As soon as one copy finishes successfully, the others will be killed
- An optimization, not a feature (configurable; see the sketch below)
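Speculative execution can be switched on or off per job; a small sketch, assuming the setMapSpeculativeExecution / setReduceSpeculativeExecution setters of the old JobConf API:

JobConf job = new JobConf(WordCount.class);
// disable speculative execution, e.g. for tasks with external side effects
job.setMapSpeculativeExecution(false);
job.setReduceSpeculativeExecution(false);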
Tip #6: DataJoin Package (MapReduce)
- Input can come from two (or more) different sources
- Hadoop has a contrib package called datajoin
- A generic framework for performing reduce-side joins
Hadoop in the Cloud – Amazon Web Services
Cloud Computing and AWS (Hadoop in the Cloud)
- Cloud computing: shared resources and information are provided on demand
- Rent a cluster rather than buy one
- The best-known infrastructure for cloud computing is Amazon Web Services (AWS), launched in July 2002
Hadoop in the Cloud – Core Services
- Elastic Compute Cloud (EC2): a large farm of VMs that a user can rent to run a computer application; a wide range of instance types to choose from (price varies)
- Simple Storage Service (S3): online storage for persisting MapReduce data for future use
- Hadoop comes with built-in support for EC2 and S3:

$ hadoop-ec2 launch-cluster <cluster-name> <num-of-slaves>
EC2 Data Flow
[Diagram: our data is loaded into HDFS on EC2, where the MapReduce tasks run]
EC2 & S3 Data Flow
[Diagram: our data is persisted in S3 and loaded into HDFS on EC2 for the MapReduce tasks]
Hadoop-Related Projects
Pig (Hadoop-Related Projects)
- Thinking at the level of map, reduce and job chaining instead of simple data-flow operations is non-trivial
- Pig simplifies Hadoop programming
- Provides a high-level data processing language: Pig Latin
- Used by Yahoo! (70% of production jobs), Twitter, LinkedIn, eBay, etc.
- Problem: a Users file and a Pages file; find the top 5 most visited pages by users aged 18-25
Pig Latin – Data Flow Language

Users = LOAD 'users.csv' AS (name, age);
Fltrd = FILTER Users BY age >= 18 AND age <= 25;
Pages = LOAD 'pages.csv' AS (user, url);
Jnd   = JOIN Fltrd BY name, Pages BY user;
Grpd  = GROUP Jnd BY url;
Smmd  = FOREACH Grpd GENERATE group, COUNT(Jnd) AS clicks;
Srtd  = ORDER Smmd BY clicks DESC;
Top5  = LIMIT Srtd 5;
STORE Top5 INTO 'top5sites.csv';
Hive (Hadoop-Related Projects)
- A data warehousing package built on top of Hadoop
- SQL-like queries on large datasets
HBase (Hadoop-Related Projects)
- A Hadoop database for random read/write access (a client sketch follows)
- Uses HDFS as the underlying file system
- Supports billions of rows and millions of columns
- Facebook chose HBase as a framework for their new version of "Messages"
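A rough sketch of what random reads and writes look like with the HBase client API of that era (assuming the classic HTable / Put / Get classes; the table, family and qualifier names are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "messages");        // illustrative table name

    // random write: one row key, one column family/qualifier
    Put put = new Put(Bytes.toBytes("user-123"));
    put.add(Bytes.toBytes("msg"), Bytes.toBytes("body"), Bytes.toBytes("hello"));
    table.put(put);

    // random read of the same cell
    Result result = table.get(new Get(Bytes.toBytes("user-123")));
    byte[] body = result.getValue(Bytes.toBytes("msg"), Bytes.toBytes("body"));
    System.out.println(Bytes.toString(body));

    table.close();
  }
}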
Cloudera (Hadoop-Related Projects)
- A distribution of Hadoop that simplifies deployment by providing the most recent stable version of Apache Hadoop together with additional patches and backports
Mahout (Hadoop-Related Projects)
- Machine learning algorithms for Hadoop
- Coming up next… (-:
Last Words (Summary)
- Big Data can and will cause serious scalability problems for your application
- MapReduce for analysis, a distributed filesystem for storage
- Hadoop = MapReduce + HDFS, and much more
- AWS integration is easy
- Lots of documentation
References
- Hadoop in Action / Chuck Lam
- Hadoop: The Definitive Guide, 2nd Edition / Tom White (O'Reilly)
- Apache Hadoop documentation
- Hadoop @ Last.FM presentation
- MapReduce in Simple Terms / Saliya Ekanayake
- Amazon Web Services

Thank you!

Editor's Notes

  1. Let's say we have a huge log file that we need to analyze, or a large amount of data that we need to store – what shall we do? Add more CPUs or memory to a single machine (node), or add more nodes to the system?
  2. Talk about commodity hardware. "Commodity" hardware is hardware that is easily and affordably available. A device that is said to use "commodity hardware" is one that uses components that were previously available or designed and are thus not necessarily unique to that device. Unfortunately, at some point there won't be a big enough machine available for the larger data sets. More importantly, the high-end machines are not cost effective for many applications. For example, a machine with four times the power of a standard PC costs a lot more than putting four such PCs in a cluster. Spreading and dividing the data between many machines will provide a much higher throughput – a distributed software application.
  3. 1. Because it runs on commodity hardware. 2. With 2x the machines it will run close to 2x faster. 4. Talk about it being mature and very popular (then move to the next slide).
  4. http://wiki.apache.org/hadoop/PoweredBy – Stories: New York Times – converted 11 million articles from TIFF images to PDF; Twitter – we use Hadoop to store and process tweets and log files.
  5. http://hadoopblog.blogspot.com/2010/05/facebook-has-worlds-largest-hadoop.html
  6. Cutting is also the founder of Apache Lucene, a popular text search library.
  7. Although we've spoken about keys and values, we have yet to discuss their types. Speak about why Java serialization is bad, about pluggable frameworks, and about using the predefined Hadoop types. http://tmrp.javaeye.com/blog/552696
  8. Those two static classes will reside in a single file. Those inner classes are independent – during job execution the Mapper and Reducer are replicated and run on various nodes in different JVMs.
  9. Executing two jobs manually is possible, but it's more convenient to automate it. The input of the second will be the output of the first. JobClient.runJob() is blocking.
  10. Useful for writing simple, short programs that are rapidly developed as scripts, or writing programs that can take advantage of non-Java libraries.
  12. It becomes necessary to partition it across a number of separate machines. Filesystems that manage the storage across a network of machines are called distributed filesystems. Since they are network-based, all the complications of network programming kick in, thus making distributed filesystems more complex than regular disk filesystems. For example, one of the biggest challenges is making the filesystem tolerate node failure without suffering data loss.
  13. Let's say you have 100 TB in a file – HDFS abstracts the complexity and gives you the illusion that you're dealing with a single file. Very large files: "very large" in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of data. Streaming data access: HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from source, then various analyses are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record. Commodity hardware: Hadoop doesn't require expensive, highly reliable hardware to run on. It's designed to run on clusters of commodity hardware (commonly available hardware from multiple vendors) for which the chance of node failure across the cluster is high, at least for large clusters. HDFS is designed to carry on working without a noticeable interruption to the user in the face of such failure.
  14. Some of them exist only on one server and some across all servers
  15. The most important daemon. Make sure that the server hosting the NameNode will not store any data locally or perform any computations for a MapReduce program.
  16. Constantly reports to the namenode – informs the namenode of which blocks it is currently storing. Datanodes also poll the namenode to provide information regarding local changes, as well as receiving instructions to create, move or delete blocks.
  17. All blocks in a file except the last block are the same size.
  18. One important aspect of this design is that the client contacts datanodes directly to retrieve data and is guided by the namenode to the best datanode for each block. This design allows HDFS to scale to a large number of concurrent clients, since the data traffic is spread across all the datanodes in the cluster. The namenode meanwhile merely has to service block location requests (which it stores in memory, making them very efficient) and does not, for example, serve data, which would quickly become a bottleneck as the number of clients grew.
  19. One per cluster. Takes snapshots by communicating with the namenode at an interval specified by configuration.
  20. Relaunch, possibly on a different node – up to the configured number of retries.
  21. If the JobTracker fails to receive a message from a TaskTracker, it assumes failure and submits the task to other nodes.
  22. -help to look for help
  23. Local mode is to assist debugging and creating the logic. Usually the local mode and the pseudo-distributed mode will work on a subset of the data. Pseudo is a "cluster of one". Switching is easy.
  24. If there are bugs that sometimes cause a task to hang or slow down, then relying on speculative execution to avoid these problems is unwise and won't work reliably, since the same bugs are likely to affect the speculative task.
  25. You may have a few large data processing jobs that occasionally take advantage of hundreds of nodes, but those same nodes will sit idle the rest of the time. You may be new to Hadoop and want to get familiar with it first before investing in a dedicated cluster. You may own a startup that needs to conserve cash and wants to avoid the capital expense of a Hadoop cluster. In these and other situations, it makes more sense to rent a cluster of machines rather than buy it. You can rent computing and storage services from AWS on demand as your requirements scale. As of this writing, renting a compute unit with the equivalent power of a 1.0 GHz 32-bit Opteron with 1.7 GB RAM and 160 GB disk storage costs $0.10 per hour (varies depending on whether it's a Windows or Unix instance). Using a cluster of 100 such machines for an hour will cost a measly $10!
  26. Supported operating systems on EC2 include more than six variants of Linux, plus Windows Server and OpenSolaris. Other images include one of the operating systems plus pre-installed software, such as a database server, Apache HTTP server, Java application server, and others. AWS offers preconfigured images of Hadoop running on Linux.
  27. Load users, load pages; filter by age; join by name; group on URL; count clicks, sort clicks, get top 5.
  28. 1. Under the covers Pig turns the transformations into a series of MapReduce jobs, but we as programmers are unaware of this, which allows us to focus on the data rather than the nature of execution. 2. Pig Latin is a data flow programming language, whereas SQL is a declarative programming language. A Pig Latin program is a step-by-step set of operations on an input, whereas SQL statements are a set of statements taken together to produce output. 3. A script file – a PigPen plugin for Eclipse also exists.
  29. HBase – uses HDFS as the underlying file system. Supports billions of rows and millions of columns.