Hadoop
        Framework for Distributed Applications




                                        Nishant M Gandhi
                                BE 4th YEAR Comp. Eng.
C K Pithawalla College of Engineering & Technology, Surat.
Hadoop

• Introduction
• History
• Key Technologies
  – MapReduce
  – HDFS
• Other Projects On Hadoop
• Conclusion
Introduction:

What is Hadoop?
 Hadoop is a framework for running applications on large clusters
 built of commodity hardware.
                          ----HADOOP WIKI


  Hadoop is a free, Java-based programming framework that
  supports the processing of large data sets in a distributed
  computing environment.
Introduction (contd.)


 #1 Open source
 #2 Part of the Apache Software Foundation
 #3 Power of Java
 #4 Supported by major web companies




#1 Based on Google’s MapReduce computation model
#2 Hadoop Distributed File System (HDFS), inspired by the Google File
System (GFS)
#3 Used for cluster and distributed computing
#4 Support from…
History:
Inventor: Doug Cutting, creator of Apache Lucene

The Origin of the Name “Hadoop”:

The name my kid gave a stuffed yellow elephant. Short, relatively easy to
spell and pronounce, meaningless, and not used elsewhere: those are my
naming criteria. ---Doug Cutting.

Started with building a web search engine
   •Nutch in 2002
   •Aim was to index billions of pages
   •The original architecture could not scale to billions of pages
Google’s GFS paper in 2003 showed how to solve the storage problem
   •Nutch Distributed Filesystem (NDFS) in 2004
Google’s MapReduce paper in 2004
   •MapReduce implemented in Nutch in 2005
In February 2006, NDFS and MapReduce moved out of Nutch to form an
independent subproject of Lucene called Hadoop.
History (contd.)
At around the same time, Doug Cutting joined Yahoo!
In February 2008, Yahoo! announced that its production search index
was being generated by a 10,000-core Hadoop cluster




 In January 2008, Hadoop was made its own top-level project at
 Apache, confirming its success and its diverse, active community.
 By this time Hadoop was being used by many other companies
 besides Yahoo!, such as
     • Last.fm
     • Facebook
     • The New York Times
     • Twitter
     • Microsoft
     • IBM
Key Technologies:
   •MapReduce
     -Computational Parallel Programming Model
     -Technology developed by Google




   •Hadoop Distributed File System
     -Distributed File System for large data sets
     -Inspired by Google File System
Key Technologies: MapReduce

 •   Programming model developed at Google

 •   Sort/merge based distributed computing

 •   Initially, it was intended for Google’s internal search/indexing
     application, but it is now used extensively by many other
     organizations (e.g., Yahoo!, Amazon.com, IBM)

 •   It is a functional style of programming (as in LISP) that is
     naturally parallelizable across a large cluster of workstations or PCs.

 •   The underlying system takes care of the partitioning of the
     input data, scheduling the program’s execution across several
     machines, handling machine failures, and managing required
     inter-machine communication. (This is the key to Hadoop’s
     success)
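
The model above can be illustrated with a small single-process sketch of word counting. This is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions made for illustration, and the shuffle step that a real cluster performs across machines is simulated here with an in-memory dictionary.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle/sort: group values by key, sorted by key (done by the framework)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key, values):
    """Reduce: combine all values collected for one key."""
    return key, sum(values)


def word_count(documents):
    # Each document stands in for an input split handled by one map task.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input splits.
splits = ["Hadoop runs MapReduce", "MapReduce runs on HDFS"]
counts = word_count(splits)
print(counts)
```

Only `map_phase` and `reduce_phase` correspond to code a programmer would write; partitioning, grouping, scheduling and failure handling are the framework's job, which is why the map and reduce functions contain no distribution logic at all.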
Key Technologies: HDFS

 • At Google, MapReduce operations run on a special file system
   called Google File System (GFS) that is highly optimized for this
   purpose.

 • GFS is not open source.

 • Doug Cutting and others built an open-source counterpart modeled on
   the published GFS design, first as NDFS within Nutch and later as
   the Hadoop Distributed File System (HDFS).
Key Technologies: HDFS

  •   Very Large Distributed File System
      – 10K nodes, 100 million files, 10 PB
  •   Assumes Commodity Hardware
      – Files are replicated to handle hardware failure
      – Detects failures and recovers from them
  •   Optimized for Batch Processing
      – Data locations exposed so that computations can move to
      where data resides
      – Provides very high aggregate bandwidth
  •   Runs in user space on heterogeneous operating systems
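
The replication, failure-recovery and data-locality points above can be sketched in a few lines. This is a toy model, not HDFS code: the replication factor of 3, the round-robin placement, the node names and the helper names (`place_blocks`, `handle_failure`, `schedule`) are all assumptions chosen for illustration, and it assumes there are at least as many live nodes as replicas.

```python
REPLICATION = 3  # assumed replication factor for this sketch


def place_blocks(num_blocks, nodes, replication=REPLICATION):
    """Store each block on `replication` distinct nodes (simple round-robin)."""
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


def handle_failure(placement, failed, live_nodes, replication=REPLICATION):
    """Forget the failed node, then re-replicate blocks that lost a copy."""
    for holders in placement.values():
        if failed in holders:
            holders.remove(failed)
        for node in live_nodes:
            if len(holders) >= replication:
                break
            if node not in holders:
                holders.append(node)
    return placement


def schedule(block, placement, idle_nodes):
    """Prefer an idle node that already holds the block (move computation to data)."""
    for node in idle_nodes:
        if node in placement[block]:
            return node, "local"
    return idle_nodes[0], "remote"


# Hypothetical five-node cluster storing a four-block file.
nodes = ["n1", "n2", "n3", "n4", "n5"]
placement = place_blocks(4, nodes)
placement = handle_failure(placement, "n3", [n for n in nodes if n != "n3"])
choice = schedule(0, placement, ["n5", "n4"])
print(placement)
print(choice)
```

Because block locations are exposed to the scheduler, a task for block 0 goes to a node that already stores it whenever such a node is idle, and falls back to a remote read only when none is.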
Other Projects on Hadoop:

        ZooKeeper: A coordination service for distributed applications.



        Pig: A high-level data-flow language and execution
        framework for parallel computation.



        Hive: A data warehouse infrastructure that provides
        data summarization and ad hoc querying.



         Chukwa: A data collection system for managing
         large distributed systems.
Other Projects on Hadoop:

               Avro: Apache Avro is a data serialization system.
               Avro provides:
               •Rich data structures.
               •A compact, fast, binary data format.
               •A container file, to store persistent data.
               •Simple integration with dynamic languages.
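
One reason a binary format can be compact: as far as the Avro specification describes it, `int` and `long` values are written with variable-length zig-zag coding, so small numbers take one byte instead of eight. The sketch below illustrates that encoding in plain Python. It is not the Avro library, and the function names (`zigzag`, `encode_long`, `decode_long`) are assumptions made for illustration.

```python
def zigzag(n):
    """Map a signed 64-bit integer to unsigned so small magnitudes stay small."""
    return (n << 1) ^ (n >> 63)


def encode_long(n):
    """Encode as a varint: 7 payload bits per byte, high bit set if more follow."""
    z = zigzag(n)
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_long(data):
    """Inverse of encode_long for a single encoded value."""
    z = 0
    shift = 0
    for byte in data:
        z |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            break
    return (z >> 1) ^ -(z & 1)


for value in (0, -1, 1, 64, 1_000_000):
    encoded = encode_long(value)
    print(value, encoded.hex(), len(encoded), "byte(s)")
```

Values near zero, positive or negative, fit in a single byte, and the length grows only as the magnitude does.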


               Just as Google's Bigtable leverages the
               distributed data storage provided by the
               Google File System, HBase provides
               Bigtable-like capabilities on top of
               Hadoop Core.
Hadoop Architecture on DELL C Series Server:
Conclusion:



Hadoop has been a very effective solution for companies dealing
 with data in petabytes.

It has solved many industry problems related to large-scale data
 management and distributed systems.

Because it is open source, it has been widely adopted by companies.
Thank You…..
