Big data is characterized by the 3Vs: volume, velocity, and variety. Hadoop is a framework for the distributed processing of large datasets across clusters of computers. It provides HDFS for storage, MapReduce for batch processing, and YARN for resource management. Additional tools such as Spark, Mahout, and Zeppelin can be used on top of Hadoop for real-time processing, machine learning, and data visualization respectively. Benefits of Hadoop include ease of scaling to large data volumes, high performance via parallel processing, and reliability through data protection and failover.
2. Definition
Big data can be defined by the 3Vs (three Vs):
– Volume: starts as low as 1 terabyte and has no upper limit.
– Velocity: the data volume per unit of time, at least 30 KB/sec.
– Variety: unstructured and semi-structured data added to structured data.
3. Volume
Big data is composed of huge numbers of very small transactions that come in a variety of
formats.
The data produces true value only after it is aggregated and analyzed.
4. Velocity
The required latency is less than 100 ms, measured from the time the data is created to the
time the response is produced.
The throughput requirement can easily be as high as 1,000 messages per second.
5. Variety
Composed of a combination of datasets with differing underlying structures (structured,
semi-structured, or unstructured).
Heterogeneous formats: graphics, JSON, XML, CSV, and log files.
7. Problem
■ Managing the volume of data, caused by the overwhelming volume of incoming data
■ Maintaining system performance, caused by the low velocity of data access
■ Avoiding disjunction of data, caused by the variety of data structures and formats
How can we accomplish these?
9. Maintaining Performance
■ For Batch Processing, use Hadoop MapReduce
■ For Stream Processing, use Apache Spark, Apache Storm, or Apache Drill (see the sketch after this list)
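As a rough illustration of the stream-processing side, the sketch below is a minimal Spark Streaming word count written against Spark's Java API. The socket source on localhost:9999, the local[2] master, and the one-second batch interval are assumptions made for this example, not choices from the original slides.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

public class StreamingWordCount {
    public static void main(String[] args) throws InterruptedException {
        // Local master and 1-second micro-batches are assumptions for this sketch.
        SparkConf conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        // Read lines of text from a plain TCP socket.
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        // Split lines into words and count each word per micro-batch.
        JavaDStream<String> words = lines.flatMap(line -> Arrays.asList(line.split("\\s+")).iterator());
        JavaPairDStream<String, Integer> counts = words
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

        counts.print();          // print the first few counts of every batch
        jssc.start();            // start receiving and processing data
        jssc.awaitTermination(); // block until the job is stopped
    }
}

You can feed it by running "nc -lk 9999" in another terminal and watch per-batch word counts printed once per second.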
10. Avoiding Disjunction
■ Use a flat storage architecture / data lake to hold huge volumes of multi-structured
data.
■ Use the Hadoop Distributed File System (HDFS) to spread that storage across the machines in the cluster.
12. What is Hadoop?
A framework that allows for the distributed processing of large data sets across clusters
of computers using simple programming models.
13. Why Hadoop?
■ The most proven framework in the industry today
■ Open source
■ Rich features and functionality
■ A rich ecosystem of supporting plugins
=> (https://hadoopecosystemtable.github.io/)
14. When to use Hadoop?
■ For processing large volumes of data
■ For parallel data processing
■ For storing a diverse set of data
15. When not to use Hadoop?
■ For a relational database system
■ For a general network file system
■ For non-parallel data processing
17. Data Storage
Hadoop uses HDFS (Hadoop Distributed File System) to store data.
HDFS is a distributed file system designed to be fault tolerant and to run on low-cost
hardware.
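To make this concrete, here is a minimal sketch that uses Hadoop's Java FileSystem API to write a file into HDFS and read it back. The NameNode address hdfs://namenode:8020 and the path /data/lake/events.txt are made-up values for the illustration.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        // Point the client at the NameNode; the address is an assumption for this sketch.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/lake/events.txt"); // hypothetical data-lake path

        // Write once: HDFS follows a write-once-read-many coherency model.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("first event\nsecond event\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read many: any client can now stream the file back.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}

The write-then-read pattern mirrors the write-once-read-many model described in the goals below.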
18. HDFS’ Goals
■ Hardware Failure: detection of faults and quick, automatic recovery.
■ Streaming Data Access: emphasis on high throughput of data access.
■ Large Data Sets: provides high aggregate data bandwidth, scales to hundreds of
nodes in a single cluster, and supports tens of millions of files in a single instance.
■ Simple Coherency Model: HDFS applications need a write-once-read-many access
model for files.
■ Portability: designed to move easily from one platform to another.
19. Data Processing
Data processing can be done in two ways: batch and real-time.
■ Batch processing: execution of a series of jobs.
– Use Hadoop MapReduce (see the sketch after this list)
■ Real-time processing: jobs that execute on data as soon as it arrives.
– Use Apache Spark
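To make the batch side concrete, below is a minimal sketch of the classic word-count MapReduce job in Java. The input and output HDFS paths are taken from the command line, and the combiner reuses the reducer, as in the standard Hadoop tutorial example.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every token in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Submitted with "hadoop jar", the job reads its input splits in parallel across the cluster and writes one output file per reducer.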
22. Resource Management
Resource management means managing all resources in the Hadoop cluster: monitoring for
faults, scheduling jobs, and performing quick automatic recovery.
Hadoop uses YARN for this.
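As a small sketch of what managing resources looks like from the client side, the Java snippet below uses YARN's YarnClient API to list the running NodeManagers and the applications the ResourceManager currently knows about. It assumes a yarn-site.xml on the classpath that points at the cluster.

import java.util.List;

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterStatus {
    public static void main(String[] args) throws Exception {
        // Reads yarn-site.xml from the classpath to locate the ResourceManager.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // List the NodeManagers that are currently running and their capacities.
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println("node " + node.getNodeId()
                    + " capability=" + node.getCapability());
        }

        // List applications known to the ResourceManager and their states.
        List<ApplicationReport> apps = yarnClient.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId() + " "
                    + app.getName() + " " + app.getYarnApplicationState());
        }

        yarnClient.stop();
    }
}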
24. YARN (cont'd)
■ The Scheduler is responsible for allocating resources to the various running
applications subject to familiar constraints of capacities, queues, etc.
– Performs no monitoring or tracking of status for the application
– Offers no guarantees about restarting failed tasks, whether failures are due to the
application or to hardware
– Performs its scheduling function based on the resource requirements of the
applications
25. Analyzing Data
Process of inspecting, cleansing, transforming, and modeling data with the goal of
discovering useful information, suggesting conclusions, and supporting decision-
making.
The goal of analyzing data is to help your business grow.
Hadoop supports this activity with the help of Apache Mahout.
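As one concrete, deliberately simplified example of such analysis, the sketch below builds a user-based recommender with Mahout's legacy Taste API. The ratings.csv input file (one userID,itemID,rating triple per line), the neighborhood size of 10, and the Pearson similarity measure are all assumptions made for the illustration.

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class ItemRecommender {
    public static void main(String[] args) throws Exception {
        // ratings.csv: one "userID,itemID,rating" triple per line (hypothetical file).
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // Compare users by Pearson correlation over their ratings.
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);

        // Consider the 10 most similar users when recommending.
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);

        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 item recommendations for user 1.
        List<RecommendedItem> recommendations = recommender.recommend(1, 3);
        for (RecommendedItem item : recommendations) {
            System.out.println("item " + item.getItemID() + " score " + item.getValue());
        }
    }
}

This runs on a single machine; for cluster-scale data, Mahout also provides distributed implementations of similar algorithms that run as jobs on Hadoop.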
29. Benefits
Hadoop provides several benefits:
■ Ease of scaling
Hadoop is designed as a distributed system
■ Performance
Hadoop is designed to work with distributed and parallel processing
■ Availability & Reliability
The Hadoop platform provides data protection and automatic failover
configuration
30. Conclusion
■ Big data is not a barrier; it is simply data that needs to be managed properly.
■ Use the proper tools to manage it.
■ Prepare a strategy for processing the data (batch or stream).
■ Manage and maintain the system carefully.
■ Use the plugins required by your functional requirements.
■ Grow your business with a data-driven approach.
An RDBMS can scale for read operations, but to scale write operations you have to drop ACID requirements, which violates the core rules of an RDBMS.
Hadoop is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.
Batch: “where data is collected and then processed as one unit with processing completion times on the order of hours or days”