Big Data Technology Stack
Big Data Technology Stack
Technology Stack
By: Khalid Imran
Agenda
Big Data Stack : In a Nutshell
Data Layer
Data Processing Layer
Data Ingestion Layer
Data Layer
Hadoop Distributed File System (HDFS)
HDFS is a scalable, fault-tolerant Java based distributed file system that is used for storing
large volumes of data in inexpensive commodity hardware.
Apache HBase
HBase is an open-source NoSQL database that provides real-time read/write access to large
datasets with extremely low latency as well as fault tolerance. HBase runs on top of HDFS. HBase
provides a strong consistency model, and range-based partitioning. Reads, including range-based
reads, tend to scale much better on HBase, whereas writes do not scale as well as they do on
Cassandra..
Apache Solr
Apache Solr is the open source platform for searches of data stored in HDFS in Hadoop. Solr
powers the search and navigation features of many of the worlds largest Internet sites, enabling
powerful full-text search and near real-time indexing. Apache Solr can be used for rapidly finding
tabular, text, geo-location or sensor data that is stored in Hadoop.
Apache Mahout
Apache Mahout is a library of scalable machine-learning algorithms that can be implemented on
top of Apache Hadoop and it utilizes the MapReduce paradigm. Machine learning is a discipline of
artificial intelligence focused on enabling machines to learn without being explicitly programmed,
and it is commonly used to improve future performance based on previous outcomes. Mahout
provides the tools and algorithms to automatically find meaningful patterns in those big data sets
stored in the HDFS.
Apache Sqoop
Apache Sqoop is a tool designed to transfer data between Hadoop and relational databases or
mainframes. Sqoop can be used to import data from a RDBMS or a mainframe into HDFS,
transform the data using Hadoop MapReduce, and then export the data back into an RDBMS.
10
Apache ZooKeeper provides operational services for a Hadoop cluster. It provides a distributed
configuration service, a synchronization service and a naming registry for distributed systems that
can use Zookeeper to store and mediate updates to important configuration information.
11
12