Experiment No.: 1
Experiment Title: Installing Hadoop, Configuring HDFS, and Configuring Hadoop.
Date:
1. Objective
• To download, install, and configure Hadoop on a local machine or server.
• To understand different Hadoop operational modes (Standalone, Pseudo-Distributed, Fully
Distributed).
• To explore Hadoop’s start-up scripts and configuration files.
2. Theory
• Hadoop is an open-source framework that enables distributed storage and processing of large
datasets using simple programming models. It is designed to scale up from a single server to
thousands of machines.
• Hadoop Modes:
o Standalone Mode: The simplest mode, where Hadoop runs as a single Java process.
o Pseudo-Distributed Mode: Runs on a single node but mimics a fully distributed
environment; every Hadoop daemon runs as a separate process on the same machine, and data is stored in HDFS rather than the local filesystem.
o Fully Distributed Mode: A true multi-node setup, where data is distributed across several
machines, offering full scalability and fault tolerance.
• Key Configuration Files:
o core-site.xml: Configures Hadoop's core settings, most importantly the default filesystem URI (fs.defaultFS).
o hdfs-site.xml: Configures HDFS replication and storage directories.
o mapred-site.xml: Configures MapReduce framework settings.
o yarn-site.xml: Configures YARN resource management settings.
3. Requirements
• Software:
o Java Development Kit (JDK 8 or above)
o Hadoop (latest stable version, e.g., Hadoop 3.x)
• Hardware:
o Minimum 4 GB RAM
o Stable internet connection for downloading Hadoop packages
4. Procedure
1. Download Hadoop:
o Visit the Apache Hadoop official website: https://hadoop.apache.org/.
o Download the latest stable release of Hadoop (e.g., Hadoop 3.x).
2. Install Hadoop:
o Extract the downloaded Hadoop package to a desired directory (e.g., /usr/local/hadoop).
o Set environment variables in the ~/.bashrc file:
Code
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
o Source the ~/.bashrc file to apply the changes:
Code
source ~/.bashrc
Prerequisites:
The following software must be prepared to install Hadoop 2.9.2 on Windows 11 (64-bit):
1. Download Hadoop 2.9.2 from either of these links:
• http://wwweu.apache.org/dist/hadoop/common/hadoop-2.9.2/hadoop-2.9.2.tar.gz
• http://archive.apache.org/dist/hadoop/core/hadoop-2.9.2/hadoop-2.9.2.tar.gz
1) Check whether Java 1.8.0 is already installed on your system; use "javac -version" to check.
2) If Java is not installed on your system, first install Java under "C:\JAVA".
3) Extract hadoop-2.9.2.tar.gz (or hadoop-2.9.2.zip) and place the contents under "C:\Hadoop-2.9.2".
4) Set the HADOOP_HOME environment variable on Windows 11.
5) Set the JAVA_HOME environment variable on Windows 11.
6) Next, add the Hadoop bin directory and the Java bin directory to the Path variable.
Configuration of Hadoop
1) Edit the file C:\Hadoop-2.9.2\etc\hadoop\core-site.xml, paste the XML snippet below, and save the file.
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
Configuration of HDFS
2) Edit the file C:\Hadoop-2.9.2\etc\hadoop\hdfs-site.xml, paste the XML snippet below, and save the file.
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>C:\hadoop-2.9.2\data\namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>C:\hadoop-2.9.2\data\datanode</value>
</property>
</configuration>
Hadoop Configuration
1. Download the file Hadoop Configuration.zip.
2. Delete the bin folder at C:\Hadoop-2.9.2\bin and replace it with the bin folder from the downloaded Hadoop Configuration.zip.
3. Open cmd and type the command "hdfs namenode -format". The NameNode storage directory should be formatted without errors.
Testing
1. Open cmd, change directory to "C:\Hadoop-2.9.2\sbin", and type "start-all.cmd" to start all Hadoop daemons (HDFS and YARN).
2. Open the ResourceManager UI: http://localhost:8088
3. Open the NameNode UI: http://localhost:9870 (on Hadoop 2.x the NameNode UI is served at http://localhost:50070 instead)
5. Experiment Code
• N/A (This experiment is focused on configuration rather than coding).
6. Execution
• After configuring the files, format the namenode and start the Hadoop services.
• Verify successful installation by accessing the HDFS and YARN UIs on the specified URLs.
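• A quick way to confirm that both web UIs are reachable is the small Python check below. It is only a convenience sketch and assumes the default ports (8088 for YARN; 9870 for the NameNode on Hadoop 3.x, 50070 on Hadoop 2.x).
Python Code
# check_ui.py -- verify that the Hadoop web UIs respond after start-up
import urllib.request

for name, url in [("YARN ResourceManager", "http://localhost:8088"),
                  ("HDFS NameNode", "http://localhost:9870")]:
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            print(f"{name}: HTTP {resp.status}")
    except OSError as exc:
        print(f"{name}: not reachable ({exc})")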
7. Observations
• The NameNode UI at http://localhost:9870 should display Hadoop filesystem information.
• The ResourceManager UI at http://localhost:8088 should show the status of YARN resources and
applications.
8. Analysis
• In this setup, Hadoop is configured to run in pseudo-distributed mode, simulating a distributed
environment on a single machine.
• This setup provides insights into how data is stored, managed, and processed in HDFS and how
resources are allocated in YARN.
9. Conclusion
• Successfully installed and configured Hadoop and HDFS in pseudo-distributed mode.
• Explored key configuration files and verified setup through Hadoop’s web UIs.
12. References
• Apache Hadoop Documentation: https://hadoop.apache.org/docs/
• Hadoop: The Definitive Guide by Tom White
Experiment No.: 2
Experiment Title: Working on HDFS
Date:
1. Objective
• To explore basic file operations in Hadoop Distributed File System (HDFS).
• To perform tasks such as creating directories, uploading/downloading files, and managing files in
HDFS.
• To understand the architecture of HDFS and how it handles file storage and replication.
2. Theory
• HDFS (Hadoop Distributed File System): A distributed storage system in Hadoop designed to
store large datasets across multiple machines, ensuring fault tolerance and scalability.
• Replication: HDFS automatically replicates file blocks across different machines to ensure data
availability in case of node failure.
• HDFS Commands:
o hdfs dfs -mkdir <path>: Create a directory in HDFS.
o hdfs dfs -put <local_file> <hdfs_path>: Upload a file from the local system to HDFS.
o hdfs dfs -get <hdfs_file> <local_path>: Download a file from HDFS to the local system.
o hdfs dfs -ls <path>: List files and directories in HDFS.
o hdfs dfs -rm <path>: Delete files/directories in HDFS.
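• These commands can also be scripted. The sketch below simply wraps the same hdfs dfs calls with Python's subprocess module; it assumes the hdfs command is on the PATH and uses an illustrative /user/student path.
Python Code
# hdfs_ops.py -- run the basic HDFS operations from a script
import subprocess

def hdfs(*args):
    # Each call shells out to the same "hdfs dfs ..." commands listed above
    subprocess.run(["hdfs", "dfs", *args], check=True)

hdfs("-mkdir", "-p", "/user/student/input")        # create a directory
hdfs("-put", "sample.txt", "/user/student/input")  # upload a local file
hdfs("-ls", "/user/student/input")                 # list the directory
hdfs("-rm", "/user/student/input/sample.txt")      # delete the file again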
3. Requirements
• Software:
o Installed Hadoop in Pseudo-Distributed Mode
o Java Development Kit (JDK 8 or above)
• Hardware:
o Minimum 4 GB RAM
4. Procedure
1. Start Hadoop Services:
o Open a terminal and start the Hadoop services:
Code
start-dfs.sh
start-yarn.sh
2. Upload a File to HDFS:
o Create an input directory (if it does not already exist) and upload a local file such as sample.txt:
Code
hdfs dfs -mkdir -p /user/<your-username>/input
hdfs dfs -put sample.txt /user/<your-username>/input
3. List the Uploaded File:
Code
hdfs dfs -ls /user/<your-username>/input
4. Download and Delete Files as Needed:
o Use hdfs dfs -get and hdfs dfs -rm (see the command list in the Theory section) to download and remove files.
5. Experiment Code
• N/A (This experiment involves running HDFS commands, not writing Code).
6. Execution
• Execute the above HDFS commands one by one in the terminal and observe the output.
• Ensure that the directories are created, files are uploaded/downloaded successfully, and files are
deleted as required.
7. Observations
• The ls command in HDFS shows the file structure, similar to the local file system.
• Files uploaded to HDFS are divided into blocks and replicated according to the replication factor.
• The default replication factor is 3, meaning each file block is stored in three different locations for
fault tolerance.
8. Analysis
• HDFS simplifies managing large datasets by providing automatic file replication and fault
tolerance.
• File operations in HDFS are similar to Unix/Linux file commands, making it easier to learn for
users familiar with these systems.
• The ability to handle large files across multiple nodes without user intervention is a key feature of
HDFS.
9. Conclusion
• Successfully explored basic HDFS file operations such as creating directories,
uploading/downloading files, and deleting files.
• HDFS provides a reliable distributed file system that ensures data availability and fault tolerance.
12. References
• Apache Hadoop Documentation: https://hadoop.apache.org/docs/stable/hadoop-project-
dist/hadoop-hdfs/HdfsUserGuide.html
• Hadoop: The Definitive Guide by Tom White
Experiment No.: 3
Experiment Title: Running Jobs on Hadoop
Date:
1. Objective
• To execute a MapReduce job in Hadoop for parallel data processing.
• To understand how Hadoop splits and processes large datasets using the MapReduce
programming model.
• To learn how to submit a job, monitor it, and verify its output on Hadoop’s distributed file system
(HDFS).
2. Theory
• MapReduce is a programming paradigm that allows for massive scalability across hundreds or
thousands of servers in a Hadoop cluster. It breaks down data processing tasks into smaller sub-
tasks that can be executed in parallel.
o Map Phase: The input dataset is split into chunks (or blocks). Each chunk is processed in
parallel by a mapper, which transforms the input into intermediate key-value pairs.
o Reduce Phase: The output from the Map phase is grouped by keys and passed to reducers,
which consolidate the intermediate outputs into the final results.
• Input Splits: Hadoop divides input data into blocks (usually 64MB or 128MB) for processing by
map tasks.
• JobTracker & TaskTracker: JobTracker manages MapReduce jobs, and TaskTrackers execute
tasks in the cluster. In YARN (introduced in Hadoop 2.x), the ResourceManager and
NodeManager manage resources and tasks.
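• The key-value flow described above can be illustrated with a few lines of ordinary Python (a toy, single-process sketch of the word-count data flow; the actual job in this experiment is written in Java):
Python Code
from collections import defaultdict

lines = [
    "Hadoop is an open-source framework",
    "Hadoop enables distributed data processing",
    "Hadoop runs MapReduce jobs",
]

# Map phase: every line is turned into (word, 1) pairs
pairs = [(word, 1) for line in lines for word in line.split()]

# Shuffle/sort: group the intermediate values by key
grouped = defaultdict(list)
for word, one in pairs:
    grouped[word].append(one)

# Reduce phase: sum the values for each key
counts = {word: sum(values) for word, values in grouped.items()}
print(counts)   # e.g. {'Hadoop': 3, 'is': 1, ...}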
3. Requirements
• Software:
o Java Development Kit (JDK 8 or above)
o Hadoop (version 2.x or 3.x) in pseudo-distributed mode
o Sample input text file (e.g., input.txt)
• Hardware:
o Minimum 4 GB RAM
o 20 GB free disk space
4. Procedure
1. Create a Sample Input File
o Open a text editor and create a simple text file (input.txt) containing sample data. For
instance:
Code
Hadoop is an open-source framework
Hadoop enables distributed data processing
Hadoop runs MapReduce jobs
2. Upload Data to HDFS
o Start the Hadoop services (HDFS and YARN):
Code
start-dfs.sh
start-yarn.sh
o Create an input directory in HDFS and upload input.txt:
Code
hdfs dfs -mkdir -p /user/<your-username>/input
hdfs dfs -put input.txt /user/<your-username>/input
3. Write the MapReduce Program (Word Count)
o Mapper: This class tokenizes each input line and emits (word, 1) pairs.
Java Code
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] words = value.toString().split("\\s+");
        for (String str : words) {
            word.set(str);
            context.write(word, one);
        }
    }
}
o Reducer: This class aggregates the mapper output and sums the values for each unique key (word). Only its import statements are preserved in this copy; the reduce() method body is truncated.
Java Code
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
o Driver: This class configures and submits the job. Its imports and the final submission call are shown; the rest of main() is truncated in this copy.
Java Code
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
4. Compile the Code and Build the JAR
Code
javac -classpath `hadoop classpath` -d . WordCountMapper.java WordCountReducer.java WordCount.java
jar -cvf wordcount.jar *.class
5. Run the Job and Check the Output
Code
hadoop jar wordcount.jar WordCount /user/<your-username>/input /user/<your-username>/output
hdfs dfs -cat /user/<your-username>/output/part-r-00000
5. Experiment Code
• Full Java source Code for the Mapper, Reducer, and Driver classes has been provided in Step 3
above.
6. Execution
• Submit the job using the Hadoop command, and observe the results in the specified HDFS output
directory.
• Ensure the Hadoop services (HDFS and YARN) are running, and monitor job progress using the
Hadoop web UI at:
o ResourceManager: http://localhost:8088
o HDFS NameNode UI: http://localhost:9870
7. Observations
• The input text file is split into smaller chunks, and multiple mappers process the data in parallel.
• The Reducer aggregates the results to produce the final word count output.
8. Analysis
• MapReduce provides a powerful mechanism to process and analyze large datasets in a distributed
and fault-tolerant manner.
• The job is divided into multiple tasks (map and reduce), and these tasks are executed across the
cluster in parallel, thus significantly improving processing time for large-scale data.
9. Conclusion
• Successfully executed a MapReduce job in Hadoop.
• The results of the job were verified by analyzing the output generated in HDFS.
• The experiment provided insights into the functionality and scalability of MapReduce for
distributed data processing.
12. References
1. Hadoop: The Definitive Guide by Tom White
2. MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and
Other Systems by Donald Miner, Adam Shook
3. Official Apache Hadoop Documentation: https://hadoop.apache.org/
Experiment No.: 4
Experiment Title: Install Zookeeper
Date:
1. Objective
• To install and configure Apache Zookeeper.
• To understand Zookeeper’s role in distributed coordination.
2. Theory
• Apache Zookeeper: A centralized service for maintaining configuration information, naming,
providing distributed synchronization, and providing group services. It is a key component in
Hadoop, especially in managing services like HBase, Kafka, and others.
• Zookeeper ensures distributed coordination by managing shared resources and providing
mechanisms like leader election, locks, and queues.
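• Zookeeper's basic operations can also be exercised from Python with the third-party kazoo client (an assumption: kazoo installed via pip, and a Zookeeper server listening on the default port 2181). A minimal session sketch:
Python Code
# znode_demo.py -- create, read, and delete a znode (requires: pip install kazoo)
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

zk.ensure_path("/lab")                       # create the parent path if missing
zk.create("/lab/config", b"replication=1")   # store a small piece of shared state
data, stat = zk.get("/lab/config")
print(data.decode(), "version:", stat.version)

zk.delete("/lab/config")                     # clean up so the script can be re-run
zk.stop()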
3. Requirements
• Zookeeper binary package
• Java Development Kit (JDK 8 or above)
4. Procedure
1. Download and Install Zookeeper
o Download the latest version of Zookeeper from the official Zookeeper site.
o Extract the archive and move to the Zookeeper directory.
2. Configure Zookeeper
o Copy the sample configuration file:
Code
cp conf/zoo_sample.cfg conf/zoo.cfg
o Edit zoo.cfg and set the data directory:
Code
dataDir=/path/to/zookeeper/data
3. Start Zookeeper
o Start the Zookeeper server using the following command:
Code
bin/zkServer.sh start
5. Experiment Code
• Configuration steps and Zookeeper commands (start, stop, status).
6. Execution
• Start the Zookeeper service and use the Zookeeper CLI to connect to the server.
7. Observations
• The server status should show that Zookeeper is running correctly, and you should be able to
interact with it using the CLI.
8. Analysis
• Zookeeper plays a crucial role in maintaining the coordination between distributed applications
and ensures data consistency.
9. Conclusion
• Successfully installed and configured Zookeeper, understanding its role in distributed
coordination.
12. References
1. Official Zookeeper Documentation: https://zookeeper.apache.org/
2. Hadoop Operations by Eric Sammer
3. ZooKeeper: Distributed Process Coordination by Flavio Junqueira, Benjamin Reed
Experiment No.: 5
Experiment Title: Pig Installation
Date:
1. Objective
• To install and configure Apache Pig.
• To execute basic Pig Latin scripts for data processing.
2. Theory
• Apache Pig: A platform for analyzing large datasets that uses a high-level language called Pig
Latin. It simplifies data processing by abstracting the complexity of MapReduce.
• Pig operates on data flow and is useful for ETL (Extract, Transform, Load) processes.
3. Requirements
• Apache Pig binary package
• Hadoop installed and configured
• Sample data file
4. Procedure
1. Download and Install Pig
1. Prerequisites
1.1. Hadoop Cluster Installation
Apache Pig is a platform built on top of Hadoop, so a Hadoop single-node cluster must already be installed (see Experiment No. 1).
1.2. 7zip
7zip is needed to extract the .tar.gz archives downloaded in this guide.
2. Downloading and Configuring Apache Pig
Download Apache Pig from https://downloads.apache.org/pig/ and extract it (here to E:\hadoop-env\pig-0.17.0). Then set the following environment variable:
• PIG_HOME: "E:\hadoop-env\pig-0.17.0"
Now edit the Path user variable to add the following entry:
• %PIG_HOME%\bin
Open a command prompt as administrator and execute the command pig -version.
If it fails with a HADOOP_BIN_PATH-related error, edit the pig.cmd file located in the "pig-0.17.0\bin" directory and change the HADOOP_BIN_PATH value from "%HADOOP_HOME%\bin" to "%HADOOP_HOME%\libexec".
The simplest way to write Pig Latin statements is the Grunt shell, an interactive tool where we write a statement and get the desired output. There are two modes in which to invoke the Grunt shell:
1. Local mode: all scripts are executed on a single machine without requiring Hadoop (command: pig -x local).
2. MapReduce mode: scripts are executed against the Hadoop cluster and HDFS (command: pig).
3. Write a Word Count Script (wordcount.pig)
Code
-- Load the data (the statements after LOAD are reconstructed here; the original script is truncated)
lines = LOAD '/user/<your-username>/input.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group, COUNT(words);
STORE counts INTO '/user/<your-username>/output';
o Run the script:
Code
pig wordcount.pig
4. Verify Output
o Check the output in the HDFS directory:
Code
hdfs dfs -cat /user/<your-username>/output/part-r-00000
5. Experiment Code
• Pig script to count the occurrence of words in an input file.
6. Execution
• Run the Pig script and verify that the word counts are correctly stored in the output directory.
7. Observations
• The Pig script simplifies data processing by abstracting MapReduce complexities.
8. Analysis
• Apache Pig is a highly efficient tool for processing large-scale datasets and can be used in ETL
processes.
9. Conclusion
• Successfully installed and configured Pig, and executed a Pig Latin script for word count.
12. References
1. Official Pig Documentation: https://pig.apache.org/
2. Programming Pig by Alan Gates
3. Hadoop: The Definitive Guide by Tom White
Experiment No.: 6
Experiment Title: Sqoop Installation
Date:
1. Objective
• To install and configure Apache Sqoop for data transfer between Hadoop and relational databases.
2. Theory
• Apache Sqoop: A tool designed for efficiently transferring bulk data between Hadoop and
relational databases such as MySQL, Oracle, or PostgreSQL.
• Sqoop can be used to import data from RDBMS into HDFS or export data from HDFS to
RDBMS.
3. Requirements
• Apache Sqoop binary package
• Relational database (e.g., MySQL)
• Hadoop installed and configured
4. Procedure
1. Download and Install Sqoop
Prerequisites:
1. Hardware Requirement
* RAM — min. 8 GB; with an SSD in the system, 4 GB RAM would also work.
* CPU — min. quad core, with at least 1.80 GHz.
2. Hadoop
* Hadoop 2.9.2 is used here; any other stable version of Hadoop will also work.
* If you don't have Hadoop installed, refer to Experiment No. 1.
• Now organize the Sqoop installation: create a folder and move the extracted files into it. Sqoop is placed here in the D: drive; C: or any other drive can be used as well.
• To edit environment variables, go to Control Panel > System and click the "Advanced system settings" link. Alternatively, right-click the This PC icon, click Properties, and then click the "Advanced system settings" link, or simply search for "Environment Variables" in the search bar.
• Note: If you want the path to be set for all users, select "New" under System Variables.
Configure SQOOP
If you have already installed a database such as MySQL, PostgreSQL, Oracle, SQL Server, or DB2, you can skip this step and move ahead.
The next important step in configuring Sqoop is to create a user for MySQL. This user is used to connect Sqoop to the MySQL database for reading and writing data.
Now open the Administration tab in MySQL Workbench and select Users and Privileges under Management.
• Now select the Add Account option and create a new user with Login Name sqoop, Limit to Hosts Matching set to localhost, and a password of your choice.
• Define the roles for this user under Administrative Roles.
• After clicking OK, select all the privileges for this schema.
• Click Apply, and the Sqoop user is created.
2. Add the MySQL JDBC Driver and Verify Connectivity
o Download the MySQL Connector/J archive, extract it, and copy the JAR into Sqoop's lib directory:
Code
wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-8.0.23.zip
unzip mysql-connector-java-8.0.23.zip
cp mysql-connector-java-8.0.23/mysql-connector-java-8.0.23.jar $SQOOP_HOME/lib/
o Verify connectivity to the MySQL database:
Code
sqoop list-databases --connect jdbc:mysql://localhost:3306/ --username root --password <password>
3. Import Data from MySQL to HDFS
o Import a specific table from MySQL to HDFS:
Code
sqoop import --connect jdbc:mysql://localhost:3306/test --username root --password
<password> --table employees --target-dir /user/<your-username>/employees
4. Export Data from HDFS to MySQL
o Export data from HDFS to MySQL:
Code
sqoop export --connect jdbc:mysql://localhost:3306/test --username root --password
<password> --table employees --export-dir /user/<your-username>/employees
5. Experiment Code
• Sqoop commands to import and export data between Hadoop and MySQL.
6. Execution
• Use Sqoop to import/export data between HDFS and a MySQL database.
7. Observations
• Sqoop simplifies the process of transferring large datasets between RDBMS and
Hadoop.
8. Analysis
• Sqoop is highly effective for ETL processes where data needs to be transferred in bulk
between traditional databases and Hadoop.
9. Conclusion
• Successfully installed and configured Sqoop, and performed data import/export
operations between HDFS and MySQL.
10. Viva Questions
1. What is the purpose of Apache Sqoop?
2. How do you import a table from MySQL to Hadoop using Sqoop?
3. What are some advantages of using Sqoop?
4. What is the difference between import and export in Sqoop?
5. How does Sqoop handle failures during data transfer?
11. Multiple Choice Questions (MCQs)
1. What is Apache Sqoop used for?
a) Data querying
b) Data transfer between Hadoop and RDBMS
c) Data visualization
d) Data storage
Answer: b) Data transfer between Hadoop and RDBMS
2. Which of the following commands is used to import data from MySQL to HDFS?
a) sqoop export
b) sqoop import
c) sqoop transfer
d) sqoop load
Answer: b) sqoop import
3. What does the --connect option in Sqoop do?
a) Specifies the connection string for the database
b) Specifies the input format for data
c) Specifies the output format for data
d) Specifies the username for authentication
Answer: a) Specifies the connection string for the database
4. What does Sqoop use to transfer data from RDBMS to HDFS?
a) JDBC
b) ODBC
c) RPC
d) REST API
Answer: a) JDBC
5. Sqoop is primarily used for:
a) Data querying
b) Data processing
c) Data transfer
d) Data visualization
Answer: c) Data transfer
12. References
1. Official Sqoop Documentation: https://sqoop.apache.org/
2. Hadoop: The Definitive Guide by Tom White
3. Sqoop Cookbook by Kathleen Ting, Jarek Jarcec Cecho
Experiment No.: 7
Experiment Title: HBase Installation
Date:
1. Objective
• To install and configure Apache HBase.
• To create and manage tables in HBase using the HBase shell.
2. Theory
• Apache HBase: A distributed, scalable, NoSQL database built on top of Hadoop. It is
modeled after Google’s BigTable and is used for storing large amounts of sparse data.
• HBase provides real-time read/write access to data in HDFS.
3. Requirements
• Apache HBase binary package
• Hadoop installed and configured
• Zookeeper installed and running
4. Procedure
1. Download and Install HBase
Prerequisite
• Install Java JDK - You can download it from
this link. (https://www.oracle.com/java/technologies/downloads/)
The Java Development Kit (JDK) is a cross-platform software development
environment that includes tools and libraries for creating Java-based software
applications and applets.
Steps
Step 1 (Extraction of files)
Extract all the HBase files into the C: drive.
Step 2 (Configuration)
Edit hbase-site.xml; only a fragment of the configuration is preserved in this copy — the value shown points the relevant property at localhost:
<value>localhost</value>
</property>
2. Start HBase
o Start the HBase service:
Code
start-hbase.sh
3. Open the HBase Shell
Code
hbase shell
4. Create a Table and Insert Data
Code
create 'employees', 'personal_data', 'professional_data'
Code
put 'employees', 'row1', 'personal_data:name', 'John'
put 'employees', 'row1', 'professional_data:salary', '50000'
o Read the row back:
Code
get 'employees', 'row1'
5. Stop HBase
o To stop HBase, use the following command:
Code
stop-hbase.sh
5. Experiment Code
• HBase shell commands to create a table, insert data, and query the table.
6. Execution
• Use the HBase shell to create and manage a table, insert data, and perform basic
queries.
7. Observations
• HBase allows for efficient management of large-scale data with real-time read/write
access.
8. Analysis
• HBase is well-suited for use cases requiring real-time access to large datasets, such as
online applications and analytics platforms.
9. Conclusion
• Successfully installed and configured HBase, and managed data in an HBase table
using the HBase shell.
12. References
1. Official HBase Documentation: https://hbase.apache.org/
2. HBase: The Definitive Guide by Lars George
3. Hadoop: The Definitive Guide by Tom White
4. Installation Guide:https://www.naukri.com/code360/library/hbase-installation-on-
windows.
Experiment No.: 8
Experiment Title: Hadoop Streaming
Date:
1. Objective
• To understand Hadoop Streaming and its functionality in processing data using
custom MapReduce scripts in Python or other non-Java languages.
• To implement and run a Hadoop Streaming job using a Mapper and Reducer script
written in Python.
2. Theory
• Hadoop Streaming:
Hadoop Streaming is a utility provided by Hadoop to allow users to create and run
MapReduce jobs with any executable or script as the Mapper and Reducer. It enables
flexibility in leveraging non-Java programming languages like Python, Ruby, or Perl
for data processing.
• Key Components:
o Mapper: Processes each line of input and emits key-value pairs.
o Reducer: Aggregates the key-value pairs emitted by the Mapper and provides
the final output.
• Workflow:
1. Data is divided into splits and sent to Mappers.
2. Mappers process the data and emit key-value pairs.
3. Reducers aggregate the key-value pairs and output the final result.
• Use Cases:
o Word count analysis.
o Log file analysis.
o Custom data transformations and aggregations.
3. Requirements
• Software:
o Hadoop (latest stable version)
o Python (for writing Mapper and Reducer scripts)
• Hardware:
o Minimum 4 GB RAM
o A system or cluster with Hadoop installed
• Dataset:
A sample text file for word count analysis (e.g., sample.txt containing text data).
4. Procedure
1. Prepare Input Data
o Create a sample text file named sample.txt:
Hadoop is a framework for distributed storage and processing of large datasets.
Streaming allows flexibility for non-Java users.
Hadoop Streaming processes data using Mapper and Reducer scripts.
2. Write the Mapper Script (mapper.py)
Python code
# mapper.py
import sys
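o Only the first lines of mapper.py are reproduced above. A minimal word-count mapper consistent with the workflow described in the Theory section would look like this (a sketch, not necessarily the original script):
Python Code
#!/usr/bin/env python3
# mapper.py -- read lines from standard input and emit "<word>\t1" per word
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")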
3. Write the Reducer Script (reducer.py)
o Only the opening lines are reproduced here; the complete script is given in Experiment No. 10.
Python code
# reducer.py
import sys
current_word = None
current_count = 0
word = None
4. Run the Streaming Job
o Upload sample.txt to HDFS and submit the job. The first lines of the command are reconstructed here; the streaming JAR version and the input path may differ on your installation:
Code
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
-files mapper.py,reducer.py \
-input /user/hadoop/sample.txt \
-output /user/hadoop/streaming_output \
-mapper "python3 mapper.py" \
-reducer "python3 reducer.py"
5. Experiment Code
• Mapper Script:
Refer to mapper.py.
• Reducer Script:
Refer to reducer.py.
6. Execution
• Create the input file, upload it to HDFS, and run the streaming job.
• Verify the output by checking the files in the output directory.
7. Observations
• The job processes the text file, and the output displays the count of each word in the
dataset.
• The input data is divided and processed in parallel by Mappers, with Reducers
aggregating the results.
8. Analysis
• Hadoop Streaming allows non-Java developers to leverage Hadoop's distributed
processing power.
• The Mapper and Reducer scripts in Python process the data efficiently, showing
Hadoop's flexibility.
9. Conclusion
• Successfully implemented a Hadoop Streaming job using Python.
• Demonstrated the use of custom Mapper and Reducer scripts to process data.
12. References
1. Official Hadoop Documentation: Hadoop Streaming
2. "Hadoop: The Definitive Guide" by Tom White
3. TutorialsPoint Hadoop Streaming Guide: Streaming Basics
Experiment No.: 9
Experiment Title: Creating a Mapper Function Using Python
Date:
1. Objective
• To understand the role of the Mapper in the Hadoop MapReduce framework.
• To implement a Mapper function using Python for a word count application.
2. Theory
• Mapper in MapReduce:
The Mapper function is the first phase of the MapReduce framework. It processes
input data line-by-line and produces intermediate key-value pairs that are grouped by
the Hadoop framework before being passed to the Reducer.
• Working Principle:
1. Input data is split into chunks and provided to the Mapper.
2. Each line of input is processed, and the Mapper outputs key-value pairs.
3. Hadoop's shuffle and sort phase groups all values with the same key together
for the Reducer.
• Example Use Case:
In a word count problem, the Mapper emits words as keys and 1 as the value for each
occurrence.
3. Requirements
• Software:
o Hadoop (latest stable version)
o Python (for writing the Mapper script)
• Hardware:
o Minimum 4 GB RAM
o A system or cluster with Hadoop installed
• Dataset:
A sample text file for word count analysis (e.g., sample.txt containing text data).
4. Procedure
1. Prepare Input Data
o Create a sample text file named sample.txt:
Python is a powerful programming language.
Hadoop and Python can work together.
This is an example of Mapper functionality.
o Expected Output (excerpt):
Python 1
Hadoop 1
Mapper 1
example. 1
5. Experiment Code
• Mapper Script:
Refer to mapper.py.
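Because the Mapper is just a script reading standard input, it can be tested locally before submitting a Hadoop Streaming job. The sketch below pipes the sample lines through mapper.py with Python's subprocess module (it assumes mapper.py from the previous experiment sits in the current directory and python3 is on the PATH):
Python Code
# test_mapper.py -- feed the sample lines to mapper.py locally, without Hadoop
import subprocess

sample = (
    "Python is a powerful programming language.\n"
    "Hadoop and Python can work together.\n"
    "This is an example of Mapper functionality.\n"
)
result = subprocess.run(
    ["python3", "mapper.py"],
    input=sample, capture_output=True, text=True, check=True,
)
print(result.stdout)   # expect one "<word>\t1" line per word occurrence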
6. Execution
• Create the input file, upload it to HDFS, and run the Mapper using a Hadoop
Streaming job.
• Verify the output by checking the files in the output directory.
7. Observations
• The Mapper processes the input data and emits key-value pairs for each word.
• Output consists of words as keys and 1 as the value, indicating their occurrence in the
input dataset.
8. Analysis
• The Mapper function successfully processes input data line-by-line and emits key-
value pairs.
• This experiment highlights how Python can be used to develop custom Mapper
functions in Hadoop Streaming.
9. Conclusion
• Successfully created and executed a Mapper function in Python.
• Demonstrated the Mapper's role in Hadoop's MapReduce workflow.
12. References
1. Hadoop Streaming Documentation: Apache Hadoop Streaming
2. "Hadoop: The Definitive Guide" by Tom White
3. TutorialsPoint: Hadoop Streaming Basics
Experiment No.: 10
Experiment Title: Creating Reducer Function Using Python
Date:
1. Objective
• To understand the role of the Reducer in the Hadoop MapReduce framework.
• To implement a Reducer function using Python for a word count application.
2. Theory
• Reducer in MapReduce:
The Reducer function is the second phase of the MapReduce framework. It aggregates
and processes the grouped intermediate key-value pairs produced by the Mapper to
generate final output results.
• Working Principle:
1. The output from the Mapper (key-value pairs) is shuffled and sorted, grouping
all values associated with the same key together.
2. These grouped key-value pairs are provided to the Reducer.
3. The Reducer processes each group and generates a final result.
• Example Use Case:
In a word count problem, the Reducer aggregates the counts of each word (key) and
outputs the total count for each word.
3. Requirements
• Software:
o Hadoop (latest stable version)
o Python (for writing the Reducer script)
• Hardware:
o Minimum 4 GB RAM
o A system or cluster with Hadoop installed
• Dataset:
Intermediate output from a Mapper (e.g., word and count pairs).
4. Procedure
1. Prepare Intermediate Input Data
o Create a sample intermediate input file named mapper_output.txt to simulate
the output of a Mapper:
Python 1
Hadoop 1
Python 1
Example 1
Mapper 1
Hadoop 1
o Note: before reaching the Reducer, Hadoop's shuffle and sort phase orders these lines by key, so identical words arrive consecutively.
2. Write the Reducer Script (reducer.py)
o The loop over standard input and the final flush, missing from this copy, are filled in below:
Python code
# reducer.py
import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
    # Each input line has the form "<word>\t<count>"
    line = line.strip()
    if not line:
        continue
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        continue  # Skip lines with invalid counts
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # Write the result to standard output
            print(f"{current_word}\t{current_count}")
        current_word = word
        current_count = count

# Emit the last word
if current_word:
    print(f"{current_word}\t{current_count}")
o Expected Output:
Example 1
Hadoop 2
Mapper 1
Python 2
5. Experiment Code
• Reducer Script:
Refer to reducer.py.
6. Execution
• Run the MapReduce job, providing the mapper.py and reducer.py scripts.
• Verify the output in the HDFS output directory.
7. Observations
• The Reducer aggregates the counts for each word received from the Mapper.
• Final output contains words and their total counts in the dataset.
8. Analysis
• The Reducer function successfully processes grouped key-value pairs and generates
aggregated results.
• This experiment highlights how Python can be used to develop custom Reducer
functions in Hadoop Streaming.
9. Conclusion
• Successfully created and executed a Reducer function in Python.
• Demonstrated the Reducer's role in Hadoop's MapReduce framework.
12. References
1. Hadoop Streaming Documentation: Apache Hadoop Streaming
2. "Hadoop: The Definitive Guide" by Tom White
3. TutorialsPoint: Hadoop Reducer Basics
Experiment No.: 11
Experiment Title: Python Iterators and Generators
Date:
1. Objective
• To understand and implement Python iterators and generators.
• To explore the differences between iterators and generators in Python.
• To practice writing custom iterators and generators.
2. Theory
• Iterators: Objects in Python that implement the __iter__() and __next__() methods to iterate over a collection of items, one at a time.
• Generators: A simpler way to create iterators using the yield keyword. A generator function returns a generator object that produces a succession of values rather than a single item; a yield statement is used in place of a return statement.
• Difference between Iterators and Generators:
o Iterators are objects that use the __next__() method to get the next value of a sequence; they are written as classes implementing the iterator protocol.
o A generator is a function that produces (yields) a sequence of values using a yield statement; Python builds the iterator machinery automatically.
3. Requirements
• Python 3.x
• Text editor or IDE (e.g., VS Code, PyCharm)
4. Procedure
1. Implement a Custom Iterator
o Write a class to implement an iterator that produces a sequence of even
numbers:
Python Code
class EvenNumbers:
    def __init__(self, max_number):
        self.max = max_number
        self.num = 0
o (The __iter__() and __next__() methods that complete the iterator are shown in the Experiment Code section below.)
2. Implement a Generator
o Write a generator function to yield squares of numbers:
Python Code
def square_numbers(max_number):
for i in range(max_number):
yield i * i
4. Memory Comparison
o Compare the memory usage between an iterator and a generator using
Python's sys.getsizeof() method to demonstrate the efficiency of generators.
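A minimal version of that comparison is sketched below; the exact byte counts depend on the Python build, but the generator's size stays constant while the list grows with the range.
Python Code
import sys

squares_list = [i * i for i in range(1_000_000)]   # materialises every value up front
squares_gen = (i * i for i in range(1_000_000))    # produces values only on demand

print("list:", sys.getsizeof(squares_list), "bytes")       # several megabytes
print("generator:", sys.getsizeof(squares_gen), "bytes")   # a few hundred bytes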
5. Experiment Code
Python Code
class EvenNumbers:
    def __init__(self, max_number):
        self.max = max_number
        self.num = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.num > self.max:
            raise StopIteration
        value = self.num
        self.num += 2
        return value

def square_numbers(max_number):
    for i in range(max_number):
        yield i * i

evens = EvenNumbers(10)
for num in evens:
    print(num)

squares = square_numbers(5)
for square in squares:
    print(square)
6. Execution
• Run the Code in a Python environment and observe the output of both the iterator and
generator.
7. Observations
• Iterators iterate through the sequence with each __next__() call.
• Generators yield values lazily, meaning they generate the value on-demand without
storing all values in memory at once.
8. Analysis
• Generators are more memory-efficient compared to iterators, especially for large
datasets because they yield values on-demand.
• Iterators are class-based and involve defining the __iter__() and __next__() methods explicitly, whereas generators are function-based and easier to implement.
9. Conclusion
• Python iterators and generators are powerful tools for handling sequences of data.
• Generators provide an efficient way to handle large sequences by yielding values
lazily.
12. References
1. Official Python Documentation:
https://docs.python.org/3/tutorial/classes.html#iterators
2. Fluent Python by Luciano Ramalho
3. Python Cookbook by David Beazley and Brian K. Jones
4. https://www.naukri.com/code360/library/iterators-and-generators-in-python
Experiment No.: 12
Experiment Title: Twitter Data Sentimental Analysis Using Flume and Hive
Date:
1. Objective
• To collect Twitter data using Apache Flume.
• To store the collected data in HDFS.
• To analyze the Twitter data for sentiment analysis using Apache Hive.
2. Theory
• Apache Flume:
A distributed service designed to collect, aggregate, and transport large amounts of
log data into HDFS. Flume supports sources like Twitter streams through custom
configurations.
• Apache Hive:
A data warehousing tool that allows SQL-like querying on large datasets stored in
HDFS. Hive can process semi-structured and unstructured data using built-in
functions or UDFs.
• Sentiment Analysis:
The process of analyzing textual data to classify the sentiment as positive, negative, or
neutral. Twitter sentiment analysis helps gauge public opinion and trends.
3. Requirements
• Software:
o Apache Hadoop
o Apache Flume
o Apache Hive
o Python (for sentiment analysis scripting)
• Hardware:
o 8 GB RAM (recommended)
o Internet connection for accessing Twitter API
• Dataset:
Real-time Twitter data collected using Flume.
4. Procedure
1. Set Up Flume to Collect Twitter Data
• Install Flume:
sudo apt-get install flume-ng
• Configure Flume:
o Create a configuration file twitter.conf for Flume:
properties
twitterAgent.sources = twitterSource
twitterAgent.channels = memoryChannel
twitterAgent.sinks = hdfsSink
twitterAgent.sources.twitterSource.type = org.apache.flume.source.twitter.TwitterSource
twitterAgent.sources.twitterSource.consumerKey = <Your_Consumer_Key>
twitterAgent.sources.twitterSource.consumerSecret = <Your_Consumer_Secret>
twitterAgent.sources.twitterSource.accessToken = <Your_Access_Token>
twitterAgent.sources.twitterSource.accessTokenSecret = <Your_Access_Token_Secret>
twitterAgent.channels.memoryChannel.type = memory
twitterAgent.channels.memoryChannel.capacity = 1000
twitterAgent.sinks.hdfsSink.type = hdfs
twitterAgent.sinks.hdfsSink.hdfs.path = hdfs://localhost:9000/user/hadoop/twitter_data/
twitterAgent.sinks.hdfsSink.hdfs.fileType = DataStream
twitterAgent.sinks.hdfsSink.hdfs.writeFormat = Text
twitterAgent.sinks.hdfsSink.hdfs.batchSize = 100
twitterAgent.sinks.hdfsSink.hdfs.rollSize = 0
twitterAgent.sinks.hdfsSink.hdfs.rollInterval = 600
twitterAgent.sources.twitterSource.channels = memoryChannel
twitterAgent.sinks.hdfsSink.channel = memoryChannel
2. Load and Analyze the Data in Hive
o Create a Hive table named twitter_data with a tweet column (the CREATE TABLE statement is not preserved in this copy), then load the collected data:
Sql code
LOAD DATA INPATH '/user/hadoop/twitter_data/' INTO TABLE twitter_data;
o Classify each tweet with a simple keyword-based rule:
Sql code
SELECT tweet,
CASE
WHEN tweet LIKE '%good%' THEN 'Positive'
WHEN tweet LIKE '%bad%' THEN 'Negative'
ELSE 'Neutral'
END AS sentiment
FROM twitter_data;
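o The same keyword rule can be prototyped in Python on a handful of tweets before running it at scale in Hive (a sketch; the keyword lists are illustrative only):
Python Code
POSITIVE = {"good", "great", "love"}      # illustrative keyword lists
NEGATIVE = {"bad", "terrible", "hate"}

def classify(tweet):
    words = set(tweet.lower().split())
    if words & POSITIVE:
        return "Positive"
    if words & NEGATIVE:
        return "Negative"
    return "Neutral"

print(classify("Hadoop is good for big data"))   # -> Positive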
3. Visualize Results
Export the analyzed data to a local file for visualization:
code
hive -e "INSERT OVERWRITE LOCAL DIRECTORY
'/local/sentiment_results/'
SELECT sentiment, COUNT(*) AS count
FROM twitter_data
GROUP BY sentiment;"
Use Python with Matplotlib to create a pie chart for sentiment distribution:
Python code
import matplotlib.pyplot as plt

# Example data
labels = ['Positive', 'Negative', 'Neutral']
sizes = [45, 25, 30]  # Example values, replace with actual counts
colors = ['gold', 'lightcoral', 'lightskyblue']
explode = (0.1, 0, 0)

plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%', startangle=140)
plt.axis('equal')
plt.title("Sentiment Analysis of Twitter Data")
plt.show()
5. Experiment Code
• Flume Configuration: twitter.conf.
• Hive Query: Sentiment analysis using SQL.
• Python Code: Sentiment distribution visualization.
6. Execution
1. Start Flume to collect Twitter data.
2. Load and query the data using Hive.
3. Visualize the sentiment distribution using Python.
7. Observations
• Flume successfully collected Twitter data into HDFS.
• Hive processed the data, extracting insights on public sentiment.
• Visualization provided a clear representation of sentiment distribution.
8. Analysis
• Flume integrates seamlessly with Hadoop for real-time data ingestion.
• Hive's SQL-like interface simplifies sentiment analysis on large datasets.
9. Conclusion
• Successfully conducted Twitter sentiment analysis using Flume and Hive.
• Demonstrated real-time data collection, processing, and visualization.
11. Multiple Choice Questions (MCQs)
1. Which tool is used to collect and ingest Twitter data into Hadoop?
a) Hive
b) Flume
c) Sqoop
d) Pig
Answer: b) Flume
2. What type of data does Flume ingest into Hadoop?
a) Structured
b) Unstructured
c) Semi-structured
d) All of the above
Answer: d) All of the above
3. What is the primary function of Hive in the Hadoop ecosystem?
a) Data ingestion
b) SQL-like querying of large datasets
c) Distributed storage
d) Workflow scheduling
Answer: b) SQL-like querying of large datasets
4. What is the output of the Hive query in sentiment analysis?
a) Structured data
b) Key-value pairs
c) Aggregated sentiment labels
d) JSON data
Answer: c) Aggregated sentiment labels
5. Which Python library is used for visualizing sentiment distribution?
a) pandas
b) matplotlib
c) numpy
d) seaborn
Answer: b) matplotlib
12. References
1. Apache Flume Documentation
2. Apache Hive Documentation
3. "Mining the Social Web" by Matthew A. Russell
Experiment No.: 13
Experiment Title: Business Insights of User Usage Records of Data Cards
Date:
1. Objective
• To analyze the usage patterns of data cards by users.
• To derive business insights that can help improve customer experience and optimize
network usage.
2. Theory
• Big Data in Telecom: Telecom companies generate vast amounts of data through user
interactions, including data usage, call records, and location information. Analyzing
this data helps in improving services, detecting patterns, and creating personalized
user experiences.
• Data Cards: Portable modems or mobile hotspots that allow users to access the
internet over a cellular network. Tracking the data card usage patterns provides
insights into peak usage times, customer preferences, and network performance.
• Business Insights: By analyzing the data usage records, telecom companies can
identify trends in network usage, optimize resources, and provide tailored service
packages to customers.
3. Requirements
• Hadoop cluster (local or distributed)
• Hive or Pig for querying and analysis
• Data usage logs (containing details like user ID, timestamp, data usage, location, etc.)
4. Procedure
1. Collect Data Usage Records
o Gather the data usage records for data cards, containing fields like user ID,
timestamp, data usage (MB/GB), and location.
2. Store Data in HDFS
o Store the collected data in HDFS for distributed storage and processing.
3. Create Hive Tables
o Create tables in Hive to store and query the data:
Sql Code
CREATE EXTERNAL TABLE data_usage (
user_id STRING,
timestamp STRING,
data_used DOUBLE,
location STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'hdfs://localhost:9000/user/hive/data_usage';
4. Run Analysis Queries
o Aggregate total data usage per user, identify peak usage hours, and break usage down by location (the original HiveQL for these queries is not preserved in this copy; a quick local equivalent is sketched below).
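For a quick local sanity check of the same aggregations, the records can be summarised with plain Python before writing the HiveQL (a sketch; it assumes a CSV whose columns match the table above — user_id, timestamp, data_used, location — and a timestamp of the form "YYYY-MM-DD HH:MM:SS"):
Python Code
import csv
from collections import Counter, defaultdict

per_user = defaultdict(float)   # total data used per user
per_hour = Counter()            # number of records per hour of day

with open("data_usage.csv", newline="") as f:
    for user_id, timestamp, data_used, location in csv.reader(f):
        per_user[user_id] += float(data_used)
        per_hour[timestamp.split(" ")[1][:2]] += 1

print(sorted(per_user.items(), key=lambda kv: kv[1], reverse=True)[:5])  # heaviest users
print(per_hour.most_common(3))                                           # peak hours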
6. Execution
• Execute the Hive queries to gather insights into user data card usage and network
performance.
7. Observations
• The total data usage per user provides insights into customer demand for data
services.
• Peak usage times indicate when the network experiences the most traffic.
• Data usage by location helps identify where network performance needs to be
optimized or expanded.
8. Analysis
• The results provide valuable business insights that help telecom companies improve
network services, offer targeted promotions, and manage network congestion during
peak hours.
9. Conclusion
• Successfully analyzed user usage records of data cards and derived business insights
related to peak usage times, high-demand users, and geographical data usage patterns.
12. References
1. Official Hive Documentation: https://cwiki.apache.org/confluence/display/Hive/Home
2. Big Data for Dummies by Judith Hurwitz, Alan Nugent, Fern Halper, and Marcia
Kaufman
3. Big Data Analytics with Hadoop by Bhushan Lakhe
Experiment No.: 14
Experiment Title: Wiki Page Ranking with Hadoop
Date:
1. Objective
• To perform Wiki page ranking using Hadoop and analyze the rank of pages.
• To implement Hadoop MapReduce for distributed processing of large-scale data.
• To understand how MapReduce can be used for ranking algorithms like PageRank.
2. Theory
• PageRank Algorithm: A ranking algorithm used to measure the importance of web
pages by considering the number and quality of links to a page.
• Hadoop Ecosystem: Enables distributed computation and storage for big data
analysis. Hadoop uses HDFS for storage and MapReduce for processing.
• MapReduce: A programming model for processing and generating large datasets with
a parallel, distributed algorithm. It consists of two steps:
1. Map Phase: Processes input data and generates intermediate key-value pairs.
2. Reduce Phase: Aggregates these intermediate values to produce the final
output.
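• The per-iteration logic can be sketched in a few lines of Python before implementing it as a MapReduce job (illustrative only; the link graph and the damping factor 0.85 are assumptions):
Python Code
# One PageRank iteration expressed as a map ("send contributions") and a reduce ("sum them").
links = {"PageA": ["PageB", "PageC"], "PageB": ["PageC"], "PageC": ["PageA"]}
ranks = {page: 1.0 / len(links) for page in links}

def iterate(links, ranks, d=0.85):
    contributions = {page: 0.0 for page in links}
    for page, outs in links.items():            # map: each page shares its rank
        share = ranks[page] / len(outs)
        for dest in outs:
            contributions[dest] += share
    n = len(links)
    return {page: (1 - d) / n + d * c           # reduce: combine with the damping term
            for page, c in contributions.items()}

for _ in range(10):
    ranks = iterate(links, ranks)
print(ranks)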
3. Requirements
Software:
• Hadoop (latest stable version, e.g., Hadoop 3.x)
• Java Development Kit (JDK 8 or above)
Hardware:
• Minimum 4 GB RAM
• Multi-core processor
• Stable internet connection
4. Procedure
4.1 Setup and Data Preparation
1. Download and Install Hadoop:
o Follow the steps mentioned in Experiment No. 1 for setting up Hadoop in
pseudo-distributed mode.
2. Prepare Dataset:
o Use a sample dataset of Wiki pages with link relationships.
o Format: <source_page> <destination_page>
o Place the dataset in HDFS:
hdfs dfs -mkdir /wiki_data
hdfs dfs -put /path/to/wiki_dataset.txt /wiki_data
5. Experiment Code
• The full Java Mapper, Reducer, and Driver sources are not reproduced in this copy; the Python sketch in the Theory section above illustrates the per-iteration map and reduce logic.
6. Execution
1. Upload the dataset to HDFS.
2. Execute the MapReduce job.
3. Verify the output by checking the HDFS directory:
hdfs dfs -ls /output
7. Observations
• The output will contain pages with their updated ranks.
• Example output format:
PageA 0.85
PageB 0.57
8. Analysis
• The PageRank values converge after several iterations of the MapReduce job.
• Pages with high inbound links and high-ranked incoming links have higher PageRank
scores.
9. Conclusion
• Successfully implemented Wiki Page Ranking using Hadoop MapReduce.
• Analyzed the importance of pages using the PageRank algorithm.
12. References
• Apache Hadoop Documentation: https://hadoop.apache.org/docs/
• Hadoop: The Definitive Guide by Tom White
• https://xebia.com/blog/wiki-pagerank-with-hadoop/
Experiment No.: 15
Experiment Title: Health Care Data Management using Apache Hadoop
Ecosystem.
Date:
1. Objective:
• To explore how Apache Hadoop can be used to manage large-scale health care data
efficiently.
• To perform data processing on healthcare datasets using Hadoop tools such as HDFS,
MapReduce, Hive, and Pig.
• To perform basic data analysis, filtering, and aggregation tasks on health care data in
the Hadoop ecosystem.
2. Theory:
• Apache Hadoop: Hadoop is an open-source framework that allows for the distributed
storage and processing of large datasets. It uses HDFS (Hadoop Distributed File
System) for storage and MapReduce for processing data in a distributed manner.
• Components of the Hadoop Ecosystem:
o HDFS: A distributed file system that stores large datasets across multiple
nodes in a cluster.
o MapReduce: A programming model that enables distributed processing of
large data sets by splitting the data into smaller tasks.
o Hive: A data warehouse system that provides SQL-like queries for managing
large datasets stored in Hadoop.
o Pig: A high-level platform for processing and analyzing large datasets, using a
language called Pig Latin.
• Health Care Data: Health care data is often large, complex, and unstructured. It can
include patient information, medical records, diagnostic data, treatment data, etc.
Managing and analyzing such data can improve healthcare outcomes, optimize
processes, and support research.
• Example Data Sources:
o Patient demographic data (age, gender, location).
o Medical records (diagnosis, treatment plans).
o Health care provider data (doctors, hospitals, clinics).
o Treatment costs, insurance data, and more.
3. Requirements:
Software:
• Java Development Kit (JDK 8 or above)
• Apache Hadoop (latest stable version, e.g., Hadoop 3.x)
• Apache Hive
• Apache Pig (optional)
Hardware:
• Minimum 4 GB RAM
• Stable internet connection for downloading Hadoop packages
• At least 20 GB of free disk space for setting up Hadoop and storing the dataset
4. Procedure:
Step 1: Download and Install Hadoop
1. Download the latest version of Hadoop from the official Apache Hadoop website.
2. Edit hdfs-site.xml to configure the replication factor and the location of the data
node:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
2. Using Hive:
1. Create a Hive table named healthcare_data to represent the healthcare data stored in HDFS (the CREATE TABLE statement is not preserved in this copy; its columns mirror the Pig schema below: patient_id, age, gender, diagnosis, treatment).
2. Perform queries on the data to get insights, such as the count of patients for each
diagnosis.
Sql code
SELECT diagnosis, COUNT(*) FROM healthcare_data GROUP BY diagnosis;
3. Using Pig (optional):
1. Load the data into Pig.
Pig code
healthcare_data = LOAD '/healthcare_data' USING PigStorage(',') AS
(patient_id:int, age:int, gender:chararray, diagnosis:chararray,
treatment:chararray);
2. Perform transformations and analysis using Pig Latin, such as filtering patients with a
specific diagnosis.
Pig code
patients_with_condition = FILTER healthcare_data BY diagnosis ==
'Hypertension';
Pig code
STORE patients_with_condition INTO '/output';
5. Experiment Code:
MapReduce Example (only the class declarations and fields survive in this copy; the map() method parses each record and emits (diagnosis, age) pairs, and the reduce() method averages the ages per diagnosis):
Java code
// Mapper Class
public class HealthcareMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private Text diagnosis = new Text();
    private IntWritable age = new IntWritable();
    // map() body truncated in this copy
}

// Reducer Class
public class HealthcareReducer extends Reducer<Text, IntWritable, Text, DoubleWritable> {
    private DoubleWritable result = new DoubleWritable();
    // reduce() body truncated in this copy
}
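As a quick local check of the intended computation (average patient age per diagnosis), the same aggregation can be sketched in Python; the file name and the column order (patient_id, age, gender, diagnosis, treatment — matching the Pig schema above) are assumptions:
Python Code
import csv
from collections import defaultdict

totals = defaultdict(lambda: [0, 0])   # diagnosis -> [sum of ages, number of patients]

with open("healthcare_data.csv", newline="") as f:
    for patient_id, age, gender, diagnosis, treatment in csv.reader(f):
        totals[diagnosis][0] += int(age)
        totals[diagnosis][1] += 1

for diagnosis, (age_sum, count) in totals.items():
    print(f"{diagnosis}\t{age_sum / count:.1f}")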
6. Execution:
• After setting up Hadoop, upload the healthcare dataset to HDFS, and execute the
MapReduce job, Hive queries, or Pig scripts to perform data analysis.
7. Observations:
• The output of the MapReduce job would be the average age of patients grouped by
medical conditions.
• The Hive query will provide counts of patients diagnosed with various conditions.
• Pig scripts can be used for more complex transformations and data filtering.
8. Analysis:
• Using Hadoop, we can efficiently manage and analyze large-scale health care data,
which can provide valuable insights for healthcare providers, researchers, and policy
makers.
• The ecosystem tools (MapReduce, Hive, Pig) provide different levels of abstraction
and performance optimizations for handling health care data.
9. Conclusion:
• Hadoop successfully enabled distributed storage and processing of large health care
datasets.
• We analyzed patient data for various metrics (e.g., average age by condition) using
MapReduce, Hive, and Pig.
• The experiment demonstrates the power of the Hadoop ecosystem in managing big
data in the healthcare industry.
12. References:
• Apache Hadoop Documentation: https://hadoop.apache.org/docs/
• Apache Hive Documentation:
https://cwiki.apache.org/confluence/display/Hive/Home
• Apache Pig Documentation: https://pig.apache.org/docs/r0.17.0/
• Big Data in Healthcare by Rashmi Bansal.