
Akash Dabi 21100BTCSE09726

Experiment No.: 1
Experiment Title: Installing Hadoop, configure HDFS, Configuring Hadoop.
Date:

1. Objective
• To download, install, and configure Hadoop on a local machine or server.
• To understand different Hadoop operational modes (Standalone, Pseudo-Distributed, Fully
Distributed).
• To explore Hadoop’s start-up scripts and configuration files.

2. Theory
• Hadoop is an open-source framework that enables distributed storage and processing of large
datasets using simple programming models. It is designed to scale up from a single server to
thousands of machines.
• Hadoop Modes:
o Standalone Mode: The simplest mode, where Hadoop runs as a single Java process.
o Pseudo-Distributed Mode: Runs on a single node but mimics a fully distributed
environment by configuring Hadoop to treat the local filesystem as HDFS.
o Fully Distributed Mode: A true multi-node setup, where data is distributed across several
machines, offering full scalability and fault tolerance.
• Key Configuration Files:
o core-site.xml: Configures the HDFS and YARN address.
o hdfs-site.xml: Configures HDFS replication and storage directories.
o mapred-site.xml: Configures MapReduce framework settings.
o yarn-site.xml: Configures YARN resource management settings.
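
To make the role of these files concrete, here is a small optional Python sketch (not part of the original procedure) that parses a Hadoop *-site.xml file and prints a property value, which is handy for double-checking edits; the path /usr/local/hadoop/etc/hadoop/core-site.xml is an assumption and should be adjusted to your installation.

Python code
# check_config.py -- hypothetical helper for verifying Hadoop configuration files.
import xml.etree.ElementTree as ET

def read_hadoop_config(path):
    """Return a dict of property name -> value from a Hadoop *-site.xml file."""
    tree = ET.parse(path)
    props = {}
    for prop in tree.getroot().findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props

if __name__ == "__main__":
    conf = read_hadoop_config("/usr/local/hadoop/etc/hadoop/core-site.xml")
    print("fs.defaultFS =", conf.get("fs.defaultFS"))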

3. Requirements
• Software:
o Java Development Kit (JDK 8 or above)
o Hadoop (latest stable version, e.g., Hadoop 3.x)
• Hardware:
o Minimum 4 GB RAM
o Stable internet connection for downloading Hadoop packages

4. Procedure
1. Download Hadoop:
o Visit the Apache Hadoop official website: https://hadoop.apache.org/.
o Download the latest stable release of Hadoop (e.g., Hadoop 3.x).
2. Install Hadoop:
o Extract the downloaded Hadoop package to a desired directory (e.g., /usr/local/hadoop).
o Set environment variables in the ~/.bashrc file:

Code
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
o Source the ~/.bashrc file to apply the changes:

Code
source ~/.bashrc

Prerequisites:
The following software is required to install Hadoop 2.9.2 on Windows 11 (64-bit):
1. Download Hadoop 2.9.2
(Link: http://wwweu.apache.org/dist/hadoop/common/hadoop2.9.2/hadoop-2.9.2.tar.gz or
http://archive.apache.org/dist/hadoop/core//hadoop-2.9.2/hadoop-2.9.2.tar.gz)

2. Java JDK 1.8.0
(Link: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html)
Set up:

1) Check whether Java 1.8.0 is already installed on your system by running "javac -version".

Figure 2.1. Checking Java Version

2) If Java is not installed on your system, first install Java under "C:\JAVA".
3) Extract the file hadoop-2.9.2.tar.gz (or Hadoop-2.9.2.zip) and place it under "C:\Hadoop-2.9.2".

Figure 2.2. Installation of Hadoop



4) Set the HADOOP_HOME environment variable on Windows 11 (see Steps 1-4 below).

Figure 2.3. Setting Hadoop Environment Path

5) Set the JAVA_HOME environment variable on Windows 11 (see Steps 1-4 below).

Figure 2.4. Setting Java Environment Path

6) Next, add the Hadoop bin directory and the Java bin directory to the Path variable.

Figure 2.5. Setting Hadoop bin Directory Path

Configuration of Hadoop
1) Edit the file C:/Hadoop-2.9.2/etc/hadoop/core-site.xml, paste the XML below, and save the file.
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

2) Rename "mapred-site.xml.template" to "mapred-site.xml", edit the file C:/Hadoop-2.9.2/etc/hadoop/mapred-site.xml, paste the XML below, and save the file.
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
3) Create folder "data" under "C:\Hadoop-2.9.2"
● Create folder "datanode" under "C:\Hadoop-2.9.2\data"

● Create folder "namenode" under "C:\Hadoop-2.9.2\data



Figure 2.6. Creating folders in Hadoop

Configuration of HDFS
4) Edit the file C:\Hadoop-2.9.2/etc/hadoop/hdfs-site.xml, paste the XML below, and save the file.
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>C:\hadoop-2.9.2\data\namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>C:\hadoop-2.9.2\data\datanode</value>
</property>
</configuration>

5) Edit the file C:/Hadoop-2.9.2/etc/hadoop/yarn-site.xml, paste the XML below, and save the file.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>

</property>
</configuration>

Hadoop Configuration
1. Download the file Hadoop Configuration.zip.
2. Delete the bin folder at C:\Hadoop-2.9.2\bin and replace it with the bin folder from the downloaded Hadoop Configuration.zip.
3. Open cmd and type the command "hdfs namenode -format". You will see:

Figure 2.7. Hadoop Configuration

Testing
1. Open cmd, change directory to "C:\Hadoop-2.9.2\sbin", and type "start-all.cmd" to start the Hadoop daemons.

Figure 2.8. Testing Hadoop

2. Make sure these processes are running:

● Hadoop NameNode
● Hadoop DataNode
● YARN ResourceManager
● YARN NodeManager

Figure 2.9. Checking files running or not

3. Open: http://localhost:8088

Figure 2.10. localhost:8088

4. Open: http://localhost:9870

Figure 2.11. Localhost:9870

So, now you have successfully installed Hadoop.


3. Verify Hadoop Installation:
o Access the Hadoop NameNode UI by visiting http://localhost:9870 in a web browser.
o Access the ResourceManager UI by visiting http://localhost:8088.

5. Experiment Code
• N/A (This experiment is focused on configuration rather than coding).

6. Execution
• After configuring the files, format the namenode and start the Hadoop services.
• Verify successful installation by accessing the HDFS and YARN UIs on the specified URLs.

7. Observations
• The NameNode UI at http://localhost:9870 should display Hadoop filesystem information.
• The ResourceManager UI at http://localhost:8088 should show the status of YARN resources and
applications.

8. Analysis
• In this setup, Hadoop is configured to run in pseudo-distributed mode, simulating a distributed
environment on a single machine.
• This setup provides insights into how data is stored, managed, and processed in HDFS and how
resources are allocated in YARN.

9. Conclusion
• Successfully installed and configured Hadoop and HDFS in pseudo-distributed mode.
• Explored key configuration files and verified setup through Hadoop’s web UIs.

10. Viva Questions


1. What are the different operational modes in Hadoop?
2. What is the purpose of core-site.xml and hdfs-site.xml configuration files?
3. Why is it necessary to format the Namenode?
4. How does the ResourceManager contribute to the Hadoop ecosystem?

11. Multiple Choice Questions (MCQs)


1. What is the default file system configured in Hadoop?
a) NFS
b) HDFS
c) FTP
d) Local File System
Answer: b) HDFS
2. Which file is used to configure the Hadoop NameNode URI?
a) mapred-site.xml
b) yarn-site.xml
c) core-site.xml
d) hdfs-site.xml
Answer: c) core-site.xml
3. In which mode does Hadoop run on a single machine but simulates a fully distributed
environment?
a) Standalone Mode
b) Pseudo-Distributed Mode
c) Fully Distributed Mode
d) Cluster Mode
Answer: b) Pseudo-Distributed Mode
4. Which of the following is the default port for the Hadoop NameNode web UI?
a) 8080
b) 9870
c) 9000
d) 50070
Answer: b) 9870
5. What does the start-dfs.sh command do in Hadoop?
a) Starts Hadoop MapReduce jobs
b) Starts Hadoop's YARN ResourceManager
c) Starts Hadoop’s HDFS services (NameNode, DataNode)
d) Starts Hadoop’s job tracker
Answer: c) Starts Hadoop’s HDFS services (NameNode, DataNode)

12. References
• Apache Hadoop Documentation: https://hadoop.apache.org/docs/
• Hadoop: The Definitive Guide by Tom White

Experiment No.: 2
Experiment Title: Working on HDFS
Date:

1. Objective
• To explore basic file operations in Hadoop Distributed File System (HDFS).
• To perform tasks such as creating directories, uploading/downloading files, and managing files in
HDFS.
• To understand the architecture of HDFS and how it handles file storage and replication.

2. Theory
• HDFS (Hadoop Distributed File System): A distributed storage system in Hadoop designed to
store large datasets across multiple machines, ensuring fault tolerance and scalability.
• Replication: HDFS automatically replicates file blocks across different machines to ensure data
availability in case of node failure.

• HDFS Commands:
o hdfs dfs -mkdir <path>: Create a directory in HDFS.
o hdfs dfs -put <local_file> <hdfs_path>: Upload a file from the local system to HDFS.
o hdfs dfs -get <hdfs_file> <local_path>: Download a file from HDFS to the local system.
o hdfs dfs -ls <path>: List files and directories in HDFS.
o hdfs dfs -rm <path>: Delete files/directories in HDFS.
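
The same operations can also be scripted. Below is a minimal Python sketch (an illustration, not part of the lab procedure) that shells out to the hdfs binary with subprocess; it assumes Hadoop's bin directory is on the PATH, the services are running, and the /user/demo path and sample.txt file are placeholders.

Python code
# hdfs_ops.py -- illustrative helper that wraps the HDFS commands listed above.
import subprocess

def hdfs(*args):
    """Run an 'hdfs dfs' subcommand and return its standard output as text."""
    result = subprocess.run(["hdfs", "dfs", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

if __name__ == "__main__":
    hdfs("-mkdir", "-p", "/user/demo/input")          # create a directory
    hdfs("-put", "sample.txt", "/user/demo/input")    # upload a local file
    print(hdfs("-ls", "/user/demo/input"))            # list its contents
    hdfs("-rm", "/user/demo/input/sample.txt")        # remove the file again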

3. Requirements
• Software:
o Installed Hadoop in Pseudo-Distributed Mode
o Java Development Kit (JDK 8 or above)
• Hardware:
o Minimum 4 GB RAM

4. Procedure
1. Start Hadoop Services:
o Open a terminal and start the Hadoop services:
Code
start-dfs.sh
start-yarn.sh

2. Create a Directory in HDFS:


o Create a directory in HDFS named /user/<your-username>/input by executing the
following command:
Code
hdfs dfs -mkdir /user/<your-username>/input

3. Upload a File to HDFS:


o Upload a text file (e.g., sample.txt) from the local file system to the directory created in
HDFS:

Code
hdfs dfs -put sample.txt /user/<your-username>/input

4. List Files in HDFS:


o List the contents of the /user/<your-username>/input directory to verify the file upload:

Code
hdfs dfs -ls /user/<your-username>/input

5. Download a File from HDFS:


o Download the file sample.txt back to the local file system:
Code
hdfs dfs -get /user/<your-username>/input/sample.txt

6. Delete a File from HDFS:


o Remove the file sample.txt from HDFS:
Code
hdfs dfs -rm /user/<your-username>/input/sample.txt

7. Stop Hadoop Services:


o Once the file operations are complete, stop the Hadoop services:
Code
stop-dfs.sh
stop-yarn.sh

5. Experiment Code
• N/A (This experiment involves running HDFS commands, not writing Code).

6. Execution
• Execute the above HDFS commands one by one in the terminal and observe the output.
• Ensure that the directories are created, files are uploaded/downloaded successfully, and files are
deleted as required.

7. Observations
• The ls command in HDFS shows the file structure, similar to the local file system.
• Files uploaded to HDFS are divided into blocks and replicated according to the replication factor.
• The default replication factor is 3, meaning each file block is stored in three different locations for
fault tolerance.
8. Analysis
• HDFS simplifies managing large datasets by providing automatic file replication and fault
tolerance.

• File operations in HDFS are similar to Unix/Linux file commands, making it easier to learn for
users familiar with these systems.
• The ability to handle large files across multiple nodes without user intervention is a key feature of
HDFS.

9. Conclusion
• Successfully explored basic HDFS file operations such as creating directories,
uploading/downloading files, and deleting files.
• HDFS provides a reliable distributed file system that ensures data availability and fault tolerance.

10. Viva Questions


1. What is the default replication factor in HDFS?
2. How does HDFS ensure fault tolerance?
3. What is the command to create a directory in HDFS?
4. Why is it necessary to delete files from HDFS manually?

11. Multiple Choice Questions (MCQs)


1. What command is used to upload files to HDFS?
a) hdfs dfs -copy
b) hdfs dfs -put
c) hdfs dfs -get
d) hdfs dfs -remove
Answer: b) hdfs dfs -put
2. Which of the following is not an HDFS operation command?
a) hdfs dfs -ls
b) hdfs dfs -mkdir
c) hdfs dfs -del
d) hdfs dfs -rm
Answer: c) hdfs dfs -del
3. What is the default block size in HDFS?
a) 32 MB
b) 64 MB
c) 128 MB
d) 256 MB
Answer: c) 128 MB
4. Which of the following is used to list the contents of a directory in HDFS?
a) hdfs dfs -put
b) hdfs dfs -ls
c) hdfs dfs -mv
d) hdfs dfs -chmod
Answer: b) hdfs dfs -ls
5. What is the full form of HDFS?
a) Hadoop Distributed File System
b) Hadoop Data File Storage
c) Hadoop Distributed Framework Storage
d) Hadoop Default File System
Answer: a) Hadoop Distributed File System

12. References
• Apache Hadoop Documentation: https://hadoop.apache.org/docs/stable/hadoop-project-
dist/hadoop-hdfs/HdfsUserGuide.html
• Hadoop: The Definitive Guide by Tom White

Experiment No.: 3
Experiment Title: Running Jobs on Hadoop
Date:

1. Objective
• To execute a MapReduce job in Hadoop for parallel data processing.
• To understand how Hadoop splits and processes large datasets using the MapReduce
programming model.
• To learn how to submit a job, monitor it, and verify its output on Hadoop’s distributed file system
(HDFS).

2. Theory
• MapReduce is a programming paradigm that allows for massive scalability across hundreds or
thousands of servers in a Hadoop cluster. It breaks down data processing tasks into smaller sub-
tasks that can be executed in parallel.
o Map Phase: The input dataset is split into chunks (or blocks). Each chunk is processed in
parallel by a mapper, which transforms the input into intermediate key-value pairs.
o Reduce Phase: The output from the Map phase is grouped by keys and passed to reducers,
which consolidate the intermediate outputs into the final results.
• Input Splits: Hadoop divides input data into blocks (usually 64MB or 128MB) for processing by
map tasks.
• JobTracker & TaskTracker: JobTracker manages MapReduce jobs, and TaskTrackers execute
tasks in the cluster. In YARN (introduced in Hadoop 2.x), the ResourceManager and
NodeManager manage resources and tasks.
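
As a conceptual illustration (independent of Hadoop itself), the sketch below simulates the three stages of this model in plain, single-process Python: map each line to (word, 1) pairs, group the pairs by key as the shuffle/sort phase would, then reduce each group to a total. The sample lines are the same as the input.txt used later in this experiment.

Python code
# A toy, single-process simulation of the MapReduce data flow described above.
from collections import defaultdict

lines = [
    "Hadoop is an open-source framework",
    "Hadoop enables distributed data processing",
    "Hadoop runs MapReduce jobs",
]

# Map phase: emit intermediate (key, value) pairs.
intermediate = []
for line in lines:
    for word in line.split():
        intermediate.append((word, 1))

# Shuffle/sort phase: group values by key.
grouped = defaultdict(list)
for word, count in intermediate:
    grouped[word].append(count)

# Reduce phase: aggregate each group into a final result.
for word in sorted(grouped):
    print(word, sum(grouped[word]))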

3. Requirements
• Software:
o Java Development Kit (JDK 8 or above)
o Hadoop (version 2.x or 3.x) in pseudo-distributed mode
o Sample input text file (e.g., input.txt)
• Hardware:
o Minimum 4 GB RAM
o 20 GB free disk space

4. Procedure
1. Create a Sample Input File
o Open a text editor and create a simple text file (input.txt) containing sample data. For
instance:

Code
Hadoop is an open-source framework
Hadoop enables distributed data processing
Hadoop runs MapReduce jobs
2. Upload Data to HDFS
o Start the Hadoop services (HDFS and YARN).
Code
start-dfs.sh
start-yarn.sh

o Copy the sample input.txt file to HDFS.


Code
hdfs dfs -mkdir /user/<your-username>/input
hdfs dfs -put input.txt /user/<your-username>/input

3. Write Mapper and Reducer Classes (Java)


o Mapper: This class processes the input data and generates key-value pairs.
java
Code
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {


private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

public void map(LongWritable key, Text value, Context context) throws IOException,
InterruptedException {
String[] words = value.toString().split("\\s+");
for (String str : words) {
word.set(str);
context.write(word, one);
}
}
}

o Reducer: This class aggregates the mapper output and sums the values for each unique
key (word).
java
Code
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {


public void reduce(Text key, Iterable<IntWritable> values, Context context) throws
IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
context.write(key, new IntWritable(sum));
}
}

o Driver: This class sets up and configures the MapReduce job.

Java Code
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {


public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(WordCountMapper.class);
job.setReducerClass(WordCountReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

FileInputFormat.addInputPath(job, new Path(args[0]));


FileOutputFormat.setOutputPath(job, new Path(args[1]));

System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

4. Compile the Code


o Compile the Java Code using Hadoop’s classpath.

Code
javac -classpath `hadoop classpath` -d . WordCountMapper.java WordCountReducer.java WordCount.java
jar -cvf wordcount.jar *.class

5. Run the MapReduce Job


o Submit the MapReduce job by specifying the input directory and output directory in
HDFS:
Code
hadoop jar wordcount.jar WordCount /user/<your-username>/input /user/<your-username>/output

6. Check the Output


o Once the job is completed, view the output generated by the Reducer in HDFS:

Code
hdfs dfs -cat /user/<your-username>/output/part-r-00000

5. Experiment Code
• Full Java source Code for the Mapper, Reducer, and Driver classes has been provided in Step 3
above.
6. Execution
• Submit the job using the Hadoop command, and observe the results in the specified HDFS output
directory.
• Ensure the Hadoop services (HDFS and YARN) are running, and monitor job progress using the
Hadoop web UI at:
o ResourceManager: http://localhost:8088
o HDFS NameNode UI: http://localhost:9870

7. Observations
• The input text file is split into smaller chunks, and multiple mappers process the data in parallel.
• The Reducer aggregates the results to produce the final word count output.

8. Analysis
• MapReduce provides a powerful mechanism to process and analyze large datasets in a distributed
and fault-tolerant manner.
• The job is divided into multiple tasks (map and reduce), and these tasks are executed across the
cluster in parallel, thus significantly improving processing time for large-scale data.

9. Conclusion
• Successfully executed a MapReduce job in Hadoop.
• The results of the job were verified by analyzing the output generated in HDFS.
• The experiment provided insights into the functionality and scalability of MapReduce for
distributed data processing.

10. Viva Questions


1. What are the roles of Mapper and Reducer in Hadoop?
2. Explain the difference between the Map phase and the Reduce phase in a MapReduce job.
3. Why is HDFS used for storing data in Hadoop?
4. How does Hadoop handle the failure of a map or reduce task during a job execution?
5. What is the significance of the hdfs dfs -put command?

11. Multiple Choice Questions (MCQs)


1. In Hadoop’s MapReduce framework, the intermediate data is stored as:
a) Text files
b) Key-value pairs
c) XML files
d) Binary files
Answer: b) Key-value pairs
2. Which of the following is responsible for resource management in Hadoop’s YARN architecture?
a) TaskTracker
b) JobTracker
c) ResourceManager
d) DataNode
Answer: c) ResourceManager
3. In Hadoop, what does the FileInputFormat class do?
a) It formats the HDFS file system
b) It defines the input format for MapReduce jobs
c) It processes input files as whole blocks
d) It specifies the output format for MapReduce jobs
Answer: b) It defines the input format for MapReduce jobs
4. What is the default replication factor in HDFS?
a) 2
b) 3
c) 4
d) 5
Answer: b) 3
5. The Reducer’s output is stored in:
a) Local filesystem
b) HDFS
c) Memory
d) A relational database
Answer: b) HDFS

12. References
1. Hadoop: The Definitive Guide by Tom White
2. MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and
Other Systems by Donald Miner, Adam Shook
3. Official Apache Hadoop Documentation: https://hadoop.apache.org/

Experiment No.: 4
Experiment Title: Install Zookeeper
Date:

1. Objective
• To install and configure Apache Zookeeper.
• To understand Zookeeper’s role in distributed coordination.

2. Theory
• Apache Zookeeper: A centralized service for maintaining configuration information, naming,
providing distributed synchronization, and providing group services. It is a key component in
Hadoop, especially in managing services like HBase, Kafka, and others.
• Zookeeper ensures distributed coordination by managing shared resources and providing
mechanisms like leader election, locks, and queues.
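
As an optional illustration of ZNodes and coordination primitives (not part of this experiment's procedure), the sketch below uses the third-party Python client kazoo (pip install kazoo) and assumes a ZooKeeper server running on localhost:2181; the /demo paths are arbitrary examples.

Python code
# Illustrative only: talk to ZooKeeper from Python via the kazoo client.
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# ZNodes form a filesystem-like hierarchy; create one and read it back.
zk.ensure_path("/demo")
zk.create("/demo/config", b"replication=1")
data, stat = zk.get("/demo/config")
print(data.decode(), "version:", stat.version)

# Ephemeral, sequential znodes disappear when the session ends --
# the building block for distributed locks and leader election.
zk.create("/demo/lock-", ephemeral=True, sequence=True)
print(zk.get_children("/demo"))

zk.stop()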

3. Requirements
• Zookeeper binary package
• Java Development Kit (JDK 8 or above)

4. Procedure
1. Download and Install Zookeeper
o Download the latest version of Zookeeper from the official Zookeeper site.
o Extract the archive and move to the Zookeeper directory.
2. Configure Zookeeper
o Copy the sample configuration file:
Code
cp conf/zoo_sample.cfg conf/zoo.cfg
o Edit zoo.cfg and set the data directory:

Code
dataDir=/path/to/zookeeper/data

3. Start Zookeeper
o Start the Zookeeper server using the following command:
Code
bin/zkServer.sh start

4. Verify the Installation


o To check the status of the Zookeeper server:
Code
bin/zkServer.sh status

5. Experiment Code
• Configuration steps and Zookeeper commands (start, stop, status).

6. Execution
• Start the Zookeeper service and use the Zookeeper CLI to connect to the server.

7. Observations
• The server status should show that Zookeeper is running correctly, and you should be able to
interact with it using the CLI.

8. Analysis
• Zookeeper plays a crucial role in maintaining the coordination between distributed applications
and ensures data consistency.

9. Conclusion
• Successfully installed and configured Zookeeper, understanding its role in distributed
coordination.

10. Viva Questions


1. What are the main use cases of Zookeeper?
2. How does Zookeeper handle leader election?
3. Why is Zookeeper used in Hadoop ecosystems like HBase and Kafka?
4. What is a ZNode in Zookeeper?
5. Explain Zookeeper's role in maintaining distributed locks.

11. Multiple Choice Questions (MCQs)


1. What is the purpose of Zookeeper?
a) Data storage
b) Distributed coordination
c) Data replication
d) Load balancing
Answer: b) Distributed coordination
2. Which of the following is a service provided by Zookeeper?
a) Distributed synchronization
b) Data storage
c) Query optimization
d) Data visualization
Answer: a) Distributed synchronization
3. How does Zookeeper handle failures in a distributed system?
a) Through distributed synchronization
b) By leader election
c) By replication
d) All of the above
Answer: d) All of the above
4. What is the default port on which Zookeeper listens?
a) 8080
b) 2181
c) 9092
d) 3306
Answer: b) 2181
5. What is a ZNode?
a) A Zookeeper client
b) A node in the Zookeeper hierarchy
c) A replication node in Zookeeper
d) A log entry
Answer: b) A node in the Zookeeper hierarchy

12. References
1. Official Zookeeper Documentation: https://zookeeper.apache.org/
2. Hadoop Operations by Eric Sammer
3. ZooKeeper: Distributed Process Coordination by Flavio Junqueira, Benjamin Reed

Experiment No.: 5
Experiment Title: Pig Installation
Date:

1. Objective
• To install and configure Apache Pig.
• To execute basic Pig Latin scripts for data processing.

2. Theory
• Apache Pig: A platform for analyzing large datasets that uses a high-level language called Pig
Latin. It simplifies data processing by abstracting the complexity of MapReduce.
• Pig operates on data flow and is useful for ETL (Extract, Transform, Load) processes.
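
To make the Pig Latin dataflow used later in this experiment (LOAD, GROUP, FOREACH ... GENERATE, STORE) concrete, here is a rough plain-Python equivalent of the same word-count pipeline. It is an explanatory sketch only: it reads and writes local files (input.txt, output.csv are illustrative names) rather than HDFS.

Python code
# Rough local equivalent of the wordcount.pig script used in this experiment.
from collections import defaultdict

# LOAD ... USING PigStorage(' ') AS (word:chararray)
with open("input.txt") as f:
    words = [w for line in f for w in line.split(" ") if w.strip()]

# GROUP data BY word
grouped = defaultdict(list)
for w in words:
    grouped[w].append(w)

# FOREACH grouped_data GENERATE group, COUNT(data)
word_count = [(w, len(items)) for w, items in grouped.items()]

# STORE word_count INTO ... USING PigStorage(',')
with open("output.csv", "w") as out:
    for w, c in word_count:
        out.write(f"{w},{c}\n")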

3. Requirements
• Apache Pig binary package
• Hadoop installed and configured
• Sample data file

4. Procedure
1. Download and Install Pig

1. Prerequisites
1.1. Hadoop Cluster Installation
Apache Pig is a platform built on top of Hadoop. Refer to Experiment 1 for installing a Hadoop single-node cluster on Windows.
1.2. 7zip
7zip is needed to extract .tar.gz archives we will be downloading in this guide.
2. Downloading Apache Pig
To download the Apache Pig, you should go to the following link:
• https://downloads.apache.org/pig/

3. Setting Environment Variables


After extracting the Pig archive, go to Control Panel > System and Security > System, then click on "Advanced system settings".

In the advanced system settings dialog, click on “Environment variables” button.



Now we should add the following user variables:

• PIG_HOME: “E:\hadoop-env\pig-0.17.0”

Now, we should edit the Path user variable to add the following paths:

• %PIG_HOME%\bin.

4. Starting Apache Pig


After setting environment variables, let's try to run Apache Pig.

Open a command prompt as administrator, and execute the following command pig -version

You will receive the following exception:

To fix this error, edit the pig.cmd file located in the "pig-0.17.0\bin" directory by changing the HADOOP_BIN_PATH value from "%HADOOP_HOME%\bin" to "%HADOOP_HOME%\libexec".

Now, let's try to run the “pig -version” command again:

The simplest way to write Pig Latin statements is the Grunt shell, an interactive tool where we write a statement and get the desired output. There are two modes to invoke the Grunt shell:
1. Local: All scripts are executed on a single machine without requiring Hadoop (command: pig -x local).

2. MapReduce: Scripts are executed on a Hadoop cluster (command: pig -x mapreduce).

Since we have installed Apache Hadoop 3.2.1, which is not compatible with Pig 0.17.0, we will run Pig in local mode.

2. Write a Pig Script


o Create a Pig script (wordcount.pig) to count the occurrence of words in a file:

pig
Code
-- Load the data
data = LOAD '/user/<your-username>/input.txt' USING PigStorage(' ') AS (word:chararray);

-- Group the data by word


grouped_data = GROUP data BY word;

-- Count the occurrences of each word


word_count = FOREACH grouped_data GENERATE group, COUNT(data);

-- Store the result


STORE word_count INTO '/user/<your-username>/output' USING PigStorage(',');

3. Run the Pig Script


o Use the Pig interactive shell (Grunt) or submit the script from the terminal:

Code
pig wordcount.pig

4. Verify Output
o Check the output in the HDFS directory:

Code
hdfs dfs -cat /user/<your-username>/output/part-r-00000

5. Experiment Code
• Pig script to count the occurrence of words in an input file.

6. Execution
• Run the Pig script and verify that the word counts are correctly stored in the output directory.

7. Observations
• The Pig script simplifies data processing by abstracting MapReduce complexities.

8. Analysis
• Apache Pig is a highly efficient tool for processing large-scale datasets and can be used in ETL
processes.

9. Conclusion
• Successfully installed and configured Pig, and executed a Pig Latin script for word count.

10. Viva Questions


1. What is Pig Latin?
2. How does Pig differ from MapReduce?
3. What are some advantages of using Pig in data processing?
4. Explain the LOAD and STORE operations in Pig.
5. What are UDFs (User Defined Functions) in Pig?

11. Multiple Choice Questions (MCQs)


1. Pig scripts are written in which language?
a) Java
b) Pig Latin
c) SQL
d) Python
Answer: b) Pig Latin
2. What does PigStorage(' ') do in Pig?
a) Loads data as key-value pairs
b) Stores data as key-value pairs
c) Loads data delimited by spaces
d) Loads data as a CSV file
Answer: c) Loads data delimited by spaces
3. What is the primary use case of Apache Pig?
a) Data querying
b) Data processing and ETL
c) Data replication
d) Load balancing
Answer: b) Data processing and ETL

4. Which of the following is a feature of Pig?


a) Ability to process unstructured data
b) Built-in UDFs for custom processing
c) Supports parallel execution
d) All of the above
Answer: d) All of the above

12. References
1. Official Pig Documentation: https://pig.apache.org/
2. Programming Pig by Alan Gates
3. Hadoop: The Definitive Guide by Tom White

Experiment No.: 6
Experiment Title: Sqoop Installation
Date:

1. Objective
• To install and configure Apache Sqoop for data transfer between Hadoop and relational databases.

2. Theory
• Apache Sqoop: A tool designed for efficiently transferring bulk data between Hadoop and
relational databases such as MySQL, Oracle, or PostgreSQL.
• Sqoop can be used to import data from RDBMS into HDFS or export data from HDFS to
RDBMS.

3. Requirements
• Apache Sqoop binary package
• Relational database (e.g., MySQL)
• Hadoop installed and configured

4. Procedure
1. Download and Install Sqoop

Prerequisites:

1. Hardware Requirement
* RAM — Min. 8 GB; if you have an SSD in your system, then 4 GB RAM would also work.
* CPU — Min. Quad core, with at least 1.80GHz

2. JRE 1.8 — Offline installer for JRE


3. Java Development Kit — 1.8
4. Software for unzipping, like 7Zip or WinRAR
* I will be using 64-bit Windows for the process; please check and download the version (x86 or x64) supported by your system for all the software.

5. Hadoop
* I am using Hadoop 2.9.2; you can also use any other stable version of Hadoop.
* If you don't have Hadoop, you can refer to "Hadoop: How to install in 5 Steps in Windows 10".

6. MySQL Query Browser



7. Download SQOOP zip


* I am using SQOOP 1.4.7; you can also use any other stable version of SQOOP.

Fig 1:- Download Sqoop 1.4.7


2. Unzip and Install SQOOP
After downloading SQOOP, we need to unzip the sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz file.

Fig 2:- Extracting Sqoop Step-1

Once extracted, we get a new file, sqoop-1.4.7.bin__hadoop-2.6.0.tar.
Now, once again, we need to extract this tar file.

Fig 3:- Extracting SQOOP Step-2

• Now we can organize our SQOOP installation: create a folder and move the final extracted folder into it. For example:

Fig 4:- SQOOP Directory

• Please note: while creating folders, DO NOT ADD SPACES IN THE FOLDER NAME (it can cause issues later).

• I have placed my SQOOP in the D: drive; you can use C: or any other drive.

3. Setting Up Environment Variables

Another important step in setting up the work environment is to set your system's environment variables.

To edit environment variables, go to Control Panel > System and click on the "Advanced system settings" link.
Alternatively, we can right-click on the This PC icon, click on Properties, and then click on the "Advanced system settings" link.

Or, the easiest way is to search for Environment Variable in the search bar. 😉

Fig. 5:- Path for Environment Variable



Fig. 6:- Advanced System Settings Screen

3.1 Setting SQOOP_HOME

• Open environment Variable and click on “New” in “User Variable”



Fig. 7:- Adding Environment Variable

• On clicking “New”, we get below screen.



Fig. 8:- Adding SQOOP_HOME

• Now, as shown, add SQOOP_HOME as the variable name and the path of SQOOP as the variable value.

• Click OK, and we are half done with setting SQOOP_HOME.

3.2 Setting Path Variable

• The last step in setting environment variables is setting Path in the System Variables.

• Select the Path variable in the system variables and click on "Edit".

Fig. 10:- Adding Path

• Now we need to add this path to the Path variable:

* %SQOOP_HOME%\bin

• Click OK and OK, and we are done with setting environment variables.

Note: If you want the path to be set for all users, you need to select "New" under System Variables.

3.3 Verify the Paths

• Now we need to verify that what we have done is correct and reflected.

• Open a NEW Command Window

• Run the following command:

echo %SQOOP_HOME%

4. Configure SQOOP

Once we have configured the environment variables, the next step is to configure SQOOP. It has three parts:

4.1 Installing MySQL Database

If you have already installed MySQL or any other database like PostgreSQL, Oracle, SQL Server, or DB2, you can skip this step and move ahead.

I will be using MySQL, as SQOOP includes fast-path connectors for MySQL.

You can refer to a MySQL installation guide if needed.

4.2 Getting MySQL connector for SQOOP

Download mysql-connector-java.jar and put it in the lib folder of SQOOP.

Fig 11:- Putting MySQL jar in SQOOP lib folder



4.3 Creating Users in MySQL

The next important step in configuring SQOOP is to create users for MySQL.
These Users are used for connecting SQOOP to MySQL Database for reading
and writing data from it.

• First, we need to open MySQL Workbench and open a workspace (the default or any specific one, if you want). We will be using the default workspace for now.

Fig 12:- Open MySQL Workbench

Now open the Administration tab in the workspace and select the Users and Privileges option under Management.

Fig 13:- Opening Users and Privileges.

4.3.1 Creating SQOOP User in MySQL

• Now select the Add Account option and create a new user with Login Name sqoop, Limit to Hosts Matching set to localhost, and a password of your choice.

Fig 14:- Creating SQOOP User

• Now we have to define the roles for this user under Administrative Roles and select the DBManager, DBDesigner, and BackupAdmin roles.

Fig 15:- Assigning Roles

• Now we need to grant schema privileges for the user by using the Add Entry option and selecting the schemas we need access to.

Fig 16:- Schema Privileges

I am using the schema matching pattern %_bigdata% for all my big-data-related schemas. You can use the other two options as well.

• After clicking OK we need to select All the privileges for this schema.

Fig 17:- Select All privileges in the schema

• Click Apply and we are done with creating the SQOOP user.

2. Connect Sqoop to the Database


o Install the MySQL JDBC driver if using MySQL:

Code
wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-8.0.23.zip
cp mysql-connector-java-8.0.23/mysql-connector-java-8.0.23.jar $SQOOP_HOME/lib/
o Verify connectivity to the MySQL database:

Code
sqoop list-databases --connect jdbc:mysql://localhost:3306/ --username root --password <password>
3. Import Data from MySQL to HDFS
o Import a specific table from MySQL to HDFS:

Code
sqoop import --connect jdbc:mysql://localhost:3306/test --username root --password <password> --table employees --target-dir /user/<your-username>/employees
4. Export Data from HDFS to MySQL
o Export data from HDFS to MySQL:

Code
sqoop export --connect jdbc:mysql://localhost:3306/test --username root --password <password> --table employees --export-dir /user/<your-username>/employees
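
As an optional sanity check (not part of the Sqoop procedure itself), the sketch below compares the row count of the MySQL table with the number of records imported into HDFS. It assumes the third-party mysql-connector-python package, the test database and employees table used above, that the hdfs command is on the PATH, and that Sqoop wrote its default part-m-* files; the credentials are placeholders.

Python code
# Illustrative check that a Sqoop import copied every row.
import subprocess
import mysql.connector  # pip install mysql-connector-python

conn = mysql.connector.connect(host="localhost", user="root",
                               password="<password>", database="test")
cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM employees")
mysql_rows = cur.fetchone()[0]
conn.close()

# Count the lines Sqoop wrote under the import target directory.
out = subprocess.run(
    ["hdfs", "dfs", "-cat", "/user/<your-username>/employees/part-m-*"],
    capture_output=True, text=True, check=True).stdout
hdfs_rows = len([line for line in out.splitlines() if line.strip()])

print("MySQL rows:", mysql_rows, "HDFS rows:", hdfs_rows)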
5. Experiment Code
• Sqoop commands to import and export data between Hadoop and MySQL.
6. Execution
• Use Sqoop to import/export data between HDFS and a MySQL database.
7. Observations
• Sqoop simplifies the process of transferring large datasets between RDBMS and
Hadoop.
8. Analysis
• Sqoop is highly effective for ETL processes where data needs to be transferred in bulk
between traditional databases and Hadoop.
9. Conclusion
• Successfully installed and configured Sqoop, and performed data import/export
operations between HDFS and MySQL.
10. Viva Questions
1. What is the purpose of Apache Sqoop?
2. How do you import a table from MySQL to Hadoop using Sqoop?
3. What are some advantages of using Sqoop?
4. What is the difference between import and export in Sqoop?
5. How does Sqoop handle failures during data transfer?
11. Multiple Choice Questions (MCQs)
1. What is Apache Sqoop used for?
a) Data querying
b) Data transfer between Hadoop and RDBMS
c) Data visualization
d) Data storage
Answer: b) Data transfer between Hadoop and RDBMS
2. Which of the following commands is used to import data from MySQL to HDFS?
a) sqoop export
b) sqoop import

c) sqoop transfer
d) sqoop load
Answer: b) sqoop import
3. What does the --connect option in Sqoop do?
a) Specifies the connection string for the database
b) Specifies the input format for data
c) Specifies the output format for data
d) Specifies the username for authentication
Answer: a) Specifies the connection string for the database
4. What does Sqoop use to transfer data from RDBMS to HDFS?
a) JDBC
b) ODBC
c) RPC
d) REST API
Answer: a) JDBC
5. Sqoop is primarily used for:
a) Data querying
b) Data processing
c) Data transfer
d) Data visualization
Answer: c) Data transfer
12. References
1. Official Sqoop Documentation: https://sqoop.apache.org/
2. Hadoop: The Definitive Guide by Tom White
3. Sqoop Cookbook by Kathleen Ting, Jarek Jarcec Cecho

Experiment No.: 7
Experiment Title: HBase Installation
Date:

1. Objective
• To install and configure Apache HBase.
• To create and manage tables in HBase using the HBase shell.

2. Theory
• Apache HBase: A distributed, scalable, NoSQL database built on top of Hadoop. It is
modeled after Google’s BigTable and is used for storing large amounts of sparse data.
• HBase provides real-time read/write access to data in HDFS.
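
As an optional illustration of the column-family data model (not part of the installation steps), the sketch below uses the third-party happybase Python client; it assumes the HBase Thrift server has been started (hbase thrift start, default port 9090) and that the employees table created later in this experiment exists.

Python code
# Illustrative only: access an HBase table from Python via happybase/Thrift.
import happybase  # pip install happybase

connection = happybase.Connection("localhost")  # Thrift server on default port
table = connection.table("employees")

# Columns are addressed as b"<column_family>:<qualifier>".
table.put(b"row2", {b"personal_data:name": b"Alice",
                    b"professional_data:salary": b"60000"})

row = table.row(b"row1")
print({k.decode(): v.decode() for k, v in row.items()})

connection.close()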

3. Requirements
• Apache HBase binary package
• Hadoop installed and configured
• Zookeeper installed and running

4. Procedure
1. Download and Install HBase
Prerequisite
• Install Java JDK - You can download it from
this link. (https://www.oracle.com/java/technologies/downloads/)
The Java Development Kit (JDK) is a cross-platform software development
environment that includes tools and libraries for creating Java-based software
applications and applets.

• Download Hbase - Download Apache Hbase from this link.


(https://hbase.apache.org/downloads.html)

Steps
Step-1 (Extraction of files)
Extract all the files in C drive

Step-2 (Creating Folder)


Create folders named "hbase" and "zookeeper."

Step-3 (Deleting line in HBase.cmd)


Open hbase.cmd in any text editor.
Search for line %HEAP_SETTINGS% and remove it.

Step-4 (Add lines in hbase-env.cmd)


Now open hbase-env.cmd, which is in the conf folder, in any text editor.
Add the lines below to the file after the comment section.
set JAVA_HOME=%JAVA_HOME%
set HBASE_CLASSPATH=%HBASE_HOME%\lib\client-facing-thirdparty\*
set HBASE_HEAPSIZE=8000
set HBASE_OPTS="-XX:+UseConcMarkSweepGC" "-Djava.net.preferIPv4Stack=true"
set SERVER_GC_OPTS="-verbose:gc" "-XX:+PrintGCDetails" "-XX:+PrintGCDateStamps" %HBASE_GC_OPTS%
set HBASE_USE_GC_LOGFILE=true

set HBASE_JMX_BASE="-Dcom.sun.management.jmxremote.ssl=false" "-Dcom.sun.management.jmxremote.authenticate=false"
set HBASE_MASTER_OPTS=%HBASE_JMX_BASE% "-Dcom.sun.management.jmxremote.port=10101"
set HBASE_REGIONSERVER_OPTS=%HBASE_JMX_BASE% "-Dcom.sun.management.jmxremote.port=10102"
set HBASE_THRIFT_OPTS=%HBASE_JMX_BASE% "-Dcom.sun.management.jmxremote.port=10103"
set HBASE_ZOOKEEPER_OPTS=%HBASE_JMX_BASE% "-Dcom.sun.management.jmxremote.port=10104"
set HBASE_REGIONSERVERS=%HBASE_HOME%\conf\regionservers
set HBASE_LOG_DIR=%HBASE_HOME%\logs
set HBASE_IDENT_STRING=%USERNAME%
set HBASE_MANAGES_ZK=true

Step-5 (Add the line in Hbase-site.xml)


Open hbase-site.xml, which is in the conf folder in any text editor.

Add the lines inside the <configuration> tag.

A distributed HBase entirely relies on Zookeeper (for cluster configuration and


management). ZooKeeper coordinates, communicates and distributes state between
the Masters and RegionServers in Apache HBase. HBase's design strategy is to use
ZooKeeper solely for transient data (that is, for coordination and state
communication). Thus, removing HBase's ZooKeeper data affects only temporary
operations – data can continue to be written and retrieved to/from HBase.
<property>
<name>hbase.rootdir</name>
<value>file:///C:/Documents/hbase-2.2.5/hbase</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/C:/Documents/hbase-2.2.5/zookeeper</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>

<value>localhost</value>
</property>

Step-6 (Setting Environment Variables)


Now set up the environment variables.
Search "System environment variables."

Now click on " Environment Variables."



Then click on "New."

Variable name: HBASE_HOME


Variable Value: Put the path of the Hbase folder.
We have completed the HBase Setup on Windows 10 procedure.

2. Start HBase
o Start the HBase service:

Code
start-hbase.sh

3. Use the HBase Shell



o Launch the HBase shell to create and manage tables:

Code
hbase shell

o Create a table in HBase:

Code
create 'employees', 'personal_data', 'professional_data'

o Insert data into the table:

Code
put 'employees', 'row1', 'personal_data:name', 'John'
put 'employees', 'row1', 'professional_data:salary', '50000'

4. Query the Table


o Retrieve data from the table:

Code
get 'employees', 'row1'

5. Stop HBase
o To stop HBase, use the following command:

Code
stop-hbase.sh

5. Experiment Code
• HBase shell commands to create a table, insert data, and query the table.

6. Execution
• Use the HBase shell to create and manage a table, insert data, and perform basic
queries.

7. Observations
• HBase allows for efficient management of large-scale data with real-time read/write
access.

8. Analysis
• HBase is well-suited for use cases requiring real-time access to large datasets, such as
online applications and analytics platforms.

9. Conclusion
• Successfully installed and configured HBase, and managed data in an HBase table
using the HBase shell.

10. Viva Questions


1. What is Apache HBase?
2. How does HBase differ from traditional relational databases?
3. What is the purpose of Zookeeper in HBase?
4. What are the key advantages of using HBase?
5. Explain the concept of column families in HBase.

11. Multiple Choice Questions (MCQs)


1. What type of database is HBase?
a) SQL
b) NoSQL
c) In-memory
d) Object-oriented
Answer: b) NoSQL
2. HBase is modeled after:
a) Google BigTable
b) Amazon DynamoDB
c) MongoDB
d) CouchDB
Answer: a) Google BigTable
3. Which of the following commands creates a table in HBase?
a) CREATE TABLE
b) CREATE
c) put
d) get
Answer: b) CREATE
4. HBase tables are:
a) Stored in RDBMS
b) Stored in HDFS
c) Stored in memory
d) Stored in S3
Answer: b) Stored in HDFS

12. References
1. Official HBase Documentation: https://hbase.apache.org/
2. HBase: The Definitive Guide by Lars George
3. Hadoop: The Definitive Guide by Tom White
4. Installation Guide: https://www.naukri.com/code360/library/hbase-installation-on-windows

Experiment No.: 8
Experiment Title: Hadoop Streaming
Date:

1. Objective
• To understand Hadoop Streaming and its functionality in processing data using
custom MapReduce scripts in Python or other non-Java languages.
• To implement and run a Hadoop Streaming job using a Mapper and Reducer script
written in Python.

2. Theory
• Hadoop Streaming:
Hadoop Streaming is a utility provided by Hadoop to allow users to create and run
MapReduce jobs with any executable or script as the Mapper and Reducer. It enables
flexibility in leveraging non-Java programming languages like Python, Ruby, or Perl
for data processing.

• Key Components:
o Mapper: Processes each line of input and emits key-value pairs.
o Reducer: Aggregates the key-value pairs emitted by the Mapper and provides
the final output.
• Workflow:
1. Data is divided into splits and sent to Mappers.
2. Mappers process the data and emit key-value pairs.
3. Reducers aggregate the key-value pairs and output the final result.
• Use Cases:
o Word count analysis.
o Log file analysis.
o Custom data transformations and aggregations.
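
Because streaming mappers and reducers communicate only through standard input and output, the whole pipeline can be rehearsed locally before submitting it to Hadoop. The sketch below is an illustration (not part of the official procedure): it assumes the mapper.py, reducer.py, and sample.txt from this experiment are in the current directory on a Unix-like system, and chains them through the sort command, which stands in for Hadoop's shuffle/sort phase.

Python code
# run_local.py -- rehearse mapper.py | sort | reducer.py without Hadoop.
import subprocess

with open("sample.txt", "rb") as f:
    mapped = subprocess.run(["python3", "mapper.py"], stdin=f,
                            capture_output=True, check=True).stdout

# Hadoop's shuffle/sort groups identical keys together; plain sort emulates it.
shuffled = subprocess.run(["sort"], input=mapped,
                          capture_output=True, check=True).stdout

reduced = subprocess.run(["python3", "reducer.py"], input=shuffled,
                         capture_output=True, check=True).stdout
print(reduced.decode())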

3. Requirements
• Software:
o Hadoop (latest stable version)
o Python (for writing Mapper and Reducer scripts)
• Hardware:
o Minimum 4 GB RAM
o A system or cluster with Hadoop installed
• Dataset:
A sample text file for word count analysis (e.g., sample.txt containing text data).

4. Procedure
1. Prepare Input Data
o Create a sample text file named sample.txt:
Hadoop is a framework for distributed storage and processing of large datasets.
Streaming allows flexibility for non-Java users.
Hadoop Streaming processes data using Mapper and Reducer scripts.

2. Upload Input Data to HDFS


o Create an HDFS directory and upload the text file:
hadoop fs -mkdir /user/hadoop/streaming_input

hadoop fs -put sample.txt /user/hadoop/streaming_input/

3. Write Mapper Script (mapper.py)


o A Python script to emit words as keys and 1 as their values:

Python code
# mapper.py
import sys

for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print(f"{word}\t1")

4. Write Reducer Script (reducer.py)


o A Python script to sum up word counts:

Python code
# reducer.py
import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print(f"{current_word}\t{current_count}")
        current_word = word
        current_count = count

if current_word == word:
    print(f"{current_word}\t{current_count}")

5. Run Hadoop Streaming Job


o Use the Hadoop Streaming utility to execute the job:
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input /user/hadoop/streaming_input/sample.txt \
  -output /user/hadoop/streaming_output \
  -mapper "python3 mapper.py" \
  -reducer "python3 reducer.py" \
  -file mapper.py -file reducer.py

6. Check the Output


o View the results of the job in the HDFS output directory:

hadoop fs -ls /user/hadoop/streaming_output


hadoop fs -cat /user/hadoop/streaming_output/part-00000

5. Experiment Code
• Mapper Script:
Refer to mapper.py.
• Reducer Script:
Refer to reducer.py.

6. Execution
• Create the input file, upload it to HDFS, and run the streaming job.
• Verify the output by checking the files in the output directory.

7. Observations
• The job processes the text file, and the output displays the count of each word in the
dataset.
• The input data is divided and processed in parallel by Mappers, with Reducers
aggregating the results.

8. Analysis
• Hadoop Streaming allows non-Java developers to leverage Hadoop's distributed
processing power.
• The Mapper and Reducer scripts in Python process the data efficiently, showing
Hadoop's flexibility.

9. Conclusion
• Successfully implemented a Hadoop Streaming job using Python.
• Demonstrated the use of custom Mapper and Reducer scripts to process data.

10. Viva Questions


1. What is Hadoop Streaming, and why is it useful?
2. How does a Mapper script process input data in a Hadoop Streaming job?
3. What are the limitations of Hadoop Streaming compared to Java-based MapReduce?
4. How does the Reducer script aggregate data?
5. What are the key components required to run a Hadoop Streaming job?

11. Multiple Choice Questions (MCQs)


1. Hadoop Streaming allows users to write MapReduce jobs in:
a) Only Java
b) Only Python
c) Any language that can read and write to standard input/output

d) None of the above


Answer: c) Any language that can read and write to standard input/output
2. The output of a Mapper script is:
a) Final results of the job
b) Key-value pairs for further processing
c) Aggregate data
d) None of the above
Answer: b) Key-value pairs for further processing
3. Which command is used to run a Hadoop Streaming job?
a) start-streaming.sh
b) hadoop-streaming.py
c) hadoop jar
d) hdfs streaming
Answer: c) hadoop jar
4. The Reducer script in Hadoop Streaming:
a) Processes raw input data
b) Aggregates intermediate key-value pairs
c) Splits input data into smaller chunks
d) None of the above
Answer: b) Aggregates intermediate key-value pairs
5. In a Hadoop Streaming job, what format must Mapper and Reducer scripts output?
a) JSON
b) Tab-separated key-value pairs
c) XML
d) Plain text
Answer: b) Tab-separated key-value pairs

12. References
1. Official Hadoop Documentation: Hadoop Streaming
2. "Hadoop: The Definitive Guide" by Tom White
3. TutorialsPoint Hadoop Streaming Guide: Streaming Basics

Experiment No.: 9
Experiment Title: Creating a Mapper Function Using Python
Date:

1. Objective
• To understand the role of the Mapper in the Hadoop MapReduce framework.
• To implement a Mapper function using Python for a word count application.

2. Theory
• Mapper in MapReduce:
The Mapper function is the first phase of the MapReduce framework. It processes
input data line-by-line and produces intermediate key-value pairs that are grouped by
the Hadoop framework before being passed to the Reducer.
• Working Principle:
1. Input data is split into chunks and provided to the Mapper.
2. Each line of input is processed, and the Mapper outputs key-value pairs.
3. Hadoop's shuffle and sort phase groups all values with the same key together
for the Reducer.
• Example Use Case:
In a word count problem, the Mapper emits words as keys and 1 as the value for each
occurrence.

3. Requirements
• Software:
o Hadoop (latest stable version)
o Python (for writing the Mapper script)
• Hardware:
o Minimum 4 GB RAM
o A system or cluster with Hadoop installed
• Dataset:
A sample text file for word count analysis (e.g., sample.txt containing text data).

4. Procedure
1. Prepare Input Data
o Create a sample text file named sample.txt:
Python is a powerful programming language.
Hadoop and Python can work together.
This is an example of Mapper functionality.

2. Write the Mapper Script (mapper.py)


o A Python script to emit words as keys and 1 as their values:
Python code
# mapper.py
import sys

for line in sys.stdin:
    line = line.strip()   # Remove leading and trailing whitespace
    words = line.split()  # Split the line into words
    for word in words:
        print(f"{word}\t1")  # Output the word and a count of 1

3. Test the Mapper Script Locally


o Create an input file test_input.txt with the following content:
Python Hadoop Mapper example.

o Execute the script using the following command:


cat test_input.txt | python3 mapper.py

o Expected Output:
Python 1
Hadoop 1
Mapper 1
example. 1

4. Integrate with Hadoop MapReduce Job


o Copy the sample.txt file to HDFS:
hadoop fs -mkdir /user/hadoop/mapper_input
hadoop fs -put sample.txt /user/hadoop/mapper_input/

o Run the Mapper script in a MapReduce job using Hadoop Streaming:

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input /user/hadoop/mapper_input/sample.txt \
  -output /user/hadoop/mapper_output \
  -mapper "python3 mapper.py" \
  -reducer /bin/cat \
  -file mapper.py

5. Check the Output


o View the results of the Mapper in the HDFS output directory:
hadoop fs -ls /user/hadoop/mapper_output
hadoop fs -cat /user/hadoop/mapper_output/part-00000

5. Experiment Code
• Mapper Script:
Refer to mapper.py.

6. Execution
• Create the input file, upload it to HDFS, and run the Mapper using a Hadoop
Streaming job.
• Verify the output by checking the files in the output directory.

7. Observations
• The Mapper processes the input data and emits key-value pairs for each word.
• Output consists of words as keys and 1 as the value, indicating their occurrence in the
input dataset.

8. Analysis
• The Mapper function successfully processes input data line-by-line and emits key-
value pairs.
• This experiment highlights how Python can be used to develop custom Mapper
functions in Hadoop Streaming.

9. Conclusion
• Successfully created and executed a Mapper function in Python.
• Demonstrated the Mapper's role in Hadoop's MapReduce workflow.

10. Viva Questions


1. What is the purpose of the Mapper in the Hadoop MapReduce framework?
2. How does the Mapper emit key-value pairs?
3. Can Python be used to write Mapper scripts in Hadoop? If so, how?
4. What happens during the shuffle and sort phase after the Mapper?
5. How do you test a Mapper script locally before integrating it with Hadoop?

11. Multiple Choice Questions (MCQs)


1. What does the Mapper output in Hadoop?
a) Final results of the job
b) Key-value pairs
c) Raw input data
d) Configuration files
Answer: b) Key-value pairs
2. Which command is used to test a Mapper script locally?
a) hadoop fs -cat
b) cat <input_file> | <mapper_script>
c) hadoop jar
d) mapper.sh
Answer: b) cat <input_file> | <mapper_script>
3. In the Mapper output, what separates keys from values?
a) Space
b) Tab (\t)
c) Comma
d) Semicolon
Answer: b) Tab (\t)
4. How does Hadoop provide input data to the Mapper?
a) As a single file
b) Line-by-line from splits
c) Entire data in memory
d) None of the above
Answer: b) Line-by-line from splits
5. What role does the reducer /bin/cat play in the Hadoop Streaming job?
a) Executes the Mapper
b) Copies Mapper output to final output
c) Aggregates key-value pairs
d) Deletes temporary files
Answer: b) Copies Mapper output to final output

12. References
1. Hadoop Streaming Documentation: Apache Hadoop Streaming
2. "Hadoop: The Definitive Guide" by Tom White
3. TutorialsPoint: Hadoop Streaming Basics

Experiment No.: 10
Experiment Title: Creating Reducer Function Using Python
Date:

1. Objective
• To understand the role of the Reducer in the Hadoop MapReduce framework.
• To implement a Reducer function using Python for a word count application.

2. Theory
• Reducer in MapReduce:
The Reducer function is the second phase of the MapReduce framework. It aggregates
and processes the grouped intermediate key-value pairs produced by the Mapper to
generate final output results.
• Working Principle:
1. The output from the Mapper (key-value pairs) is shuffled and sorted, grouping
all values associated with the same key together.
2. These grouped key-value pairs are provided to the Reducer.
3. The Reducer processes each group and generates a final result.
• Example Use Case:
In a word count problem, the Reducer aggregates the counts of each word (key) and
outputs the total count for each word.
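
Because Hadoop delivers the Mapper output to each Reducer already sorted by key, the aggregation can also be written with itertools.groupby, which makes the grouping step explicit. The following is an alternative sketch of the same logic as the reducer.py used later in this experiment, not a replacement for it.

Python code
# Alternative reducer sketch: rely on sorted input and group keys explicitly.
import sys
from itertools import groupby

def parse(lines):
    # Turn "word\tcount" lines into (word, int_count) pairs, skipping bad lines.
    for line in lines:
        word, _, count = line.strip().partition("\t")
        if count.isdigit():
            yield word, int(count)

for word, pairs in groupby(parse(sys.stdin), key=lambda kv: kv[0]):
    print(f"{word}\t{sum(count for _, count in pairs)}")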

3. Requirements
• Software:
o Hadoop (latest stable version)
o Python (for writing the Reducer script)
• Hardware:
o Minimum 4 GB RAM
o A system or cluster with Hadoop installed
• Dataset:
Intermediate output from a Mapper (e.g., word and count pairs).

4. Procedure
1. Prepare Intermediate Input Data
o Create a sample intermediate input file named mapper_output.txt to simulate
the output of a Mapper:
Python 1
Hadoop 1
Python 1
Example 1
Mapper 1
Hadoop 1

2. Write the Reducer Script (reducer.py)


o A Python script to aggregate counts for each word:

Python code
# reducer.py
import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
    line = line.strip()                # Remove leading and trailing whitespace
    word, count = line.split("\t", 1)  # Split the line into word and count

    try:
        count = int(count)
    except ValueError:
        continue                       # Skip lines with invalid counts

    if current_word == word:
        current_count += count
    else:
        if current_word:
            # Write the result to standard output
            print(f"{current_word}\t{current_count}")
        current_word = word
        current_count = count

# Output the last word
if current_word == word:
    print(f"{current_word}\t{current_count}")

3. Test the Reducer Script Locally


o Execute the script using the following command:
cat mapper_output.txt | python3 reducer.py

o Expected Output:
Example 1
Hadoop 2
Mapper 1
Python 2

4. Integrate with Hadoop MapReduce Job


o Use the sample.txt file and the mapper.py script from Experiment 9.
o Copy the reducer.py script to the same directory.
5. Run the MapReduce Job Using Hadoop Streaming
o Execute the following command:
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input /user/hadoop/mapper_input/sample.txt \
  -output /user/hadoop/reducer_output \
  -mapper "python3 mapper.py" \
  -reducer "python3 reducer.py" \
  -file mapper.py -file reducer.py
6. Check the Output


o View the results of the Reducer in the HDFS output directory:
hadoop fs -ls /user/hadoop/reducer_output
hadoop fs -cat /user/hadoop/reducer_output/part-00000

5. Experiment Code
• Reducer Script:
Refer to reducer.py.

6. Execution
• Run the MapReduce job, providing the mapper.py and reducer.py scripts.
• Verify the output in the HDFS output directory.

7. Observations
• The Reducer aggregates the counts for each word received from the Mapper.
• Final output contains words and their total counts in the dataset.

8. Analysis
• The Reducer function successfully processes grouped key-value pairs and generates
aggregated results.
• This experiment highlights how Python can be used to develop custom Reducer
functions in Hadoop Streaming.

9. Conclusion
• Successfully created and executed a Reducer function in Python.
• Demonstrated the Reducer's role in Hadoop's MapReduce framework.

10. Viva Questions


1. What is the purpose of the Reducer in the Hadoop MapReduce framework?
2. How does the Reducer process the grouped key-value pairs from the Mapper?
3. Why is the shuffle and sort phase important in the MapReduce workflow?
4. Can multiple Reducers be used in a Hadoop job? If so, how are their outputs
combined?
5. How does the Hadoop framework determine when to pass data to the Reducer?

11. Multiple Choice Questions (MCQs)


1. What does the Reducer in Hadoop do?
a) Processes raw input data
b) Aggregates and processes grouped key-value pairs
c) Outputs intermediate results
d) Formats input data for the Mapper
Answer: b) Aggregates and processes grouped key-value pairs
2. In the Reducer script, what separates keys from values in the input?
a) Space
b) Tab (\t)
c) Colon
d) Comma
Answer: b) Tab (\t)
3. What is the default number of Reducers in a Hadoop job?


a) 0
b) 1
c) 2
d) Determined by the Mapper output
Answer: b) 1
4. Which command is used to test a Reducer script locally?
a) cat <mapper_output> | python3 <reducer_script>
b) hadoop fs -cat
c) hadoop jar
d) mapper.sh
Answer: a) cat <mapper_output> | python3 <reducer_script>
5. What happens if the Reducer script encounters invalid data during processing?
a) The job fails immediately
b) Hadoop retries the Reducer
c) The script skips invalid lines
d) The Reducer outputs an error message
Answer: c) The script skips invalid lines

12. References
1. Hadoop Streaming Documentation: Apache Hadoop Streaming
2. "Hadoop: The Definitive Guide" by Tom White
3. TutorialsPoint: Hadoop Reducer Basics
Experiment No.: 11
Experiment Title: Python Iterators and Generators
Date:

1. Objective
• To understand and implement Python iterators and generators.
• To explore the differences between iterators and generators in Python.
• To practice writing custom iterators and generators.
2. Theory
• Iterators: Objects in Python that implement the __iter__() and __next__() methods to iterate over a collection of items, one at a time.

• Generators: A simpler way to create iterators using the yield keyword. A generator
function is a function that returns a generator object.

Python generators offer a convenient way to build your own iterators: a generator function returns an iterator object that produces a succession of values rather than a single item, and it uses a yield statement instead of a return statement.
• Difference:
• Now, let's look at some distinctions between iterators and generators in Python:
Iterators:
o Objects that use the __next__() method to get the next value of the sequence.
o Implemented using classes.
o Every iterator is not a generator.
o Require a comparatively complex implementation of the iterator protocol, i.e., __iter__() and __next__().
o Less memory efficient.
o No local variables are used in iterators.
Generators:
o Functions that produce or yield a sequence of values using a yield statement.
o Implemented using functions.
o Every generator is an iterator.
o Simpler to code than a custom iterator, thanks to the yield statement.
o More memory efficient.
o All the local variables are stored before the yield statement.

3. Requirements
• Python 3.x
• Text editor or IDE (e.g., VS Code, PyCharm)
4. Procedure
1. Implement a Custom Iterator
o Write a class to implement an iterator that produces a sequence of even
numbers:
Python Code
class EvenNumbers:
    def __init__(self, max_number):
        self.max = max_number
        self.num = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.num <= self.max:
            even = self.num
            self.num += 2
            return even
        else:
            raise StopIteration

2. Implement a Generator
o Write a generator function to yield squares of numbers:
Python Code
def square_numbers(max_number):
    for i in range(max_number):
        yield i * i

3. Use Iterators and Generators


o Create instances of both and iterate through them:
Python Code
# Using the custom iterator
evens = EvenNumbers(10)
for num in evens:
    print(num)

# Using the generator
squares = square_numbers(5)
for square in squares:
    print(square)

4. Memory Comparison
o Compare the memory usage between an iterator and a generator using
Python's sys.getsizeof() method to demonstrate the efficiency of generators.
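A minimal sketch of such a comparison (the sequence length of 100000 is arbitrary); note that sys.getsizeof() reports only the size of the container object itself, which is why the generator stays small:
Python Code
import sys

# A list materializes every value up front; a generator produces them on demand.
squares_list = [i * i for i in range(100000)]
squares_gen = (i * i for i in range(100000))

print("List size:     ", sys.getsizeof(squares_list), "bytes")  # large
print("Generator size:", sys.getsizeof(squares_gen), "bytes")   # small and constant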
5. Experiment Code
Python Code
class EvenNumbers:
    def __init__(self, max_number):
        self.max = max_number
        self.num = 0
    def __iter__(self):
        return self

    def __next__(self):
        if self.num <= self.max:
            even = self.num
            self.num += 2
            return even
        else:
            raise StopIteration


def square_numbers(max_number):
    for i in range(max_number):
        yield i * i


evens = EvenNumbers(10)
for num in evens:
    print(num)

squares = square_numbers(5)
for square in squares:
    print(square)

6. Execution
• Run the Code in a Python environment and observe the output of both the iterator and
generator.

7. Observations
• Iterators iterate through the sequence with each __next__() call.
• Generators yield values lazily, meaning they generate the value on-demand without
storing all values in memory at once.

8. Analysis
• Generators are more memory-efficient compared to iterators, especially for large
datasets because they yield values on-demand.
• Iterators are class-based and involve defining the __iter__() and __next__() methods
explicitly, whereas generators are function-based and easier to implement.

9. Conclusion
• Python iterators and generators are powerful tools for handling sequences of data.
• Generators provide an efficient way to handle large sequences by yielding values
lazily.

10. Viva Questions


1. What is an iterator in Python?
2. What is the purpose of the __iter__() and __next__() methods in Python?
3. What is a generator in Python?
4. How are generators more memory-efficient than iterators?
5. What is the use of the yield keyword in Python?
11. Multiple Choice Questions (MCQs)


1. Which method is used to get the next value from an iterator?
a) __next__()
b) __iter__()
c) __get__()
d) __set__()
Answer: a) __next__()
2. What keyword is used to create a generator in Python?
a) return
b) def
c) yield
d) lambda
Answer: c) yield
3. Which of the following is true about generators?
a) They store the entire sequence in memory.
b) They generate values on-demand.
c) They cannot be used in a for loop.
d) They return a list.
Answer: b) They generate values on-demand.
4. What happens when a generator function reaches the end of its execution?
a) It raises StopIteration.
b) It returns None.
c) It resets the function.
d) It terminates without any exception.
Answer: a) It raises StopIteration.
5. How do you create an iterator object in Python?
a) By implementing __iter__() and __next__() methods.
b) By using the yield keyword.
c) By returning a list.
d) By using a for loop.
Answer: a) By implementing __iter__() and __next__() methods.

12. References
1. Official Python Documentation:
https://docs.python.org/3/tutorial/classes.html#iterators
2. Fluent Python by Luciano Ramalho
3. Python Cookbook by David Beazley and Brian K. Jones
4. https://www.naukri.com/code360/library/iterators-and-generators-in-python
Experiment No.: 12
Experiment Title: Twitter Data Sentiment Analysis Using Flume and Hive
Date:

1. Objective
• To collect Twitter data using Apache Flume.
• To store the collected data in HDFS.
• To analyze the Twitter data for sentiment analysis using Apache Hive.

2. Theory
• Apache Flume:
A distributed service designed to collect, aggregate, and transport large amounts of
log data into HDFS. Flume supports sources like Twitter streams through custom
configurations.

• Apache Hive:
A data warehousing tool that allows SQL-like querying on large datasets stored in
HDFS. Hive can process semi-structured and unstructured data using built-in
functions or UDFs.
• Sentiment Analysis:
The process of analyzing textual data to classify the sentiment as positive, negative, or
neutral. Twitter sentiment analysis helps gauge public opinion and trends.
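As an illustration of the keyword-based approach used later in the Hive query, a minimal Python sketch (the function name and word lists are illustrative, not part of the experiment):
Python code
# Minimal keyword-based sentiment classifier (illustrative sketch only).
def classify_sentiment(tweet: str) -> str:
    positive_words = {"good", "great", "love", "awesome"}
    negative_words = {"bad", "terrible", "hate", "awful"}
    words = set(tweet.lower().split())
    if words & positive_words:
        return "Positive"
    if words & negative_words:
        return "Negative"
    return "Neutral"

print(classify_sentiment("Hadoop is good for big data"))  # Positive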

3. Requirements
• Software:
o Apache Hadoop
o Apache Flume
o Apache Hive
o Python (for sentiment analysis scripting)
• Hardware:
o 8 GB RAM (recommended)
o Internet connection for accessing Twitter API
• Dataset:
Real-time Twitter data collected using Flume.

4. Procedure
1. Set Up Flume to Collect Twitter Data
• Install Flume:
sudo apt-get install flume-ng
• Configure Flume:
o Create a configuration file twitter.conf for Flume:
properties

twitterAgent.sources = twitterSource
twitterAgent.channels = memoryChannel
twitterAgent.sinks = hdfsSink

twitterAgent.sources.twitterSource.type = org.apache.flume.source.twitter.TwitterSource
twitterAgent.sources.twitterSource.consumerKey = <Your_Consumer_Key>
twitterAgent.sources.twitterSource.consumerSecret = <Your_Consumer_Secret>
twitterAgent.sources.twitterSource.accessToken = <Your_Access_Token>
twitterAgent.sources.twitterSource.accessTokenSecret = <Your_Access_Token_Secret>
twitterAgent.sources.twitterSource.keywords = Hadoop, DataScience, BigData

twitterAgent.channels.memoryChannel.type = memory
twitterAgent.channels.memoryChannel.capacity = 1000

twitterAgent.sinks.hdfsSink.type = hdfs
twitterAgent.sinks.hdfsSink.hdfs.path = hdfs://localhost:9000/user/hadoop/twitter_data/
twitterAgent.sinks.hdfsSink.hdfs.fileType = DataStream
twitterAgent.sinks.hdfsSink.hdfs.writeFormat = Text
twitterAgent.sinks.hdfsSink.hdfs.batchSize = 100
twitterAgent.sinks.hdfsSink.hdfs.rollSize = 0
twitterAgent.sinks.hdfsSink.hdfs.rollInterval = 600

twitterAgent.sources.twitterSource.channels = memoryChannel
twitterAgent.sinks.hdfsSink.channel = memoryChannel

• Run Flume Agent:

flume-ng agent --name twitterAgent --conf ./conf/ --conf-file twitter.conf

2. Process Data in Hive


• Load Data into Hive:
o Create a table in Hive:
Sql code
CREATE EXTERNAL TABLE twitter_data (
tweet STRING
)
STORED AS TEXTFILE
LOCATION '/user/hadoop/twitter_data/';
o Load the collected Twitter data:
Sql code
LOAD DATA INPATH '/user/hadoop/twitter_data/' INTO TABLE twitter_data;

• Perform Sentiment Analysis:


o Classify each tweet with a simple keyword-based CASE expression in Hive (a dedicated sentiment UDF could be plugged in for more accurate scoring):

Sql code
SELECT tweet,
CASE
WHEN tweet LIKE '%good%' THEN 'Positive'
WHEN tweet LIKE '%bad%' THEN 'Negative'
ELSE 'Neutral'
END AS sentiment
FROM twitter_data;

3. Visualize Results
Export the analyzed data to a local file for visualization:
code
hive -e "INSERT OVERWRITE LOCAL DIRECTORY
'/local/sentiment_results/'
SELECT sentiment, COUNT(*) AS count
FROM twitter_data
GROUP BY sentiment;"

Use Python with Matplotlib to create a pie chart for sentiment distribution:
Python code
import matplotlib.pyplot as plt

# Example data
labels = ['Positive', 'Negative', 'Neutral']
sizes = [45, 25, 30] # Example values, replace with actual counts
colors = ['gold', 'lightcoral', 'lightskyblue']
explode = (0.1, 0, 0)

plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=140)

plt.axis('equal')
plt.title("Sentiment Analysis of Twitter Data")
plt.show()
5. Experiment Code
• Flume Configuration: twitter.conf.
• Hive Query: Sentiment analysis using SQL.
• Python Code: Sentiment distribution visualization.

6. Execution
1. Start Flume to collect Twitter data.
2. Load and query the data using Hive.
3. Visualize the sentiment distribution using Python.

7. Observations
• Flume successfully collected Twitter data into HDFS.
• Hive processed the data, extracting insights on public sentiment.
• Visualization provided a clear representation of sentiment distribution.

8. Analysis
• Flume integrates seamlessly with Hadoop for real-time data ingestion.
• Hive's SQL-like interface simplifies sentiment analysis on large datasets.

9. Conclusion
• Successfully conducted Twitter sentiment analysis using Flume and Hive.
• Demonstrated real-time data collection, processing, and visualization.

10. Viva Questions


1. What is the role of Flume in Hadoop?
2. How does Hive process semi-structured data?
3. What is sentiment analysis, and why is it important?
4. Why do we use external tables in Hive for this experiment?
5. How can UDFs enhance data analysis in Hive?

11. Multiple Choice Questions (MCQs)


1. Which tool is used to collect real-time Twitter data in this experiment?
a) Hive
b) Flume
c) Sqoop
d) Pig
Answer: b) Flume
2. What type of data does Flume ingest into Hadoop?
a) Structured
b) Unstructured
c) Semi-structured
d) All of the above
Answer: d) All of the above
3. What is the primary function of Hive in the Hadoop ecosystem?
a) Data ingestion
b) SQL-like querying of large datasets
c) Distributed storage
d) Workflow scheduling
Answer: b) SQL-like querying of large datasets
4. What is the output of the Hive query in sentiment analysis?
a) Structured data
b) Key-value pairs
c) Aggregated sentiment labels
d) JSON data
Answer: c) Aggregated sentiment labels
5. Which Python library is used for visualizing sentiment distribution?
a) pandas
b) matplotlib
c) numpy
d) seaborn
Answer: b) matplotlib

12. References
1. Apache Flume Documentation
2. Apache Hive Documentation
3. "Mining the Social Web" by Matthew A. Russell
Experiment No.: 13
Experiment Title: Business Insights of User Usage Records of Data Cards
Date:

1. Objective
• To analyze the usage patterns of data cards by users.
• To derive business insights that can help improve customer experience and optimize
network usage.

2. Theory
• Big Data in Telecom: Telecom companies generate vast amounts of data through user
interactions, including data usage, call records, and location information. Analyzing
this data helps in improving services, detecting patterns, and creating personalized
user experiences.
• Data Cards: Portable modems or mobile hotspots that allow users to access the
internet over a cellular network. Tracking the data card usage patterns provides
insights into peak usage times, customer preferences, and network performance.
• Business Insights: By analyzing the data usage records, telecom companies can
identify trends in network usage, optimize resources, and provide tailored service
packages to customers.

3. Requirements
• Hadoop cluster (local or distributed)
• Hive or Pig for querying and analysis
• Data usage logs (containing details like user ID, timestamp, data usage, location, etc.)

4. Procedure
1. Collect Data Usage Records
o Gather the data usage records for data cards, containing fields like user ID,
timestamp, data usage (MB/GB), and location.
2. Store Data in HDFS
o Store the collected data in HDFS for distributed storage and processing.
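o For example (the local CSV path is illustrative; the HDFS directory matches the Hive table LOCATION used in the next step):
hadoop fs -mkdir -p /user/hive/data_usage
hadoop fs -put /path/to/data_usage.csv /user/hive/data_usage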
3. Create Hive Tables
o Create tables in Hive to store and query the data:
Sql Code
CREATE EXTERNAL TABLE data_usage (
user_id STRING,
timestamp STRING,
data_used DOUBLE,
location STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'hdfs://localhost:9000/user/hive/data_usage';

4. Query Data for Insights


o Run SQL-like queries in Hive to analyze data usage patterns:

Sql Code
-- Find total data usage per user


SELECT user_id, SUM(data_used) AS total_usage
FROM data_usage
GROUP BY user_id;

-- Identify peak usage times


SELECT HOUR(FROM_UNIXTIME(CAST(timestamp AS BIGINT))) AS
hour_of_day,
SUM(data_used) AS total_usage
FROM data_usage
GROUP BY HOUR(FROM_UNIXTIME(CAST(timestamp AS BIGINT)))
ORDER BY total_usage DESC;

-- Analyze data usage by location


SELECT location, SUM(data_used) AS total_usage
FROM data_usage
GROUP BY location
ORDER BY total_usage DESC;

5. Generate Business Insights


o Use the results from the queries to generate insights such as:
▪ Which users consume the most data?
▪ What are the peak hours for network usage?
▪ Which regions are consuming the most data?
▪ How can network resources be optimized based on usage patterns?
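o As a sketch, the first of these questions maps directly onto a query like the one below (building on the queries above; the LIMIT of 10 is arbitrary):
Sql Code
-- Top 10 users by total data usage
SELECT user_id, SUM(data_used) AS total_usage
FROM data_usage
GROUP BY user_id
ORDER BY total_usage DESC
LIMIT 10;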
5. Experiment Code
• Hive queries for analyzing the data usage records as shown in the procedure.

6. Execution
• Execute the Hive queries to gather insights into user data card usage and network
performance.

7. Observations
• The total data usage per user provides insights into customer demand for data
services.
• Peak usage times indicate when the network experiences the most traffic.
• Data usage by location helps identify where network performance needs to be
optimized or expanded.

8. Analysis
• The results provide valuable business insights that help telecom companies improve
network services, offer targeted promotions, and manage network congestion during
peak hours.

9. Conclusion
• Successfully analyzed user usage records of data cards and derived business insights
related to peak usage times, high-demand users, and geographical data usage patterns.

10. Viva Questions


1. What are data cards, and how are they used?
2. How can analyzing data card usage help businesses?


3. What is the purpose of querying data usage patterns?
4. Why is it important to identify peak usage times?
5. How does location-based data usage analysis benefit telecom companies?
11. Multiple Choice Questions (MCQs)
1. What is the main advantage of analyzing data card usage?
a) Reducing storage costs
b) Improving customer service
c) Identifying market trends
d) Providing free internet
Answer: b) Improving customer service
2. Which tool is typically used for querying large datasets in Hadoop?
a) HDFS
b) Hive
c) Flume
d) Pig
Answer: b) Hive
3. What does the term “peak usage time” refer to?
a) The time when users recharge their data cards
b) The time when the network experiences the most traffic
c) The time when the network is down for maintenance
d) The time when users deactivate their data cards
Answer: b) The time when the network experiences the most traffic
4. How does location-based data analysis benefit telecom businesses?
a) Helps in identifying network issues in specific areas
b) Provides free internet to high-usage areas
c) Reduces the cost of data plans
d) Automatically disconnects inactive users
Answer: a) Helps in identifying network issues in specific areas
5. Which field in the data usage records is necessary for calculating peak usage times?
a) user_id
b) timestamp
c) data_used
d) location
Answer: b) timestamp

12. References
1. Official Hive Documentation: https://cwiki.apache.org/confluence/display/Hive/Home
2. Big Data for Dummies by Judith Hurwitz, Alan Nugent, Fern Halper, and Marcia
Kaufman
3. Big Data Analytics with Hadoop by Bhushan Lakhe
Experiment No.: 14
Experiment Title: Wiki Page Ranking with Hadoop
Date:

1. Objective
• To perform Wiki page ranking using Hadoop and analyze the rank of pages.
• To implement Hadoop MapReduce for distributed processing of large-scale data.
• To understand how MapReduce can be used for ranking algorithms like PageRank.

2. Theory
• PageRank Algorithm: A ranking algorithm used to measure the importance of web
pages by considering the number and quality of links to a page.
• Hadoop Ecosystem: Enables distributed computation and storage for big data
analysis. Hadoop uses HDFS for storage and MapReduce for processing.
• MapReduce: A programming model for processing and generating large datasets with
a parallel, distributed algorithm. It consists of two steps:
1. Map Phase: Processes input data and generates intermediate key-value pairs.
2. Reduce Phase: Aggregates these intermediate values to produce the final
output.

3. Requirements
Software:
• Hadoop (latest stable version, e.g., Hadoop 3.x)
• Java Development Kit (JDK 8 or above)
Hardware:
• Minimum 4 GB RAM
• Multi-core processor
• Stable internet connection

4. Procedure
4.1 Setup and Data Preparation
1. Download and Install Hadoop:
o Follow the steps mentioned in Experiment No. 1 for setting up Hadoop in
pseudo-distributed mode.
2. Prepare Dataset:
o Use a sample dataset of Wiki pages with link relationships.
o Format: <source_page> <destination_page>
o Place the dataset in HDFS:
hdfs dfs -mkdir /wiki_data
hdfs dfs -put /path/to/wiki_dataset.txt /wiki_data

4.2 Writing MapReduce Program


4.2.1 Mapper Class:
• Parses each input line of the form <source_page> <destination_page>.
• Emits the destination page as the key and the rank contribution it receives from the source page as the value. In this simplified single-iteration version, every inbound link contributes a fixed share of 1.0, so the Reducer can parse the value as a number.
Code Snippet:
Java code
public class PageRankMapper extends Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] tokens = value.toString().split("\\s+");
        if (tokens.length < 2) {
            return; // skip malformed lines
        }
        String destPage = tokens[1];
        // Simplified: each inbound link passes a fixed rank contribution of 1.0.
        // A full PageRank job would divide the source page's current rank by its
        // number of outgoing links and emit that share instead.
        context.write(new Text(destPage), new Text("1.0"));
    }
}

4.2.2 Reducer Class:


• Sums the rank contributions arriving at each page and calculates the updated rank for that page using the formula:
New Rank = (1 - d) + d * (Sum of Ranks from Inbound Links)
where d is the damping factor, typically set to 0.85.
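For example, a page that receives rank contributions of 0.5 and 0.25 from its inbound links gets a new rank of 0.15 + 0.85 * (0.5 + 0.25) = 0.7875 with d = 0.85.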
Code Snippet:
Java code
public class PageRankReducer extends Reducer<Text, Text, Text, Text> {
public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
double sumRanks = 0.0;
for (Text value : values) {
sumRanks += Double.parseDouble(value.toString());
}
double newRank = 0.15 + 0.85 * sumRanks;
context.write(key, new Text(String.valueOf(newRank)));
}
}

4.2.3 Driver Class:


• Sets up and runs the MapReduce job.
Code Snippet:
Java code
public class WikiPageRankDriver {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Wiki PageRank");
job.setJarByClass(WikiPageRankDriver.class);
job.setMapperClass(PageRankMapper.class);
job.setReducerClass(PageRankReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

4.3 Running the MapReduce Job


1. Compile the program and package it into a JAR file:
Code
javac -classpath `hadoop classpath` -d . PageRankMapper.java PageRankReducer.java WikiPageRankDriver.java
jar -cvf pagerank.jar *.class

2. Run the MapReduce job:

hadoop jar pagerank.jar WikiPageRankDriver /wiki_data /output

5. Experiment Code
• See the Mapper, Reducer, and Driver code snippets above.

6. Execution
1. Upload the dataset to HDFS.
2. Execute the MapReduce job.
3. Verify the output by checking the HDFS directory:
hdfs dfs -ls /output

7. Observations
• The output will contain pages with their updated ranks.
• Example output format:
PageA 0.85
PageB 0.57

8. Analysis
• The PageRank values converge after several iterations of the MapReduce job.
• Pages with high inbound links and high-ranked incoming links have higher PageRank
scores.

9. Conclusion
• Successfully implemented Wiki Page Ranking using Hadoop MapReduce.
• Analyzed the importance of pages using the PageRank algorithm.

10. Viva Questions


1. What is the purpose of the damping factor in the PageRank algorithm?
2. How is HDFS used in this experiment?
3. Why is MapReduce suitable for PageRank computation?
4. Explain the role of the Mapper and Reducer in this experiment.

11. Multiple Choice Questions (MCQs)


1. Which component of Hadoop stores the Wiki dataset?
a) YARN
b) HDFS
c) MapReduce
d) Hive
Answer: b) HDFS
2. What is the damping factor usually set to in the PageRank algorithm?
a) 0.25
b) 0.50
c) 0.85
d) 1.0
Answer: c) 0.85
3. What is the key output of the MapReduce job in this experiment?


a) Dataset format changes
b) PageRank scores of pages
c) Link relationships
d) Data replication
Answer: b) PageRank scores of pages
4. How is the initial rank of a page set in the PageRank algorithm?
a) Based on the number of links
b) Randomly assigned
c) All pages start with equal rank
d) Assigned manually
Answer: c) All pages start with equal rank
5. Which file format is commonly used for MapReduce input?
a) JSON
b) XML
c) Text
d) Binary
Answer: c) Text

12. References
• Apache Hadoop Documentation: https://hadoop.apache.org/docs/
• Hadoop: The Definitive Guide by Tom White
• https://xebia.com/blog/wiki-pagerank-with-hadoop/
Experiment No.: 15
Experiment Title: Health Care Data Management using Apache Hadoop
Ecosystem.
Date:

1. Objective:
• To explore how Apache Hadoop can be used to manage large-scale health care data
efficiently.
• To perform data processing on healthcare datasets using Hadoop tools such as HDFS,
MapReduce, Hive, and Pig.
• To perform basic data analysis, filtering, and aggregation tasks on health care data in
the Hadoop ecosystem.

2. Theory:
• Apache Hadoop: Hadoop is an open-source framework that allows for the distributed
storage and processing of large datasets. It uses HDFS (Hadoop Distributed File
System) for storage and MapReduce for processing data in a distributed manner.
• Components of the Hadoop Ecosystem:
o HDFS: A distributed file system that stores large datasets across multiple
nodes in a cluster.
o MapReduce: A programming model that enables distributed processing of
large data sets by splitting the data into smaller tasks.
o Hive: A data warehouse system that provides SQL-like queries for managing
large datasets stored in Hadoop.
o Pig: A high-level platform for processing and analyzing large datasets, using a
language called Pig Latin.
• Health Care Data: Health care data is often large, complex, and unstructured. It can
include patient information, medical records, diagnostic data, treatment data, etc.
Managing and analyzing such data can improve healthcare outcomes, optimize
processes, and support research.
• Example Data Sources:
o Patient demographic data (age, gender, location).
o Medical records (diagnosis, treatment plans).
o Health care provider data (doctors, hospitals, clinics).
o Treatment costs, insurance data, and more.

3. Requirements:
Software:
• Java Development Kit (JDK 8 or above)
• Apache Hadoop (latest stable version, e.g., Hadoop 3.x)
• Apache Hive
• Apache Pig (optional)
Hardware:
• Minimum 4 GB RAM
• Stable internet connection for downloading Hadoop packages
• At least 20 GB of free disk space for setting up Hadoop and storing the dataset

4. Procedure:
Step 1: Download and Install Hadoop
1. Download the latest version of Hadoop from the official Apache Hadoop website.
2. Extract the downloaded file to a directory (e.g., /usr/local/hadoop).


3. Set environment variables in the .bashrc or .bash_profile file:
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

4. Source the .bashrc file:


source ~/.bashrc

Step 2: Configure Hadoop


1. Edit core-site.xml to configure the Hadoop temporary directory and the default
filesystem (HDFS):
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

2. Edit hdfs-site.xml to configure the replication factor and the location of the data
node:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

3. Edit mapred-site.xml to specify YARN as the framework for MapReduce:


<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

4. Edit yarn-site.xml to configure the ResourceManager:


<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>

Step 3: Start Hadoop Services


• Format the NameNode:
hdfs namenode -format
• Start HDFS and YARN services:
start-dfs.sh
start-yarn.sh
Step 4: Upload Health Care Data to HDFS


1. First, create a directory in HDFS to store health care data:
hadoop fs -mkdir /healthcare_data
2. Upload the health care data (for example, a CSV file with patient records) to HDFS:
hadoop fs -put /path/to/healthcare_data.csv /healthcare_data
Step 5: Perform Data Analysis Using MapReduce, Hive, and Pig
1. Using MapReduce:
• Create a MapReduce program to analyze the healthcare dataset, such as calculating
the average age of patients based on medical conditions.
Example steps:
1. Implement a MapReduce job with a Mapper class that processes each patient record
(e.g., extracting age and medical condition).
2. Implement a Reducer class to calculate the average age for each medical condition.
3. Run the MapReduce job using the Hadoop command:
hadoop jar healthcare-job.jar /healthcare_data /output

2. Using Hive:
1. Create a Hive table to represent the healthcare data stored in HDFS.
Sql code

CREATE EXTERNAL TABLE healthcare_data (


patient_id INT,
age INT,
gender STRING,
diagnosis STRING,
treatment STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/healthcare_data';

2. Perform queries on the data to get insights, such as the count of patients for each
diagnosis.
Sql code
SELECT diagnosis, COUNT(*) FROM healthcare_data GROUP BY diagnosis;
3. Using Pig (optional):
1. Load the data into Pig.
Pig code
healthcare_data = LOAD '/healthcare_data' USING PigStorage(',') AS
(patient_id:int, age:int, gender:chararray, diagnosis:chararray,
treatment:chararray);

2. Perform transformations and analysis using Pig Latin, such as filtering patients with a
specific diagnosis.
Pig code
patients_with_condition = FILTER healthcare_data BY diagnosis ==
'Hypertension';

3. Store the output of the analysis.


Pig code
STORE patients_with_condition INTO '/output';

5. Experiment Code:
MapReduce Example:
Java code
// Mapper Class
public class HealthcareMapper extends Mapper<LongWritable, Text, Text,
IntWritable> {
private Text diagnosis = new Text();
private IntWritable age = new IntWritable();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        if (fields.length > 3) {
            try {
                age.set(Integer.parseInt(fields[1].trim())); // age field
            } catch (NumberFormatException e) {
                return; // skip header rows or records with a non-numeric age
            }
            diagnosis.set(fields[3]);
            context.write(diagnosis, age);
        }
    }
}

// Reducer Class
public class HealthcareReducer extends Reducer<Text, IntWritable, Text,
DoubleWritable> {
private DoubleWritable result = new DoubleWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
int sum = 0, count = 0;
for (IntWritable val : values) {
sum += val.get();
count++;
}
result.set(sum / (double) count); // calculating average age
context.write(key, result);
}
}
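
The listing above omits the driver class. A minimal sketch is shown below (the class name HealthcareDriver and job name are assumptions, following the pattern of the Wiki PageRank driver in the previous experiment):
Java code
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver Class (sketch; wires together the Mapper and Reducer above)
public class HealthcareDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Average Age by Diagnosis");
        job.setJarByClass(HealthcareDriver.class);
        job.setMapperClass(HealthcareMapper.class);
        job.setReducerClass(HealthcareReducer.class);
        // The Mapper emits IntWritable values while the Reducer emits DoubleWritable
        // values, so the map output classes must be set explicitly.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}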

Hive Query Example:


Sql code
CREATE EXTERNAL TABLE healthcare_data (
patient_id INT,
age INT,
gender STRING,
diagnosis STRING,
treatment STRING
)
ROW FORMAT DELIMITED


FIELDS TERMINATED BY ','
LOCATION '/healthcare_data';

-- Query to count patients by diagnosis


SELECT diagnosis, COUNT(*) FROM healthcare_data GROUP BY
diagnosis;

6. Execution:
• After setting up Hadoop, upload the healthcare dataset to HDFS, and execute the
MapReduce job, Hive queries, or Pig scripts to perform data analysis.

7. Observations:
• The output of the MapReduce job would be the average age of patients grouped by
medical conditions.
• The Hive query will provide counts of patients diagnosed with various conditions.
• Pig scripts can be used for more complex transformations and data filtering.

8. Analysis:
• Using Hadoop, we can efficiently manage and analyze large-scale health care data,
which can provide valuable insights for healthcare providers, researchers, and policy
makers.
• The ecosystem tools (MapReduce, Hive, Pig) provide different levels of abstraction
and performance optimizations for handling health care data.

9. Conclusion:
• Hadoop successfully enabled distributed storage and processing of large health care
datasets.
• We analyzed patient data for various metrics (e.g., average age by condition) using
MapReduce, Hive, and Pig.
• The experiment demonstrates the power of the Hadoop ecosystem in managing big
data in the healthcare industry.

10. Viva Questions:


1. What is the role of HDFS in the Hadoop ecosystem?
2. How does MapReduce process large datasets?
3. How can Hadoop tools like Hive and Pig simplify data analysis tasks?
4. How do you handle health care data in a distributed manner using Hadoop?

11. Multiple Choice Questions (MCQs):


1. Which Hadoop component is responsible for distributed storage? a) MapReduce
b) HDFS
c) Hive
d) Pig
Answer: b) HDFS
2. In which format does Pig store data by default? a) CSV
b) JSON
c) Parquet
d) PigStorage
Answer: d) PigStorage
3. Which tool is best suited for SQL-like queries in Hadoop? a) MapReduce


b) Pig
c) Hive
d) Spark
Answer: c) Hive
4. What is the main advantage of using Hadoop for health care data management? a)
Centralized data processing
b) High scalability and fault tolerance
c) Low-cost hardware
d) Easy-to-use interface
Answer: b) High scalability and fault tolerance

12. References:
• Apache Hadoop Documentation: https://hadoop.apache.org/docs/
• Apache Hive Documentation:
https://cwiki.apache.org/confluence/display/Hive/Home
• Apache Pig Documentation: https://pig.apache.org/docs/r0.17.0/
• Big Data in Healthcare by Rashmi Bansal.