
Big Data Lab


2. Run a basic word count Map Reduce program to understand the Map Reduce paradigm

Mapper Function (map.py):

#!/usr/bin/env python
import sys

# Read input lines from stdin
for line in sys.stdin:
    # Remove leading and trailing whitespace
    line = line.strip()
    # Split the line into words
    words = line.split()
    # Emit each word with a count of 1
    for word in words:
        print(f"{word}\t1")

Reducer Function (reduce.py):


#!/usr/bin/env python
import sys

current_word = None
current_count = 0

# Read key-value pairs from stdin
for line in sys.stdin:
    # Strip the input
    line = line.strip()
    # Parse the input we got from map.py
    word, count = line.split('\t', 1)
    # Convert count to an integer
    try:
        count = int(count)
    except ValueError:
        # If count is not a number, ignore this line
        continue
    # If the current word is the same as the previous word, increment its count
    if current_word == word:
        current_count += count
    else:
        # If the current word is different, output the previous word's count
        if current_word:
            print(f"{current_word}\t{current_count}")
        current_word = word
        current_count = count

# Output the last word's count
if current_word == word:
    print(f"{current_word}\t{current_count}")

Driver Script (run_wc.py):

#!/usr/bin/env python
import sys
import subprocess

if __name__ == "__main__":
    # Check the number of arguments
    if len(sys.argv) != 2:
        print("Usage: run_wc.py <input_file>")
        sys.exit(1)

    input_file = sys.argv[1]

    # Run the MapReduce pipeline: map -> sort -> reduce.
    # The sort stage is needed because reduce.py expects its input
    # grouped by key, just as it would be on a real Hadoop cluster.
    map_process = subprocess.Popen(["python", "map.py"],
                                   stdin=open(input_file),
                                   stdout=subprocess.PIPE)
    sort_process = subprocess.Popen(["sort"],
                                    stdin=map_process.stdout,
                                    stdout=subprocess.PIPE)
    reduce_process = subprocess.Popen(["python", "reduce.py"],
                                      stdin=sort_process.stdout)

    # Wait for the processes to finish
    map_process.wait()
    sort_process.wait()
    reduce_process.wait()

To run the word count program:

1. Save the mapper and reducer scripts as map.py and reduce.py, respectively.
2. Save the driver script as run_wc.py.
3. Make sure all scripts are in the same directory.
4. Open a terminal and navigate to the directory containing the scripts.
5. Run the command: python run_wc.py input.txt, where input.txt is the text file you want to perform word count on.

Replace input.txt with the path to your input text file. The program will
read the input file, perform word counting using the MapReduce
paradigm, and output the word counts.
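As a quick illustration with a made-up input file: if input.txt contains the single line

hello world hello

the mapper emits hello 1, world 1, hello 1; after the sort step groups identical words together, the reducer produces:

hello	2
world	1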

Keep in mind that this example is quite basic and does not take
advantage of the full power and scalability of MapReduce as implemented
in distributed computing frameworks like Apache Hadoop. It serves as a
starting point to understand the fundamental concept of MapReduce.
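If you later want to run the same map.py and reduce.py on a Hadoop cluster, they can typically be reused unchanged with Hadoop Streaming. The sketch below is only indicative: the streaming jar location and the HDFS input/output paths are placeholders that depend on your installation.

hadoop jar /path/to/hadoop-streaming.jar \
    -input /user/your_username/input.txt \
    -output /user/your_username/wc_output \
    -mapper map.py \
    -reducer reduce.py \
    -file map.py -file reduce.py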

5. Run the Pig Latin Scripts to find Word Count

-- Load the input data

lines = LOAD 'input.txt' USING PigStorage() AS (line:chararray);

-- Tokenize each line into words

words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Group the words and count them

word_count = GROUP words BY word;

word_count_final = FOREACH word_count GENERATE group AS word, COUNT(words) AS count;

-- Store the word count result

STORE word_count_final INTO 'word_count_output' USING PigStorage();

-- Display the word count result

DUMP word_count_final;
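A note on running the script: assuming the statements above are saved to a file such as wordcount.pig (the file name is just an example), it can be run in local mode with:

pig -x local wordcount.pig

Omitting -x local runs the same script against your Hadoop cluster instead.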

4. Implement matrix multiplication with Hadoop Map Reduce


import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MatrixMultiplication {

    // Dimension of the (square) matrices; adjust this to match your input data
    public static final int MatrixSize = 3;

    public static class MapperClass extends Mapper<LongWritable, Text, Text, Text> {

        // Map input value: one line of the form matrixName \t row \t col \t value
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Split the input into parts
            String[] parts = value.toString().split("\t");
            String matrixName = parts[0];
            int row = Integer.parseInt(parts[1]);
            int col = Integer.parseInt(parts[2]);
            int val = Integer.parseInt(parts[3]);

            if (matrixName.equals("A")) {
                // Emit intermediate key-value pairs for matrix A:
                // A[row][col] contributes to every output cell (row, k)
                for (int k = 0; k < MatrixSize; k++) {
                    context.write(new Text(row + "," + k),
                                  new Text(matrixName + "," + col + "," + val));
                }
            } else if (matrixName.equals("B")) {
                // Emit intermediate key-value pairs for matrix B:
                // B[row][col] contributes to every output cell (i, col)
                for (int i = 0; i < MatrixSize; i++) {
                    context.write(new Text(i + "," + col),
                                  new Text(matrixName + "," + row + "," + val));
                }
            }
        }
    }

    public static class ReducerClass extends Reducer<Text, Text, Text, IntWritable> {

        // Reduce input: key -> "row,col", values -> list of A[i][k] and B[k][j] entries
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            int[] aRow = new int[MatrixSize];
            int[] bCol = new int[MatrixSize];

            for (Text val : values) {
                String[] parts = val.toString().split(",");
                String matrixName = parts[0];
                int index = Integer.parseInt(parts[1]);
                int value = Integer.parseInt(parts[2]);
                if (matrixName.equals("A")) {
                    aRow[index] = value;
                } else if (matrixName.equals("B")) {
                    bCol[index] = value;
                }
            }

            // Compute the dot product of aRow and bCol
            int result = 0;
            for (int i = 0; i < MatrixSize; i++) {
                result += aRow[i] * bCol[i];
            }
            context.write(key, new IntWritable(result));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "matrix multiplication");
        job.setJarByClass(MatrixMultiplication.class);
        job.setMapperClass(MapperClass.class);
        job.setReducerClass(ReducerClass.class);
        // Mapper output types (Text, Text) differ from the final output types,
        // so they must be declared explicitly
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
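For reference, the mapper above expects each input line in the tab-separated form matrixName, row, column, value. A made-up sample input for two 2x2 matrices (with MatrixSize set to 2) could look like:

A	0	0	1
A	0	1	2
A	1	0	3
A	1	1	4
B	0	0	5
B	0	1	6
B	1	0	7
B	1	1	8

Each output record is then one cell of the product matrix, keyed by "row,column".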

6. Run the Pig Latin script to find the maximum temperature for each year

-- Load the temperature data
temperature_data = LOAD 'temperature_data.txt' USING PigStorage(',') AS (date:chararray, temperature:double);

-- Extract the year from the date
temperature_data_with_year = FOREACH temperature_data GENERATE SUBSTRING(date, 0, 4) AS year, temperature;

-- Group by year and find the max temperature for each year
max_temp_per_year = GROUP temperature_data_with_year BY year;
max_temp_per_year = FOREACH max_temp_per_year GENERATE group AS year, MAX(temperature_data_with_year.temperature) AS max_temp;

-- Display the maximum temperature for each year
DUMP max_temp_per_year;
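The script above assumes a comma-separated input file whose first field begins with a four-digit year (so SUBSTRING(date, 0, 4) extracts the year); a made-up temperature_data.txt might look like:

2020-06-01,38.5
2020-12-15,22.0
2021-05-20,41.2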

7. Use Hive to create, alter, and drop databases, tables, views, functions, and indexes.
Hive is a data warehousing tool that allows you to perform SQL-like
operations on your data stored in Hadoop Distributed File System (HDFS).
Here, I'll provide examples of how to create, alter, and drop databases,
tables, views, functions, and indexes in Hive.

1. Creating a Database:
To create a database in Hive, you can use the CREATE DATABASE command:
CREATE DATABASE mydatabase;
2. Creating a Table:
To create a table in Hive, you can use the CREATE TABLE command:
CREATE TABLE mytable (id INT, name STRING, age INT);
3. Altering a Table:
You can alter an existing table to add, drop, or modify columns. For example, to add a new column:
ALTER TABLE mytable ADD COLUMNS (city STRING);
4. Creating a View:
To create a view in Hive, you can use the CREATE VIEW command:
CREATE VIEW myview AS SELECT name, age FROM mytable WHERE age > 18;
5. Creating a Function:
Hive allows you to create custom user-defined functions (UDFs). These are typically written in Java, packaged into a JAR, and registered with the CREATE FUNCTION command (the class name below is a placeholder):
CREATE FUNCTION myudf AS 'com.example.MyUDF' USING JAR 'my_udf.jar';
6. Creating an Index:
Hive supports indexing for tables. You can create an index like this:
CREATE INDEX myindex ON TABLE mytable (name) AS 'COMPACT' WITH DEFERRED REBUILD;
The WITH DEFERRED REBUILD option defers building the index data until you rebuild it explicitly (see the rebuild statement after this list).
7. Dropping:
To drop any of the above objects, you can use the respective DROP commands. For example:
- To drop a table: DROP TABLE mytable;
- To drop a database: DROP DATABASE mydatabase;
- To drop a view: DROP VIEW myview;
- To drop a function: DROP FUNCTION myudf;
- To drop an index: DROP INDEX myindex ON mytable;
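To actually build the index data that was deferred in step 6, Hive provides a rebuild statement; a minimal example matching the names above:

ALTER INDEX myindex ON mytable REBUILD;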

Remember to be cautious when dropping objects, as it will result in the permanent deletion of the object and its data.

Make sure you have the necessary privileges to perform these operations
in your Hive environment, as access control and user privileges play a
significant role in managing databases, tables, and other objects in Hive.

3. Write a Map Reduce program that mines weather data.
Hint: Weather sensors collecting data every hour at many locations across the globe gather a large volume of log data, which is a good candidate for analysis with Map Reduce, since it is semi-structured and record-oriented.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WeatherAnalysis {

    public static class WeatherMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private Text location = new Text();
        private IntWritable temperature = new IntWritable();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Split the input line into fields
            String[] fields = value.toString().split(",");

            if (fields.length >= 3) {
                // Extract location and temperature
                location.set(fields[0]);
                temperature.set(Integer.parseInt(fields[2]));

                // Emit location as the key and temperature as the value
                context.write(location, temperature);
            }
        }
    }

    public static class WeatherReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int maxTemperature = Integer.MIN_VALUE;

            // Find the maximum temperature for this location
            for (IntWritable value : values) {
                maxTemperature = Math.max(maxTemperature, value.get());
            }

            // Emit location and its maximum temperature
            context.write(key, new IntWritable(maxTemperature));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "weather analysis");

        job.setJarByClass(WeatherAnalysis.class);
        job.setMapperClass(WeatherMapper.class);
        job.setReducerClass(WeatherReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Specify the input and output paths
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
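The mapper above assumes comma-separated records where the first field is the location and the third field is an integer temperature, for example (made-up data):

Hyderabad,2020-06-01,42
Hyderabad,2020-06-02,39
Delhi,2020-06-01,44

The job then reports the maximum temperature seen for each location.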

1. Implement the following file management tasks in Hadoop:
i. Adding files and directories
ii. Retrieving files
iii. Deleting files
Hint: A typical Hadoop workflow creates data files (such as log files) elsewhere and copies them into HDFS using one of the above command-line utilities.

In Hadoop, you can perform file management tasks using various command-line utilities or programmatically using the Hadoop API. Here, I'll provide examples of how to add files and directories, retrieve files, and delete files using command-line utilities in Hadoop's HDFS (Hadoop Distributed File System).

Note: Make sure you have Hadoop installed and configured on your
cluster before performing these operations.

1. Adding Files and Directories to HDFS:
To add files and directories to HDFS, you can use the hadoop fs -copyFromLocal command. For example, to add a local file localfile.txt to HDFS in the /user/your_username/ directory:
hadoop fs -copyFromLocal localfile.txt /user/your_username/
To add an entire local directory, you can use the -put option:
hadoop fs -put local_directory /user/your_username/
2. Retrieving Files from HDFS:
To retrieve files from HDFS to your local file system, you can use the hadoop fs -copyToLocal command. For example, to retrieve a file named hdfsfile.txt from HDFS to your local directory:
hadoop fs -copyToLocal /user/your_username/hdfsfile.txt local_destination_directory/
3. Deleting Files in HDFS:
To delete files in HDFS, you can use the hadoop fs -rm command. For example, to delete a file named hdfsfile.txt in HDFS:
hadoop fs -rm /user/your_username/hdfsfile.txt
To delete an entire directory and its contents, you can use the -rm -r option:
hadoop fs -rm -r /user/your_username/hdfsdirectory/

These commands allow you to perform basic file management tasks in HDFS using the Hadoop command-line utilities. If you need to perform these tasks programmatically in a Hadoop application, you can use the Hadoop API in languages like Java, Python, or other supported languages.

Remember to replace /user/your_username/, localfile.txt, local_directory, hdfsfile.txt, and hdfsdirectory with your actual HDFS paths and file/directory names. Also, be cautious when performing deletion operations in HDFS, as they are irreversible.
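For the programmatic route mentioned above, a minimal Java sketch using the org.apache.hadoop.fs.FileSystem API could look like the following; it reuses the placeholder paths from the commands above and is only an illustration, not a complete application:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileOps {
    public static void main(String[] args) throws Exception {
        // Picks up HDFS settings from core-site.xml / hdfs-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // i. Adding a file (equivalent to hadoop fs -copyFromLocal)
        fs.copyFromLocalFile(new Path("localfile.txt"),
                             new Path("/user/your_username/localfile.txt"));

        // ii. Retrieving a file (equivalent to hadoop fs -copyToLocal)
        fs.copyToLocalFile(new Path("/user/your_username/hdfsfile.txt"),
                           new Path("local_destination_directory/hdfsfile.txt"));

        // iii. Deleting a directory and its contents (equivalent to hadoop fs -rm -r);
        // the second argument enables recursive deletion
        fs.delete(new Path("/user/your_username/hdfsdirectory"), true);

        fs.close();
    }
}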
