Big Data Lab
#!/usr/bin/env python
import sys

# Read input lines from stdin
for line in sys.stdin:
    # Remove leading and trailing whitespace
    line = line.strip()
    # Split the line into words
    words = line.split()
    # Emit each word with a count of 1
    for word in words:
        print(f"{word}\t1")
#!/usr/bin/env python
import sys
import subprocess

# Driver: run the mapper, sort its output by word, and pipe the sorted stream
# into the reducer. Assumes the two scripts above are saved as mapper.py and
# reducer.py in the current directory.
if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python driver.py <input_file>", file=sys.stderr)
        sys.exit(1)
    input_file = sys.argv[1]
    with open(input_file) as infile:
        map_process = subprocess.Popen(["python", "mapper.py"], stdin=infile, stdout=subprocess.PIPE)
        sort_process = subprocess.Popen(["sort"], stdin=map_process.stdout, stdout=subprocess.PIPE)
        reduce_process = subprocess.Popen(["python", "reducer.py"], stdin=sort_process.stdout)
    map_process.wait()
    reduce_process.wait()
Replace input.txt with the path to your input text file when invoking the driver, for example: python driver.py input.txt. The program reads the input file, performs word counting using the MapReduce paradigm, and prints the word counts.
Keep in mind that this example is quite basic and does not take
advantage of the full power and scalability of MapReduce as implemented
in distributed computing frameworks like Apache Hadoop. It serves as a
starting point to understand the fundamental concept of MapReduce.
-- Output the final relation of the corresponding Pig Latin word-count script
DUMP word_count_final;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MatrixMultiplication {

    // Map input: one matrix entry per line in the form "matrixName,row,column,value",
    // where matrixName is "A" (an m x n matrix) or "B" (an n x p matrix).
    public static class MapperClass extends Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            Configuration conf = context.getConfiguration();
            int m = Integer.parseInt(conf.get("m"));   // rows of A
            int p = Integer.parseInt(conf.get("p"));   // columns of B
            StringTokenizer itr = new StringTokenizer(value.toString(), ",");
            String matrixName = itr.nextToken();
            int row = Integer.parseInt(itr.nextToken());
            int column = Integer.parseInt(itr.nextToken());
            String element = itr.nextToken();
            if (matrixName.equals("A")) {
                // Emit intermediate key-value pairs for matrix A: one per output column
                for (int k = 0; k < p; k++)
                    context.write(new Text(row + "," + k), new Text("A," + column + "," + element));
            } else if (matrixName.equals("B")) {
                // Emit intermediate key-value pairs for matrix B: one per output row
                for (int i = 0; i < m; i++)
                    context.write(new Text(i + "," + column), new Text("B," + row + "," + element));
            }
        }
    }

    // Reduce: for output cell (i,k), line up row i of A with column k of B and take the dot product.
    public static class ReducerClass extends Reducer<Text, Text, Text, IntWritable> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            int n = Integer.parseInt(context.getConfiguration().get("n"));   // shared dimension
            int[] aRow = new int[n];
            int[] bCol = new int[n];
            for (Text val : values) {
                String[] parts = val.toString().split(",");
                String matrixName = parts[0];
                int index = Integer.parseInt(parts[1]);
                int value = Integer.parseInt(parts[2]);
                if (matrixName.equals("A")) {
                    aRow[index] = value;
                } else if (matrixName.equals("B")) {
                    bCol[index] = value;
                }
            }
            int result = 0;
            for (int j = 0; j < n; j++) result += aRow[j] * bCol[j];
            context.write(key, new IntWritable(result));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Matrix dimensions (example values): A is m x n, B is n x p
        conf.set("m", "2"); conf.set("n", "2"); conf.set("p", "2");
        Job job = Job.getInstance(conf, "matrix multiplication");
        job.setJarByClass(MatrixMultiplication.class);
        job.setMapperClass(MapperClass.class);
        job.setReducerClass(ReducerClass.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
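With the assumed comma-separated input format and the 2 x 2 example dimensions set in main(), an input file containing

A,0,0,1
A,0,1,2
A,1,0,3
A,1,1,4
B,0,0,5
B,0,1,6
B,1,0,7
B,1,1,8

describes A = [[1,2],[3,4]] and B = [[5,6],[7,8]], and the job writes the product cells as tab-separated pairs: 0,0 -> 19, 0,1 -> 22, 1,0 -> 43, 1,1 -> 50.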
6. Run the Pig Latin script to find the maximum temperature for each year
-- Load the temperature data
temperature_data = LOAD 'temperature_data.txt' USING PigStorage(',') AS (date:chararray, temperature:double);

-- Extract the year from the date
temperature_data_with_year = FOREACH temperature_data GENERATE SUBSTRING(date, 0, 4) AS year, temperature;

-- Group by year and find the max temperature for each year
grouped_by_year = GROUP temperature_data_with_year BY year;
max_temp_per_year = FOREACH grouped_by_year GENERATE group AS year, MAX(temperature_data_with_year.temperature) AS max_temp;

-- Output the result
DUMP max_temp_per_year;
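As a quick check, with a small hypothetical temperature_data.txt such as

1950-06-01,22.5
1950-07-15,25.0
1951-01-10,3.2
1951-08-02,28.7

the DUMP prints one tuple per year: (1950,25.0) and (1951,28.7).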
1. Creating a Database:
To create a database in Hive, you can use the CREATE DATABASE
command:
CREATE DATABASE mydatabase;
2. Creating a Table:
To create a table in Hive, you can use the CREATE TABLE command:
CREATE TABLE mytable (id INT, name STRING, age INT);
3. Altering a Table:
You can alter an existing table to add, drop, or modify columns. For
example, to add a new column:
ALTER TABLE mytable ADD COLUMNS (city STRING);
4. Creating a View:
To create a view in Hive, you can use the CREATE VIEW command:
CREATE VIEW myview AS SELECT name, age FROM mytable WHERE age > 18;
5. Creating a Function:
Hive allows you to register custom user-defined functions (UDFs). Functions registered with the CREATE FUNCTION command are implemented as Java classes packaged in a JAR (the class and JAR names below are placeholders):
CREATE FUNCTION myudf AS 'com.example.MyUDF' USING JAR 'hdfs:///udfs/my_udf.jar';
Scripts written in other languages such as Python are normally plugged into queries through the TRANSFORM clause instead, as sketched below.
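A minimal sketch of such a streaming script (hypothetical file my_udf.py, reading tab-separated name/age rows from standard input) is shown here; it could be wired into a query with ADD FILE my_udf.py; SELECT TRANSFORM(name, age) USING 'python my_udf.py' AS (name, age_group) FROM mytable;

#!/usr/bin/env python
import sys

# Hive's TRANSFORM clause streams each row to stdin as tab-separated fields (here: name, age)
for line in sys.stdin:
    name, age = line.strip().split("\t")
    # Emit name plus a coarse age group back to Hive, again tab-separated
    age_group = "adult" if int(age) >= 18 else "minor"
    print(f"{name}\t{age_group}")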
6. Creating an Index:
Hive supports indexing for tables. You can create an index like this:
CREATE INDEX myindex ON TABLE mytable (name) AS 'COMPACT' WITH DEFERRED REBUILD;
The WITH DEFERRED REBUILD option defers populating the index until you later run an ALTER INDEX ... REBUILD statement. (Hive indexes are only available in older releases; they were removed in Hive 3.0.)
7. Dropping:
To drop any of the above objects, you can use the respective DROP
commands. For example:
To drop a table: DROP TABLE mytable;
To drop a database: DROP DATABASE mydatabase;
To drop a view: DROP VIEW myview;
To drop a function: DROP FUNCTION myudf;
To drop an index: DROP INDEX myindex ON mytable;
Make sure you have the necessary privileges to perform these operations
in your Hive environment, as access control and user privileges play a
significant role in managing databases, tables, and other objects in Hive.
3. Write a MapReduce program that mines weather data. Hint: Weather sensors collecting data every hour at many locations across the globe gather a large volume of log data, which is a good candidate for analysis with MapReduce, since it is semi-structured and record-oriented.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WeatherAnalysis {

    // Mapper: each input line is assumed to be "location,date,temperature"
    public static class WeatherMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final Text location = new Text();
        private final IntWritable temperature = new IntWritable();
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length >= 3) {
                // Extract location and temperature
                location.set(fields[0]);
                temperature.set(Integer.parseInt(fields[2]));
                context.write(location, temperature);
            }
        }
    }

    // Reducer: keep the maximum temperature observed for each location
    public static class WeatherReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int maxTemp = Integer.MIN_VALUE;
            for (IntWritable val : values) maxTemp = Math.max(maxTemp, val.get());
            context.write(key, new IntWritable(maxTemp));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "weather analysis");
        job.setJarByClass(WeatherAnalysis.class);
        job.setMapperClass(WeatherMapper.class);
        job.setReducerClass(WeatherReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
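For example, if the (assumed) comma-separated input contains records such as

NewYork,2023-01-01,5
NewYork,2023-07-01,31
Delhi,2023-01-12,18
Delhi,2023-05-20,41

the job pairs each location with its maximum observed temperature: NewYork 31 and Delhi 41.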
1. Implement the following file management tasks in Hadoop:
i. Adding files and directories
ii. Retrieving files
iii. Deleting files
Hint: A typical Hadoop workflow creates data files (such as log files) elsewhere and copies them into HDFS using the HDFS command-line utilities.
Note: Make sure you have Hadoop installed and configured on your
cluster before performing these operations.
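A minimal sketch of the three tasks, driving the standard hdfs dfs utility from Python (the HDFS and local paths are placeholders):

#!/usr/bin/env python
import subprocess

def hdfs(*args):
    # Run an "hdfs dfs" command and raise an error if it exits with a non-zero status
    subprocess.run(["hdfs", "dfs", *args], check=True)

# i. Adding files and directories
hdfs("-mkdir", "-p", "/user/hadoop/logs")
hdfs("-put", "app.log", "/user/hadoop/logs/")

# ii. Retrieving files
hdfs("-get", "/user/hadoop/logs/app.log", "./app_copy.log")

# iii. Deleting files
hdfs("-rm", "/user/hadoop/logs/app.log")

The same operations can also be typed directly at the shell prompt (hdfs dfs -put app.log /user/hadoop/logs/, and so on).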