PIG Commands
In this chapter, we are going to discuss the basics of Pig Latin such as Pig Latin statements, data types, general and relational operators, and Pig Latin UDFs.
Example
Given below is a Pig Latin statement, which loads data to Apache Pig.
grunt> Student_data = LOAD 'student_data.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
Complex Types
12 Bag
A bag is a collection of tuples.
Example: {(raju,30),(Mohhammad,45)}
13 Map
A map is a set of key-value pairs.
Example: [name#Raju, age#30]
Null Values
Values for all the above data types can be NULL. Apache Pig treats null values in a similar
way as SQL does.
A null can be an unknown value or a non-existent value. It is used as a placeholder for
optional values. These nulls can occur naturally or can be the result of an operation.
+ Addition − Adds values on either side of the operator. Example: a + b will give 30.
/ Division − Divides the left-hand operand by the right-hand operand. Example: b / a will give 2.
% Modulus − Divides the left-hand operand by the right-hand operand and returns the remainder. Example: b % a will give 0.
?: Bincond − Evaluates the Boolean expression and returns one of two values, as in variable x = (expression) ? value1 if true : value2 if false. Example: b = (a == 1) ? 20 : 30; if a = 1, the value of b is 20; if a != 1, the value of b is 30.
== Equal − Checks if the values of two operands are equal or not; if yes, then the condition becomes true. Example: (a = b) is not true.
!= Not Equal − Checks if the values of two operands are equal or not. If the values are not equal, then the condition becomes true. Example: (a != b) is true.
matches Pattern matching − Checks whether the string on the left-hand side matches the constant on the right-hand side. Example: f1 matches '.*tutorial.*'
() Tuple constructor operator − This operator is used to construct a tuple. Example: (Raju, 30)
[] Map constructor operator − This operator is used to construct a map. Example: [name#Raja, age#30]
Operator Description
LOAD To load the data from the file system (local/HDFS) into a relation.
The chapters that follow cover the filtering, sorting, and diagnostic operators in detail.
Preparing HDFS
In MapReduce mode, Pig reads (loads) data from HDFS and stores the results back in HDFS.
Therefore, let us start HDFS and create the following sample data in HDFS.
The above dataset contains personal details like id, first name, last name, phone number and city, of
six students.
First of all, verify the installation using the hadoop version command, as shown below.
$ hadoop version
If your system contains Hadoop, and if you have set the PATH variable, then you will get the
following output −
Hadoop 2.6.0
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r
e3496499ecb8d220fba99dc5ed4c99c8f9e33bb1
Compiled by jenkins on 2014-11-13T21:10Z
Compiled with protoc 2.5.0
From source with checksum 18e43357c8f927c0695f1e9522859d6a
This command was run using /home/Hadoop/hadoop/share/hadoop/common/hadoop-common-2.6.0.jar
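Next, start the Hadoop daemons. The command that starts HDFS is not shown in the source; on a standard setup it is the following, after which YARN is started as shown below.
$ start-dfs.sh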
$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /home/Hadoop/hadoop/logs/yarn-Hadoop-resourcemanager-localhost.localdomain.out
localhost: starting nodemanager, logging to /home/Hadoop/hadoop/logs/yarn-Hadoop-nodemanager-localhost.localdomain.out
In Hadoop DFS, you can create directories using the command mkdir. Create a new directory in HDFS with the name pig_data in the required path as shown below.
$ cd $HADOOP_HOME/bin
$ hdfs dfs -mkdir hdfs://localhost:9000/pig_data
The input file of Pig contains each tuple/record in individual lines. And the entities of the record are
separated by a delimiter (In our example we used “,”).
In the local file system, create an input file student_data.txt containing data as shown below.
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
Now, move the file from the local file system to HDFS using put command as shown below. (You
can use copyFromLocal command as well.)
$ cd $HADOOP_HOME/bin
$ hdfs dfs -put /home/Hadoop/Pig/Pig_Data/student_data.txt hdfs://localhost:9000/pig_data/
You can use the cat command to verify whether the file has been moved into the HDFS, as shown
below.
$ cd $HADOOP_HOME/bin
$ hdfs dfs -cat hdfs://localhost:9000/pig_data/student_data.txt
Output
You can see the content of the file as shown below.
15/10/01 12:16:55 WARN util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
Syntax
The load statement consists of two parts divided by the "=" operator. On the left-hand side, we need to mention the name of the relation where we want to store the data, and on the right-hand side, we have to define how and from where we load the data. Given below is the syntax of the Load operator.
Relation_name = LOAD 'Input file path' USING function as schema;
Where,
relation_name − We have to mention the relation in which we want to store the data.
Input file path − We have to mention the HDFS directory where the file is stored. (In
MapReduce mode)
function − We have to choose a function from the set of load functions provided by Apache
Pig (BinStorage, JsonLoader, PigStorage, TextLoader).
Schema − We have to define the schema of the data. We can define the required schema as
follows −
(column1 : data type, column2 : data type, column3 : data type);
Note − If we load the data without specifying the schema, the columns will be addressed as $0, $1, $2, and so on (positional notation).
Example
As an example, let us load the data in student_data.txt in Pig under the schema
named Student using the LOAD command.
First of all, open the Linux terminal. Start the Pig Grunt shell in MapReduce mode as shown below.
$ pig -x mapreduce
grunt>
Now load the data from the file student_data.txt into Pig by executing the following Pig Latin
statement in the Grunt shell.
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt'
USING PigStorage(',')
as ( id:int, firstname:chararray, lastname:chararray, phone:chararray,
city:chararray );
Input file path − We are reading data from the file student_data.txt, which is in the /pig_data/ directory of HDFS.
Storage function − We have used the PigStorage() function. It loads and stores data as structured text files. It takes as a parameter the delimiter by which each entity of a tuple is separated. By default, it takes '\t' as the parameter.
Schema − We have stored the data using the following schema.
column: id, firstname, lastname, phone, city
datatype: int, chararray, chararray, chararray, chararray
Note − The load statement will simply load the data into the specified relation in Pig. To verify the
execution of the Load statement, you have to use the Diagnostic Operators which are discussed in
the next chapters.
Apache Pig - Storing Data
In the previous chapter, we learnt how to load data into Apache Pig. You can store the loaded data in
the file system using the store operator. This chapter explains how to store data in Apache Pig using
the Store operator.
Syntax
Given below is the syntax of the Store statement.
STORE Relation_name INTO ' required_directory_path ' [USING function];
Example
Assume we have a file student_data.txt in HDFS with the following content.
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
And we have read it into a relation student using the LOAD operator as shown below.
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt'
USING PigStorage(',')
as ( id:int, firstname:chararray, lastname:chararray, phone:chararray,
city:chararray );
Now, let us store the relation in the HDFS directory "/pig_Output/" as shown below.
grunt> STORE student INTO 'hdfs://localhost:9000/pig_Output/' USING PigStorage(',');
Output
After executing the store statement, you will get the following output. A directory is created with the
specified name and the data will be stored in it.
2015-10-05 13:05:05,429 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.
MapReduceLau ncher - 100% complete
2015-10-05 13:05:05,429 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats -
Script Statistics:
Verification
You can verify the stored data as shown below.
Step 1
First of all, list out the files in the directory named pig_Output using the ls command as shown below.
$ hdfs dfs -ls 'hdfs://localhost:9000/pig_Output/'
Found 2 items
-rw-r--r--   1 Hadoop supergroup     0 2015-10-05 13:03 hdfs://localhost:9000/pig_Output/_SUCCESS
-rw-r--r--   1 Hadoop supergroup   224 2015-10-05 13:03 hdfs://localhost:9000/pig_Output/part-m-00000
You can observe that two files were created after executing the store statement.
Step 2
Using cat command, list the contents of the file named part-m-00000 as shown below.
$ hdfs dfs -cat 'hdfs://localhost:9000/pig_Output/part-m-00000'
1,Rajiv,Reddy,9848022337,Hyderabad
2,siddarth,Battacharya,9848022338,Kolkata
3,Rajesh,Khanna,9848022339,Delhi
4,Preethi,Agarwal,9848022330,Pune
5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
6,Archana,Mishra,9848022335,Chennai
Dump operator
Describe operator
Explain operator
Illustrate operator
In this chapter, we will discuss the Dump operator of Pig Latin.
Dump Operator
The Dump operator is used to run the Pig Latin statements and display the results on the screen. It is generally used for debugging purposes.
Syntax
Given below is the syntax of the Dump operator.
grunt> Dump Relation_Name;
Example
Now, let us print the contents of the relation using the Dump operator as shown below.
grunt> Dump student
Once you execute the above Pig Latin statement, it will start a MapReduce job to read data from
HDFS. It will produce the following output.
2015-10-01 15:05:27,642 [main]
INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher -
100% complete
2015-10-01 15:05:27,652 [main]
INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2.6.0 0.15.0 Hadoop 2015-10-01 15:03:11 2015-10-01 05:27 UNKNOWN
Success!
Job Stats (time in seconds):
JobId job_14459_0004
Maps 1
Reduces 0
MaxMapTime n/a
MinMapTime n/a
AvgMapTime n/a
MedianMapTime n/a
MaxReduceTime 0
MinReduceTime 0
AvgReduceTime 0
MedianReducetime 0
Alias student
Feature MAP_ONLY
Outputs hdfs://localhost:9000/tmp/temp580182027/tmp757878456,
Counters: Total records written : 0 Total bytes written : 0 Spillable Memory Manager
spill count : 0Total bags proactively spilled: 0 Total records proactively spilled: 0
(1,Rajiv,Reddy,9848022337,Hyderabad)
(2,siddarth,Battacharya,9848022338,Kolkata)
(3,Rajesh,Khanna,9848022339,Delhi)
(4,Preethi,Agarwal,9848022330,Pune)
(5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar)
(6,Archana,Mishra,9848022335,Chennai)
Syntax
The syntax of the describe operator is as follows −
grunt> Describe Relation_name
Example
Assume we have a file student_data.txt in HDFS with the following content.
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai.
And we have read it into a relation student using the LOAD operator as shown below.
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',')
as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );
Now, let us describe the relation named student and verify the schema as shown below.
grunt> describe student;
Output
Once you execute the above Pig Latin statement, it will produce the following output.
grunt> student: { id: int,firstname: chararray,lastname: chararray,phone: chararray,city: chararray }
Syntax
Given below is the syntax of the explain operator.
grunt> explain Relation_name;
Example
Assume we have a file student_data.txt in HDFS with the following content.
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai.
And we have read it into a relation student using the LOAD operator as shown below.
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',')
as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );
Now, let us explain the relation named student using the explain operator as shown below.
grunt> explain student;
Output
It will print the logical, physical, and MapReduce execution plans of the relation student on the screen.
Syntax
Given below is the syntax of the illustrate operator.
grunt> illustrate Relation_name;
Example
Assume we have a file student_data.txt in HDFS with the following content.
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai.
And we have read it into a relation student using the LOAD operator as shown below.
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',')
as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );
Now, let us illustrate the relation named student as shown below.
grunt> illustrate student;
Output
On executing the above statement, you will get an output showing how a sample of the data flows through each statement.
The GROUP operator is used to group the data in one or more relations. It collects the data having
the same key.
Syntax
Given below is the syntax of the group operator.
grunt> Group_data = GROUP Relation_name BY Key;
Example
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown
below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Apache Pig with the relation name student_details as shown
below.
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
Now, let us group the records/tuples in the relation by age as shown below.
grunt> group_data = GROUP student_details by age;
Verification
Verify the relation group_data using the DUMP operator as shown below.
grunt> Dump group_data;
Output
Then you will get output displaying the contents of the relation named group_data as shown below.
Here you can observe that the resulting schema has two columns −
One is age, by which we have grouped the relation.
The other is a bag, which contains the group of tuples − the student records with the respective age.
(21,{(4,Preethi,Agarwal,21,9848022330,Pune),(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
(22,{(3,Rajesh,Khanna,22,9848022339,Delhi),(2,siddarth,Battacharya,22,9848022338,Kolkata)})
(23,{(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)})
(24,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334,trivendram)})
You can see the schema of the table after grouping the data using the describe command as shown
below.
grunt> Describe group_data;
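Assuming the schema above, describe prints the group key along with the bag of grouped tuples:
group_data: {group: int,student_details: {(id: int,firstname: chararray,lastname: chararray,age: int,phone: chararray,city: chararray)}}
Grouping by Multiple Columns
You can also group a relation by multiple columns. The tuples below are the result of grouping student_details by both age and city; the statements themselves are not shown in the source, but given the output they would be −
grunt> group_multiple = GROUP student_details by (age, city);
grunt> Dump group_multiple;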
((21,Pune),{(4,Preethi,Agarwal,21,9848022330,Pune)})
((21,Hyderabad),{(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
((22,Delhi),{(3,Rajesh,Khanna,22,9848022339,Delhi)})
((22,Kolkata),{(2,siddarth,Battacharya,22,9848022338,Kolkata)})
((23,Chennai),{(6,Archana,Mishra,23,9848022335,Chennai)})
((23,Bhuwaneshwar),{(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)})
((24,Chennai),{(8,Bharathi,Nambiayar,24,9848022333,Chennai)})
((24,trivendram),{(7,Komal,Nayak,24,9848022334,trivendram)})
Group All
You can group a relation by all the columns as shown below.
grunt> group_all = GROUP student_details All;
Now, verify the content of the relation group_all as shown below.
grunt> Dump group_all;
(all,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334,trivendram),
(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar),
(4,Preethi,Agarwal,21,9848022330,Pune),(3,Rajesh,Khanna,22,9848022339,Delhi),
(2,siddarth,Battacharya,22,9848022338,Kolkata),(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
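The COGROUP example below also uses a second relation, employee_details, whose data and LOAD statement are not shown in the source. Judging from the dumped bags that follow, a matching file and LOAD statement might look like this (values inferred from the output):
employee_details.txt
001,Robin,22,newyork
002,BOB,23,Kolkata
003,Maya,23,Tokyo
004,Sara,25,London
005,David,23,Bhuwaneshwar
006,Maggy,22,Chennai
grunt> employee_details = LOAD 'hdfs://localhost:9000/pig_data/employee_details.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, city:chararray);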
Now, let us group the records/tuples of the relations student_details and employee_details with the
key age, as shown below.
grunt> cogroup_data = COGROUP student_details by age, employee_details by age;
Verification
Verify the relation cogroup_data using the DUMP operator as shown below.
grunt> Dump cogroup_data;
Output
It will produce the following output, displaying the contents of the relation named cogroup_data as
shown below.
(21,{(4,Preethi,Agarwal,21,9848022330,Pune),(1,Rajiv,Reddy,21,9848022337,Hyderabad)},{})
(22,{(3,Rajesh,Khanna,22,9848022339,Delhi),(2,siddarth,Battacharya,22,9848022338,Kolkata)},
{(6,Maggy,22,Chennai),(1,Robin,22,newyork)})
(23,{(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)},
{(5,David,23,Bhuwaneshwar),(3,Maya,23,Tokyo),(2,BOB,23,Kolkata)})
(24,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334,trivendram)},{})
(25,{},{(4,Sara,25,London)})
The cogroup operator groups the tuples from each relation according to age where each group
depicts a particular age value.
For example, if we consider the 1st tuple of the result, it is grouped by age 21. And it contains two
bags −
the first bag holds all the tuples from the first relation (student_details in this case) having
age 21, and
the second bag contains all the tuples from the second relation (employee_details in this
case) having age 21.
In case a relation doesn't have tuples having the age value 21, it returns an empty bag.
The JOIN operator is used to combine records from two or more relations. While performing a join operation, we declare one (or a group of) field(s) from each relation as keys. When these keys match, the two particular tuples are matched, else the records are dropped. Joins can be of the following types −
Self-join
Inner-join
Outer-join − left join, right join, and full join
This chapter explains with examples how to use the join operator in Pig Latin. Assume that we have
two files namely customers.txt and orders.txt in the /pig_data/ directory of HDFS as shown below.
customers.txt
1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00
orders.txt
102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060
And we have loaded these two files into Pig with the relations customers and orders as shown
below.
grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
as (id:int, name:chararray, age:int, address:chararray, salary:int);
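The LOAD statement for the orders relation is not shown in the source. A plausible version, with field names inferred from the join examples below (the third column is the customer id), is −
grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',')
   as (oid:int, date:chararray, customer_id:int, amount:int);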
Self-join
Self-join is used to join a table with itself as if the table were two relations, temporarily renaming at
least one relation.
Generally, in Apache Pig, to perform self-join, we will load the same data multiple times, under
different aliases (names). Therefore let us load the contents of the file customers.txt as two tables as
shown below.
grunt> customers1 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
as (id:int, name:chararray, age:int, address:chararray, salary:int);
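The second alias is not shown in the source; it would be loaded from the same file in the same way −
grunt> customers2 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, address:chararray, salary:int);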
Syntax
Given below is the syntax of performing self-join operation using the JOIN operator.
grunt> Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key ;
Example
Let us perform self-join operation on the relation customers, by joining the two
relations customers1 and customers2 as shown below.
grunt> customers3 = JOIN customers1 BY id, customers2 BY id;
Verification
Verify the relation customers3 using the DUMP operator as shown below.
grunt> Dump customers3;
Output
It will produce the following output, displaying the contents of the relation customers3.
(1,Ramesh,32,Ahmedabad,2000,1,Ramesh,32,Ahmedabad,2000)
(2,Khilan,25,Delhi,1500,2,Khilan,25,Delhi,1500)
(3,kaushik,23,Kota,2000,3,kaushik,23,Kota,2000)
(4,Chaitali,25,Mumbai,6500,4,Chaitali,25,Mumbai,6500)
(5,Hardik,27,Bhopal,8500,5,Hardik,27,Bhopal,8500)
(6,Komal,22,MP,4500,6,Komal,22,MP,4500)
(7,Muffy,24,Indore,10000,7,Muffy,24,Indore,10000)
Inner Join
Inner Join is used quite frequently; it is also referred to as equijoin. An inner join returns rows
when there is a match in both tables.
It creates a new relation by combining column values of two relations (say A and B) based upon the
join-predicate. The query compares each row of A with each row of B to find all pairs of rows which
satisfy the join-predicate. When the join-predicate is satisfied, the column values for each matched
pair of rows of A and B are combined into a result row.
Syntax
Here is the syntax of performing inner join operation using the JOIN operator.
grunt> result = JOIN relation1 BY columnname, relation2 BY columnname;
Example
Let us perform inner join operation on the two relations customers and orders as shown below.
grunt> coustomer_orders = JOIN customers BY id, orders BY customer_id;
Verification
Verify the relation coustomer_orders using the DUMP operator as shown below.
grunt> Dump coustomer_orders;
Output
You will get the following output showing the contents of the relation named coustomer_orders.
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
Note − Unlike inner join, outer join returns all the rows from at least one of the relations. An outer join operation is carried out in three ways − left outer join, right outer join, and full outer join.
Syntax
Given below is the syntax of performing left outer join operation using the JOIN operator.
grunt> Relation3_name = JOIN Relation1_name BY id LEFT OUTER, Relation2_name BY customer_id;
Example
Let us perform left outer join operation on the two relations customers and orders as shown below.
grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;
Verification
Verify the relation outer_left using the DUMP operator as shown below.
grunt> Dump outer_left;
Output
It will produce the following output, displaying the contents of the relation outer_left.
(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)
Syntax
Given below is the syntax of performing right outer join operation using the JOIN operator.
grunt> Relation3_name = JOIN Relation1_name BY id RIGHT OUTER, Relation2_name BY customer_id;
Example
Let us perform right outer join operation on the two relations customers and orders as shown
below.
grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;
Verification
Verify the relation outer_right using the DUMP operator as shown below.
grunt> Dump outer_right;
Output
It will produce the following output, displaying the contents of the relation outer_right.
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
Syntax
Given below is the syntax of performing full outer join using the JOIN operator.
grunt> Relation3_name = JOIN Relation1_name BY id FULL OUTER, Relation2_name BY customer_id;
Example
Let us perform full outer join operation on the two relations customers and orders as shown below.
grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;
Verification
Verify the relation outer_full using the DUMP operator as shown below.
grunt> Dump outer_full;
Output
It will produce the following output, displaying the contents of the relation outer_full.
(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)
Syntax
Here is how you can perform a JOIN operation on two tables using multiple keys.
grunt> Relation3_name = JOIN Relation1_name BY (key1, key2), Relation2_name BY (key1, key2);
Assume that we have two files namely employee.txt and employee_contact.txt in
the /pig_data/ directory of HDFS as shown below.
employee.txt
001,Rajiv,Reddy,21,programmer,003
002,siddarth,Battacharya,22,programmer,003
003,Rajesh,Khanna,22,programmer,003
004,Preethi,Agarwal,21,programmer,003
005,Trupthi,Mohanthy,23,programmer,003
006,Archana,Mishra,23,programmer,003
007,Komal,Nayak,24,teamlead,002
008,Bharathi,Nambiayar,24,manager,001
employee_contact.txt
001,9848022337,Rajiv@gmail.com,Hyderabad,003
002,9848022338,siddarth@gmail.com,Kolkata,003
003,9848022339,Rajesh@gmail.com,Delhi,003
004,9848022330,Preethi@gmail.com,Pune,003
005,9848022336,Trupthi@gmail.com,Bhuwaneshwar,003
006,9848022335,Archana@gmail.com,Chennai,003
007,9848022334,Komal@gmail.com,trivendram,002
008,9848022333,Bharathi@gmail.com,Chennai,001
And we have loaded these two files into Pig with relations employee and employee_contact as
shown below.
grunt> employee = LOAD 'hdfs://localhost:9000/pig_data/employee.txt' USING PigStorage(',')
as (id:int, firstname:chararray, lastname:chararray, age:int, designation:chararray, jobid:int);
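The LOAD for the second relation is not shown in the source; a plausible version, with the schema inferred from the data above, is −
grunt> employee_contact = LOAD 'hdfs://localhost:9000/pig_data/employee_contact.txt' USING PigStorage(',')
   as (id:int, phone:chararray, email:chararray, city:chararray, jobid:int);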
Now, let us join the contents of these two relations using the JOIN operator as shown below.
grunt> emp = JOIN employee BY (id,jobid), employee_contact BY (id,jobid);
Verification
Verify the relation emp using the DUMP operator as shown below.
grunt> Dump emp;
Output
It will produce the following output, displaying the contents of the relation named emp as shown
below.
(1,Rajiv,Reddy,21,programmer,3,1,9848022337,Rajiv@gmail.com,Hyderabad,3)
(2,siddarth,Battacharya,22,programmer,3,2,9848022338,siddarth@gmail.com,Kolkata,3)
(3,Rajesh,Khanna,22,programmer,3,3,9848022339,Rajesh@gmail.com,Delhi,3)
(4,Preethi,Agarwal,21,programmer,3,4,9848022330,Preethi@gmail.com,Pune,3)
(5,Trupthi,Mohanthy,23,programmer,3,5,9848022336,Trupthi@gmail.com,Bhuwaneshwar,3)
(6,Archana,Mishra,23,programmer,3,6,9848022335,Archana@gmail.com,Chennai,3)
(7,Komal,Nayak,24,teamlead,2,7,9848022334,Komal@gmail.com,trivendram,2)
(8,Bharathi,Nambiayar,24,manager,1,8,9848022333,Bharathi@gmail.com,Chennai,1)
Syntax
Given below is the syntax of the CROSS operator.
grunt> Relation3_name = CROSS Relation1_name, Relation2_name;
Example
Assume that we have two files namely customers.txt and orders.txt in the /pig_data/ directory of
HDFS as shown below.
customers.txt
1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00
orders.txt
102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060
And we have loaded these two files into Pig with the relations customers and orders as shown
below.
grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
as (id:int, name:chararray, age:int, address:chararray, salary:int);
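The matching LOAD for the orders relation is not shown in the source; a plausible version (field names assumed) is −
grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',')
   as (oid:int, date:chararray, customer_id:int, amount:int);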
Let us now get the cross-product of these two relations using the cross operator on these two
relations as shown below.
grunt> cross_data = CROSS customers, orders;
Verification
Verify the relation cross_data using the DUMP operator as shown below.
grunt> Dump cross_data;
Output
It will produce the following output, displaying the contents of the relation cross_data.
(7,Muffy,24,Indore,10000,103,2008-05-20 00:00:00,4,2060)
(7,Muffy,24,Indore,10000,101,2009-11-20 00:00:00,2,1560)
(7,Muffy,24,Indore,10000,100,2009-10-08 00:00:00,3,1500)
(7,Muffy,24,Indore,10000,102,2009-10-08 00:00:00,3,3000)
(6,Komal,22,MP,4500,103,2008-05-20 00:00:00,4,2060)
(6,Komal,22,MP,4500,101,2009-11-20 00:00:00,2,1560)
(6,Komal,22,MP,4500,100,2009-10-08 00:00:00,3,1500)
(6,Komal,22,MP,4500,102,2009-10-08 00:00:00,3,3000)
(5,Hardik,27,Bhopal,8500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,101,2009-11-20 00:00:00,2,1560)
(5,Hardik,27,Bhopal,8500,100,2009-10-08 00:00:00,3,1500)
(5,Hardik,27,Bhopal,8500,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(4,Chaitali,25,Mumbai,6500,101,2009-11-20 00:00:00,2,1560)
(4,Chaitali,25,Mumbai,6500,100,2009-10-08 00:00:00,3,1500)
(4,Chaitali,25,Mumbai,6500,102,2009-10-08 00:00:00,3,3000)
(3,kaushik,23,Kota,2000,103,2008-05-20 00:00:00,4,2060)
(3,kaushik,23,Kota,2000,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(2,Khilan,25,Delhi,1500,103,2008-05-20 00:00:00,4,2060)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(2,Khilan,25,Delhi,1500,100,2009-10-08 00:00:00,3,1500)
(2,Khilan,25,Delhi,1500,102,2009-10-08 00:00:00,3,3000)
(1,Ramesh,32,Ahmedabad,2000,103,2008-05-20 00:00:00,4,2060)
(1,Ramesh,32,Ahmedabad,2000,101,2009-11-20 00:00:00,2,1560)
(1,Ramesh,32,Ahmedabad,2000,100,2009-10-08 00:00:00,3,1500)
(1,Ramesh,32,Ahmedabad,2000,102,2009-10-08 00:00:00,3,3000)
Syntax
Given below is the syntax of the UNION operator.
grunt> Relation_name3 = UNION Relation_name1, Relation_name2;
Example
Assume that we have two files namely student_data1.txt and student_data2.txt in
the /pig_data/ directory of HDFS as shown below.
student_data1.txt
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
student_data2.txt
7,Komal,Nayak,9848022334,trivendram
8,Bharathi,Nambiayar,9848022333,Chennai
And we have loaded these two files into Pig with the relations student1 and student2 as shown
below.
grunt> student1 = LOAD 'hdfs://localhost:9000/pig_data/student_data1.txt' USING PigStorage(',')
as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
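The LOAD for the second relation is not shown in the source; it mirrors the first −
grunt> student2 = LOAD 'hdfs://localhost:9000/pig_data/student_data2.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);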
Let us now merge the contents of these two relations using the UNION operator as shown below.
grunt> student = UNION student1, student2;
Verification
Verify the relation student using the DUMP operator as shown below.
grunt> Dump student;
Output
It will produce the following output, displaying the contents of the relation student.
(1,Rajiv,Reddy,9848022337,Hyderabad)
(2,siddarth,Battacharya,9848022338,Kolkata)
(3,Rajesh,Khanna,9848022339,Delhi)
(4,Preethi,Agarwal,9848022330,Pune)
(5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar)
(6,Archana,Mishra,9848022335,Chennai)
(7,Komal,Nayak,9848022334,trivendram)
(8,Bharathi,Nambiayar,9848022333,Chennai)
Syntax
Given below is the syntax of the SPLIT operator.
grunt> SPLIT Relation1_name INTO Relation2_name IF (condition1), Relation3_name IF (condition2);
Example
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown
below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Pig with the relation name student_details as shown below.
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
Let us now split the relation into two, one listing the students of age less than 23, and the other listing the students having age between 22 and 25.
grunt> SPLIT student_details into student_details1 if age < 23, student_details2 if (age > 22 and age < 25);
Verification
Verify the relations student_details1 and student_details2 using the DUMP operator as shown
below.
grunt> Dump student_details1;
Output
It will produce the following output, displaying the contents of the relations student_details1 (age less than 23) and student_details2 (age between 22 and 25) respectively.
(1,Rajiv,Reddy,21,9848022337,Hyderabad)
(2,siddarth,Battacharya,22,9848022338,Kolkata)
(3,Rajesh,Khanna,22,9848022339,Delhi)
(4,Preethi,Agarwal,21,9848022330,Pune)
grunt> Dump student_details2;
(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)
(6,Archana,Mishra,23,9848022335,Chennai)
(7,Komal,Nayak,24,9848022334,trivendram)
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
Syntax
Given below is the syntax of the FILTER operator.
grunt> Relation2_name = FILTER Relation1_name BY (condition);
Example
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown
below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Pig with the relation name student_details as shown below.
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
Let us now use the Filter operator to get the details of the students who belong to the city Chennai.
grunt> filter_data = FILTER student_details BY city == 'Chennai';
Verification
Verify the relation filter_data using the DUMP operator as shown below.
grunt> Dump filter_data;
Output
It will produce the following output, displaying the contents of the relation filter_data as follows.
(6,Archana,Mishra,23,9848022335,Chennai)
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
Syntax
Given below is the syntax of the DISTINCT operator.
grunt> Relation_name2 = DISTINCT Relation_name1;
Example
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown
below.
student_details.txt
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
006,Archana,Mishra,9848022335,Chennai
And we have loaded this file into Pig with the relation name student_details as shown below.
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
Let us now remove the redundant (duplicate) tuples from the relation named student_details using
the DISTINCT operator, and store it as another relation named distinct_data as shown below.
grunt> distinct_data = DISTINCT student_details;
Verification
Verify the relation distinct_data using the DUMP operator as shown below.
grunt> Dump distinct_data;
Output
It will produce the following output, displaying the contents of the relation distinct_data as follows.
(1,Rajiv,Reddy,9848022337,Hyderabad)
(2,siddarth,Battacharya,9848022338,Kolkata)
(3,Rajesh,Khanna,9848022339,Delhi)
(4,Preethi,Agarwal,9848022330,Pune)
(5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar)
(6,Archana,Mishra,9848022335,Chennai)
Syntax
Given below is the syntax of FOREACH operator.
grunt> Relation_name2 = FOREACH Relation_name1 GENERATE (required data);
Example
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown
below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Pig with the relation name student_details as shown below.
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
as (id:int, firstname:chararray, lastname:chararray,age:int, phone:chararray, city:chararray);
Let us now get the id, age, and city values of each student from the relation student_details and
store it into another relation named foreach_data using the foreach operator as shown below.
grunt> foreach_data = FOREACH student_details GENERATE id,age,city;
Verification
Verify the relation foreach_data using the DUMP operator as shown below.
grunt> Dump foreach_data;
Output
It will produce the following output, displaying the contents of the relation foreach_data.
(1,21,Hyderabad)
(2,22,Kolkata)
(3,22,Delhi)
(4,21,Pune)
(5,23,Bhuwaneshwar)
(6,23,Chennai)
(7,24,trivendram)
(8,24,Chennai)
Syntax
Given below is the syntax of the ORDER BY operator.
grunt> Relation_name2 = ORDER Relation_name1 BY column_name (ASC|DESC);
Example
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Pig with the relation name student_details as shown below.
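The LOAD statement is not shown in the source; a plausible version, matching the earlier chapters, is −
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);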
Let us now sort the relation in a descending order based on the age of the student and store it into another relation
named order_by_data using the ORDER BY operator as shown below.
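The sorting statement itself is not shown in the source; given the output below, it would be −
grunt> order_by_data = ORDER student_details BY age DESC;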
Verification
Verify the relation order_by_data using the DUMP operator as shown below.
grunt> Dump order_by_data;
Output
It will produce the following output, displaying the contents of the relation order_by_data.
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
(7,Komal,Nayak,24,9848022334,trivendram)
(6,Archana,Mishra,23,9848022335,Chennai)
(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)
(3,Rajesh,Khanna,22,9848022339,Delhi)
(2,siddarth,Battacharya,22,9848022338,Kolkata)
(4,Preethi,Agarwal,21,9848022330,Pune)
(1,Rajiv,Reddy,21,9848022337,Hyderabad)
The LIMIT operator is used to get a limited number of tuples from a relation.
Syntax
Given below is the syntax of the LIMIT operator.
grunt> Result = LIMIT Relation_name required number of tuples;
Example
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Pig with the relation name student_details as shown below.
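The LOAD statement is not shown in the source; a plausible version, matching the earlier chapters, is −
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);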
Now, let's get the first four tuples of the relation and store them into another relation named limit_data using the LIMIT operator as shown below.
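The statement itself is not shown in the source; given the output below, it would be −
grunt> limit_data = LIMIT student_details 4;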
Verification
Verify the relation limit_data using the DUMP operator as shown below.
grunt> Dump limit_data;
Output
It will produce the following output, displaying the contents of the relation limit_data as follows.
(1,Rajiv,Reddy,21,9848022337,Hyderabad)
(2,siddarth,Battacharya,22,9848022338,Kolkata)
(3,Rajesh,Khanna,22,9848022339,Delhi)
(4,Preethi,Agarwal,21,9848022330,Pune)
Apache Pig provides various built-in functions namely eval, load, store, math, string, bag and tuple functions.
Eval Functions
Given below is the list of eval functions provided by Apache Pig.
1 AVG()
To compute the average of the numerical values within a bag.
2 BagToString()
To concatenate the elements of a bag into a string. While concatenating, we can place a
delimiter between these values (optional).
3 CONCAT()
To concatenate two or more expressions of the same type.
4 COUNT()
To get the number of elements in a bag, i.e., to count the number of tuples in a bag.
5 COUNT_STAR()
It is similar to the COUNT() function. It is used to get the number of elements in a bag.
6 DIFF()
To compare two bags (fields) in a tuple.
7 IsEmpty()
To check if a bag or map is empty.
8 MAX()
To calculate the highest value for a column (numeric values or chararrays) in a single-
column bag.
9 MIN()
To get the minimum (lowest) value (numeric or chararray) for a certain column in a single-
column bag.
10 PluckTuple()
Using the Pig Latin PluckTuple() function, we can define a string Prefix and filter the
columns in a relation that begin with the given prefix.
11 SIZE()
To compute the number of elements based on any Pig data type.
12 SUBTRACT()
To subtract two bags. It takes two bags as inputs and returns a bag which contains the tuples
of the first bag that are not in the second bag.
13 SUM()
To get the total of the numeric values of a column in a single-column bag.
14 TOKENIZE()
To split a string (which contains a group of words) in a single tuple and return a bag which
contains the output of the split operation.
To get the global average value, we need to perform a Group All operation, and calculate the average value using
the AVG() function.
To get the average value of a group, we need to group it using the Group By operator and proceed with the average
function.
Syntax
Given below is the syntax of the AVG() function.
grunt> AVG(expression)
Example
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad,89
002,siddarth,Battacharya,22,9848022338,Kolkata,78
003,Rajesh,Khanna,22,9848022339,Delhi,90
004,Preethi,Agarwal,21,9848022330,Pune,93
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar,75
006,Archana,Mishra,23,9848022335,Chennai,87
007,Komal,Nayak,24,9848022334,trivendram,83
008,Bharathi,Nambiayar,24,9848022333,Chennai,72
And we have loaded this file into Pig with the relation name student_details as shown below.
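The statements are not shown in the source; plausible versions, with a gpa column added to match the data, are −
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray, gpa:int);
grunt> student_group_all = GROUP student_details All;
Dumping student_group_all then displays the following.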
(all,{(8,Bharathi,Nambiayar,24,9848022333,Chennai,72),
(7,Komal,Nayak,24,9848022334,trivendram,83),
(6,Archana,Mishra,23,9848022335,Chennai,87),
(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar,75),
(4,Preethi,Agarwal,21,9848022330,Pune,93),
(3,Rajesh,Khanna,22,9848022339,Delhi,90),
(2,siddarth,Battacharya,22,9848022338,Kolkata,78),
(1,Rajiv,Reddy,21,9848022337,Hyderabad,89)})
Let us now calculate the global average GPA of all the students using the AVG() function as shown below.
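The statement is not shown in the source; a plausible form, generating the names and gpa values alongside the average, is −
grunt> student_gpa_avg = FOREACH student_group_all GENERATE (student_details.firstname, student_details.gpa), AVG(student_details.gpa);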
Verification
Verify the relation student_gpa_avg using the DUMP operator as shown below.
grunt> Dump student_gpa_avg;
Output
It will display the global average of the gpa column; for the data above, that is (89 + 78 + 90 + 93 + 75 + 87 + 83 + 72) / 8 = 83.375.
Bags are inherently unordered; their tuples can be arranged using the ORDER BY operator.
Syntax
Given below is the syntax of the BagToString() function.
grunt> BagToString(vals:bag [, delimiter:chararray])
Example
Assume that we have a file named dateofbirth.txt in the HDFS directory /pig_data/ as shown below. This file contains dates of birth.
dateofbirth.txt
22,3,1990
23,11,1989
1,3,1998
2,6,1980
26,9,1989
And we have loaded this file into Pig with the relation name dob as shown below.
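The statements are not shown in the source; plausible versions are −
grunt> dob = LOAD 'hdfs://localhost:9000/pig_data/dateofbirth.txt' USING PigStorage(',') as (day:int, month:int, year:int);
grunt> group_dob = GROUP dob All;
Dumping group_dob then displays the following.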
(all,{(26,9,1989),(2,6,1980),(1,3,1998),(23,11,1989),(22,3,1990)})
Here, we can observe a bag having all the dates of birth as its tuples. Now, let's convert the bag to a string using the function BagToString().
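The conversion statement is not shown in the source; given the '_'-separated output below, it would be −
grunt> dob_string = FOREACH group_dob GENERATE BagToString(dob,'_');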
Verification
Verify the relation dob_string using the DUMP operator as shown below.
grunt> Dump dob_string;
Output
It will produce the following output, displaying the contents of the relation dob_string.
(26_9_1989_2_6_1980_1_3_1998_23_11_1989_22_3_1990)
Syntax
grunt> CONCAT (expression, expression, [...expression])
Example
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad,89
002,siddarth,Battacharya,22,9848022338,Kolkata,78
003,Rajesh,Khanna,22,9848022339,Delhi,90
004,Preethi,Agarwal,21,9848022330,Pune,93
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar,75
006,Archana,Mishra,23,9848022335,Chennai,87
007,Komal,Nayak,24,9848022334,trivendram,83
008,Bharathi,Nambiayar,24,9848022333,Chennai,72
And we have loaded this file into Pig with the relation name student_details as shown below.
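The LOAD statement is not shown in the source; a plausible version is −
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray, gpa:int);
Dumping student_details then displays the following.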
(1,Rajiv,Reddy,21,9848022337,Hyderabad,89)
(2,siddarth,Battacharya,22,9848022338,Kolkata,78)
(3,Rajesh,Khanna,22,9848022339,Delhi,90)
(4,Preethi,Agarwal,21,9848022330,Pune,93)
(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar,75)
(6,Archana,Mishra,23,9848022335,Chennai,87)
(7,Komal,Nayak,24,9848022334,trivendram,83)
(8,Bharathi,Nambiayar,24,9848022333,Chennai,72)
And, verify the schema using describe operator as shown below.
grunt> Describe student_details;
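Assuming the schema above, describe prints −
student_details: {id: int, firstname: chararray, lastname: chararray, age: int, phone: chararray, city: chararray, gpa: int}
Now, let us concatenate the first name and last name of each student using the CONCAT() function. The statement is not shown in the source; given the output below, it would be −
grunt> student_name_concat = FOREACH student_details GENERATE CONCAT(firstname, lastname);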
Verification
Verify the relation student_name_concat using the DUMP operator as shown below.
grunt> Dump student_name_concat;
Output
It will produce the following output, displaying the contents of the relation student_name_concat.
(RajivReddy)
(siddarthBattacharya)
(RajeshKhanna)
(PreethiAgarwal)
(TrupthiMohanthy)
(ArchanaMishra)
(KomalNayak)
(BharathiNambiayar)
We can also use an optional delimiter between the two expressions as shown below.
Now, let us concatenate the first name and last name of the student records in the student_details relation by placing '_' between them as shown below.
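The statement is not shown in the source; given the output below, it would be −
grunt> student_name_concat = FOREACH student_details GENERATE CONCAT(firstname,'_',lastname);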
Verification
Verify the relation student_name_concat using the DUMP operator as shown below.
grunt> Dump student_name_concat;
Output
It will produce the following output, displaying the contents of the relation student_name_concat as follows.
(Rajiv_Reddy)
(siddarth_Battacharya)
(Rajesh_Khanna)
(Preethi_Agarwal)
(Trupthi_Mohanthy)
(Archana_Mishra)
(Komal_Nayak)
(Bharathi_Nambiayar)
To get the global count value (total number of tuples in a bag), we need to perform a Group All operation, and calculate
the count value using the COUNT() function.
To get the count value of a group (Number of tuples in a group), we need to group it using the Group By operator and
proceed with the count function.
Syntax
Given below is the syntax of the COUNT() function.
grunt> COUNT(expression)
Example
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad,89
002,siddarth,Battacharya,22,9848022338,Kolkata,78
003,Rajesh,Khanna,22,9848022339,Delhi,90
004,Preethi,Agarwal,21,9848022330,Pune,93
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar,75
006,Archana,Mishra,23,9848022335,Chennai,87
007,Komal,Nayak,24,9848022334,trivendram,83
008,Bharathi,Nambiayar,24,9848022333,Chennai,72
And we have loaded this file into Pig with the relation named student_details as shown below.
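The statements are not shown in the source; plausible versions, with a gpa column added to match the data, are −
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray, gpa:int);
grunt> student_group_all = GROUP student_details All;
Dumping student_group_all then displays the following.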
(all,{(8,Bharathi,Nambiayar,24,9848022333,Chennai,72),
(7,Komal,Nayak,24,9848022334,trivendram,83),
(6,Archana,Mishra,23,9848022335,Chennai,87),
(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar,75),
(4,Preethi,Agarwal,21,9848022330,Pune,93),
(3,Rajesh,Khanna,22,9848022339,Delhi,90),
(2,siddarth,Battacharya,22,9848022338,Kolkata,78),
(1,Rajiv,Reddy,21,9848022337,Hyderabad,89)})
Let us now calculate the number of tuples/records in the relation.
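The statement is not shown in the source; a plausible form is −
grunt> student_count = FOREACH student_group_all GENERATE COUNT(student_details.gpa);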
Verification
Verify the relation student_count using the DUMP operator as shown below.
grunt> Dump student_count;
Output
It will produce the following output, displaying the contents of the relation student_count.
8
Apache Pig - COUNT_STAR()
The COUNT_STAR() function of Pig Latin is similar to the COUNT() function. It is used to get the number of elements in a bag.
While counting the elements, the COUNT_STAR() function includes the NULL values.
Note −
To get the global count value (total number of tuples in a bag), we need to perform a Group All operation, and calculate
the count_star value using the COUNT_STAR() function.
To get the count value of a group (Number of tuples in a group), we need to group it using the Group By operator and
proceed with the count_star function.
Syntax
Given below is the syntax of the COUNT_STAR() function.
grunt> COUNT_STAR(expression)
Example
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below. This file contains an
empty record.
student_details.txt
,,,,,,,
001,Rajiv,Reddy,21,9848022337,Hyderabad,89
002,siddarth,Battacharya,22,9848022338,Kolkata,78
003,Rajesh,Khanna,22,9848022339,Delhi,90
004,Preethi,Agarwal,21,9848022330,Pune,93
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar,75
006,Archana,Mishra,23,9848022335,Chennai,87
007,Komal,Nayak,24,9848022334,trivendram,83
008,Bharathi,Nambiayar,24,9848022333,Chennai,72
And we have loaded this file into Pig with the relation name student_details as shown below.
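The statements are not shown in the source; plausible versions are the same as in the COUNT() chapter −
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray, gpa:int);
grunt> student_group_all = GROUP student_details All;
Dumping student_group_all then displays the following (note the empty tuple at the end).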
(all,{(8,Bharathi,Nambiayar,24,9848022333,Chennai,72),
(7,Komal,Nayak,24,9848022334,trivendram,83),
(6,Archana,Mishra,23,9848022335,Chennai,87),
(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar,75),
(4,Preethi,Agarwal,21,9848022330,Pune,93),
(3,Rajesh,Khanna,22,9848022339,Delhi,90),
(2,siddarth,Battacharya,22,9848022338,Kolkata,78),
(1,Rajiv,Reddy,21,9848022337,Hyderabad,89),
(,,,,,,)})
Let us now calculate the number of tuples/records in the relation.
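The statement is not shown in the source; a plausible form is −
grunt> student_count = FOREACH student_group_all GENERATE COUNT_STAR(student_details.gpa);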
Verification
Verify the relation student_count using the DUMP operator as shown below.
grunt> Dump student_count;
Output
It will produce the following output, displaying the contents of the relation student_count.
9
Since we have used the function COUNT_STAR(), it included the null tuple and returned 9.
Syntax
Given below is the syntax of the DIFF() function.
grunt> DIFF (expression, expression)
Example
Generally, the DIFF() function compares two bags in a tuple. Given below is an example where we create two relations, cogroup them, and calculate the difference between them.
Assume that we have two files namely emp_sales.txt and emp_bonus.txt in the HDFS directory /pig_data/ as shown below.
The emp_sales.txt contains the details of the employees of the sales department and the emp_bonus.txt contains the employee
details who got bonus.
emp_sales.txt
1,Robin,22,25000,sales
2,BOB,23,30000,sales
3,Maya,23,25000,sales
4,Sara,25,40000,sales
5,David,23,45000,sales
6,Maggy,22,35000,sales
emp_bonus.txt
1,Robin,22,25000,sales
2,Jaya,23,20000,admin
3,Maya,23,25000,sales
4,Alia,25,50000,admin
5,David,23,45000,sales
6,Omar,30,30000,admin
And we have loaded these files into Pig, with the relation names emp_sales and emp_bonus respectively.
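The LOAD statements are not shown in the source; plausible versions, with the schema inferred from the data, are −
grunt> emp_sales = LOAD 'hdfs://localhost:9000/pig_data/emp_sales.txt' USING PigStorage(',')
   as (sno:int, name:chararray, age:int, salary:int, dept:chararray);
grunt> emp_bonus = LOAD 'hdfs://localhost:9000/pig_data/emp_bonus.txt' USING PigStorage(',')
   as (sno:int, name:chararray, age:int, salary:int, dept:chararray);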
Group the records/tuples of the relations emp_sales and emp_bonus with the key sno, using the COGROUP operator as shown
below.
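The grouping statement is not shown in the source; given the output below, it would be −
grunt> cogroup_data = COGROUP emp_sales BY sno, emp_bonus BY sno;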
Verify the relation cogroup_data using the DUMP operator as shown below.
grunt> Dump cogroup_data;
(1,{(1,Robin,22,25000,sales)},{(1,Robin,22,25000,sales)})
(2,{(2,BOB,23,30000,sales)},{(2,Jaya,23,20000,admin)})
(3,{(3,Maya,23,25000,sales)},{(3,Maya,23,25000,sales)})
(4,{(4,Sara,25,40000,sales)},{(4,Alia,25,50000,admin)})
(5,{(5,David,23,45000,sales)},{(5,David,23,45000,sales)})
(6,{(6,Maggy,22,35000,sales)},{(6,Omar,30,30000,admin)})
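Let us now compute the difference between the two bags of each group. The statement is not shown in the source; given the output below, it would be −
grunt> diff_data = FOREACH cogroup_data GENERATE DIFF(emp_sales, emp_bonus);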
Verify the relation diff_data using the DUMP operator as shown below.
grunt> Dump diff_data;
({})
({(2,BOB,23,30000,sales),(2,Jaya,23,20000,admin)})
({})
({(4,Sara,25,40000,sales),(4,Alia,25,50000,admin)})
({})
({(6,Maggy,22,35000,sales),(6,Omar,30,30000,admin)})
The diff_data relation will have an empty bag if the records in emp_bonus and emp_sales match. In other cases, it will hold tuples from both the relations (tuples that differ).
For example, if you consider the records having sno as 1, then you will find them same in both the relations
((1,Robin,22,25000,sales), (1,Robin,22,25000,sales)). Therefore, in the diff_data relation, which is the result of DIFF() function,
you will get an empty tuple for sno 1.
Syntax
Given below is the syntax of the IsEmpty() function.
grunt> IsEmpty(expression)
Example
Assume that we have two files namely emp_sales.txt and emp_bonus.txt in the HDFS directory /pig_data/ as shown below.
The emp_sales.txt contains the details of the employees of the sales department and the emp_bonus.txt contains the employee
details who got bonus.
emp_sales.txt
1,Robin,22,25000,sales
2,BOB,23,30000,sales
3,Maya,23,25000,sales
4,Sara,25,40000,sales
5,David,23,45000,sales
6,Maggy,22,35000,sales
emp_bonus.txt
1,Robin,22,25000,sales
2,Jaya,23,20000,admin
3,Maya,23,25000,sales
4,Alia,25,50000,admin
5,David,23,45000,sales
6,Omar,30,30000,admin
And we have loaded these files into Pig, with the relation names emp_sales and emp_bonus respectively, as shown below.
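The LOAD statements are not shown in the source; plausible versions, with the schema inferred from the data, are −
grunt> emp_sales = LOAD 'hdfs://localhost:9000/pig_data/emp_sales.txt' USING PigStorage(',')
   as (sno:int, name:chararray, age:int, salary:int, dept:chararray);
grunt> emp_bonus = LOAD 'hdfs://localhost:9000/pig_data/emp_bonus.txt' USING PigStorage(',')
   as (sno:int, name:chararray, age:int, salary:int, dept:chararray);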
Let us now group the records/tuples of the relations emp_sales and emp_bonus with the key age, using the cogroup operator as
shown below.
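The grouping statement is not shown in the source; given the output below, it would be −
grunt> cogroup_data = COGROUP emp_sales BY age, emp_bonus BY age;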
Verify the relation cogroup_data using the DUMP operator as shown below.
grunt> Dump cogroup_data;
(22,{(6,Maggy,22,35000,sales),(1,Robin,22,25000,sales)}, {(1,Robin,22,25000,sales)})
(23,{(5,David,23,45000,sales),(3,Maya,23,25000,sales),(2,BOB,23,30000,sales)},
{(5,David,23,45000,sales),(3,Maya,23,25000,sales),(2,Jaya,23,20000,admin)})
(25,{(4,Sara,25,40000,sales)},{(4,Alia,25,50000,admin)})
(30,{},{(6,Omar,30,30000,admin)})
The COGROUP operator groups the tuples from each relation according to age. Each group depicts a particular age value.
For example, if we consider the 1st tuple of the result, it is grouped by age 22. And it contains two bags, the first bag holds all the tuples from the first relation (emp_sales in this case) having age 22, and the second bag contains all the tuples from the second relation (emp_bonus in this case) having age 22. In case a relation doesn't have tuples having the age value 22, it returns an empty bag.
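Let us now filter out the groups whose emp_sales bag is empty using the IsEmpty() function. The statement is not shown in the source; given the output below, it would be −
grunt> isempty_data = FILTER cogroup_data BY IsEmpty(emp_sales);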
Verification
Verify the relation isempty_data using the DUMP operator as shown below. It displays the groups for which the emp_sales bag is empty.
grunt> Dump isempty_data;
(30,{},{(6,Omar,30,30000,admin)})
Note −
To get the global maximum value, we need to perform a Group All operation, and calculate the maximum value using
the MAX() function.
To get the maximum value of a group, we need to group it using the Group By operator and proceed with the maximum
function.
Syntax
Given below is the syntax of the Max() function.
grunt> Max(expression)
Example
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad,89
002,siddarth,Battacharya,22,9848022338,Kolkata,78
003,Rajesh,Khanna,22,9848022339,Delhi,90
004,Preethi,Agarwal,21,9848022330,Pune,93
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar,75
006,Archana,Mishra,23,9848022335,Chennai,87
007,Komal,Nayak,24,9848022334,trivendram,83
008,Bharathi,Nambiayar,24,9848022333,Chennai,72
And we have loaded this file into Pig with the relation name student_details as shown below.
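The statements are not shown in the source; plausible versions, with a gpa column added to match the data, are −
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray, gpa:int);
grunt> student_group_all = GROUP student_details All;
Dumping student_group_all then displays the following.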
(all,{(8,Bharathi,Nambiayar,24,9848022333,Chennai,72),
(7,Komal,Nayak,24,9848022334,trivendram,83),
(6,Archana,Mishra,23,9848022335,Chennai,87),
(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar,75),
(4,Preethi,Agarwal,21,9848022330,Pune,93),
(3,Rajesh,Khanna,22,9848022339,Delhi,90),
(2,siddarth,Battacharya,22,9848022338,Kolkata,78),
(1,Rajiv,Reddy,21,9848022337,Hyderabad,89)})
Let us now calculate the global maximum of GPA, i.e., maximum among the GPA values of all the students using
the MAX() function as shown below.
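The statement is not shown in the source; a plausible form, generating the names and gpa values alongside the maximum, is −
grunt> student_gpa_max = FOREACH student_group_all GENERATE (student_details.firstname, student_details.gpa), MAX(student_details.gpa);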
Verification
Verify the relation student_gpa_max using the DUMP operator as shown below.
grunt> Dump student_gpa_max;
Output
It will produce the following output, displaying the contents of the relation student_gpa_max.
(({(Bharathi),(Komal),(Archana),(Trupthi),(Preethi),(Rajesh),(siddarth),(Rajiv)},{(72),(83),(87),(75),(93),(90),(78),(89)}),93)
To get the global minimum value, we need to perform a Group All operation, and calculate the minimum value using
the MIN() function.
To get the minimum value of a group, we need to group it using the Group By operator and proceed with the minimum
function.
Syntax
Given below is the syntax of the MIN() function.
grunt> MIN(expression)
Example
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad,89
002,siddarth,Battacharya,22,9848022338,Kolkata,78
003,Rajesh,Khanna,22,9848022339,Delhi,90
004,Preethi,Agarwal,21,9848022330,Pune,93
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar,75
006,Archana,Mishra,23,9848022335,Chennai,87
007,Komal,Nayak,24,9848022334,trivendram,83
008,Bharathi,Nambiayar,24,9848022333,Chennai,72
And we have loaded this file into Pig with the relation named student_details as shown below.
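The statements are not shown in the source; plausible versions, with a gpa column added to match the data, are −
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray, gpa:int);
grunt> student_group_all = GROUP student_details All;
Dumping student_group_all then displays the following.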
(all,{(8,Bharathi,Nambiayar,24,9848022333,Chennai,72),
(7,Komal,Nayak,24,9848022334,trivendram,83),
(6,Archana,Mishra,23,9848022335,Chennai,87),
(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar,75),
(4,Preethi,Agarwal,21,9848022330,Pune,93),
(3,Rajesh,Khanna,22,9848022339,Delhi,90),
(2,siddarth,Battacharya,22,9848022338,Kolkata,78),
(1,Rajiv,Reddy,21,9848022337,Hyderabad,89)})
Let us now calculate the global minimum of GPA, i.e., the minimum among the GPA values of all the students, using
the MIN() function as shown below.
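Assuming the grouped relation student_group_all from above, a statement along these lines computes the minimum (the projected fields mirror the output shown in the next step):
grunt> student_gpa_min = FOREACH student_group_all GENERATE
   (student_details.firstname, student_details.gpa), MIN(student_details.gpa);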
Verification
Verify the relation student_gpa_min using the DUMP operator as shown below.
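grunt> Dump student_gpa_min;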
Output
It will produce the following output, displaying the contents of the relation student_gpa_min.
(({(Bharathi),(Komal),(Archana),(Trupthi),(Preethi),(Rajesh),(siddarth),(Rajiv)},
{(72),(83),(87),(75),(93),(90),(78),(89)}),72)
Syntax
Given below is the syntax of the PluckTuple() function. Here, expression1 is the column prefix (or regex pattern) to pluck by,
expression2 is the fields the defined function is applied to, and expression3 is an optional boolean flag indicating whether to
include the columns that match the prefix.
grunt> DEFINE pluck PluckTuple(expression1)
grunt> DEFINE pluck PluckTuple(expression1,expression3)
grunt> pluck(expression2)
Example
Assume that we have two files namely emp_sales.txt and emp_bonus.txt in the HDFS directory /pig_data/.
The emp_sales.txt file contains the details of the employees of the sales department, and the emp_bonus.txt file contains the
details of the employees who got a bonus.
emp_sales.txt
1,Robin,22,25000,sales
2,BOB,23,30000,sales
3,Maya,23,25000,sales
4,Sara,25,40000,sales
5,David,23,45000,sales
6,Maggy,22,35000,sales
emp_bonus.txt
1,Robin,22,25000,sales
2,Jaya,23,20000,admin
3,Maya,23,25000,sales
4,Alia,25,50000,admin
5,David,23,45000,sales
6,Omar,30,30000,admin
And we have loaded these files into Pig, with the relation names emp_sales and emp_bonus respectively.
Join these two relations using the join operator as shown below.
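A join along these lines produces the result shown below (the join key sno and the relation name join_data are assumptions, consistent with the cogroup example later in this chapter):
grunt> join_data = JOIN emp_sales BY sno, emp_bonus BY sno;
grunt> Dump join_data;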
(1,Robin,22,25000,sales,1,Robin,22,25000,sales)
(2,BOB,23,30000,sales,2,Jaya,23,20000,admin)
(3,Maya,23,25000,sales,3,Maya,23,25000,sales)
(4,Sara,25,40000,sales,4,Alia,25,50000,admin)
(5,David,23,45000,sales,5,David,23,45000,sales)
(6,Maggy,22,35000,sales,6,Omar,30,30000,admin)
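To keep only the columns that come from the emp_sales side of the join, PluckTuple() can be defined with that alias prefix and applied to the joined tuples, along these lines (a sketch; the relation name pluck_data is an assumption):
grunt> DEFINE pluck PluckTuple('emp_sales::');
grunt> pluck_data = FOREACH join_data GENERATE FLATTEN(pluck(*));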
Syntax
Given below is the syntax of the SIZE() function.
grunt> SIZE(expression)
The return value varies according to the data type in Apache Pig.
int, long, float, double − For all these types, the SIZE() function returns 1.
chararray − For a chararray, the SIZE() function returns the number of characters in it.
bytearray − For a bytearray, the SIZE() function returns the number of bytes in it.
Tuple − For a tuple, the SIZE() function returns the number of fields in the tuple.
Bag − For a bag, the SIZE() function returns the number of tuples in the bag.
Map − For a map, the SIZE() function returns the number of key/value pairs in the map.
Example
Assume that we have a file named employee.txt in the HDFS directory /pig_data/ as shown below.
employee.txt
1,John,2007-01-24,250
2,Ram,2007-05-27,220
3,Jack,2007-05-06,170
3,Jack,2007-04-06,100
4,Jill,2007-04-06,220
5,Zara,2007-06-06,300
5,Zara,2007-02-06,350
And we have loaded this file into Pig with the relation name employee_data as shown below.
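A load statement along these lines creates the relation (a sketch; the HDFS path and the schema field names are assumptions):
grunt> employee_data = LOAD 'hdfs://localhost:9000/pig_data/employee.txt' USING PigStorage(',')
   as (id:int, name:chararray, workdate:chararray, daily_typing_pages:int);
Let us now calculate the size of the name column using the SIZE() function (the relation name size is taken from the verification step below):
grunt> size = FOREACH employee_data GENERATE SIZE(name);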
Verification
Verify the relation size using the DUMP operator as shown below.
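grunt> Dump size;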
Output
It will produce the following output, displaying the contents of the relation size. In this example, we have calculated the
size of the name column. Since it is of chararray type, the SIZE() function gives the number of characters in the name of each
employee.
(4)
(3)
(4)
(4)
(4)
(4)
(4)
Syntax
Given below is the syntax of the SUBTRACT() function.
grunt> SUBTRACT(expression, expression)
Example
Assume that we have two files namely emp_sales.txt and emp_bonus.txt in the HDFS directory /pig_data/ as shown below.
The emp_sales.txt file contains the details of the employees of the sales department, and the emp_bonus.txt file contains the
details of the employees who got a bonus.
emp_sales.txt
1,Robin,22,25000,sales
2,BOB,23,30000,sales
3,Maya,23,25000,sales
4,Sara,25,40000,sales
5,David,23,45000,sales
6,Maggy,22,35000,sales
emp_bonus.txt
1,Robin,22,25000,sales
2,Jaya,23,20000,admin
3,Maya,23,25000,sales
4,Alia,25,50000,admin
5,David,23,45000,sales
6,Omar,30,30000,admin
And we have loaded these files into Pig, with the relation names emp_sales and emp_bonus respectively.
Let us now group the records/tuples of the relations emp_sales and emp_bonus with the key sno, using the COGROUP operator
as shown below.
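grunt> cogroup_data = COGROUP emp_sales BY sno, emp_bonus BY sno;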
Verify the relation cogroup_data using the DUMP operator as shown below.
grunt> Dump cogroup_data;
(1,{(1,Robin,22,25000,sales)},{(1,Robin,22,25000,sales)})
(2,{(2,BOB,23,30000,sales)},{(2,Jaya,23,20000,admin)})
(3,{(3,Maya,23,25000,sales)},{(3,Maya,23,25000,sales)})
(4,{(4,Sara,25,40000,sales)},{(4,Alia,25,50000,admin)})
(5,{(5,David,23,45000,sales)},{(5,David,23,45000,sales)})
(6,{(6,Maggy,22,35000,sales)},{(6,Omar,30,30000,admin)})
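Let us now subtract the tuples of the emp_bonus relation from the emp_sales relation using the SUBTRACT() function as shown below (the relation name sub_data is taken from the verification step that follows):
grunt> sub_data = FOREACH cogroup_data GENERATE SUBTRACT(emp_sales, emp_bonus);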
Verification
Verify the relation sub_data using the DUMP operator as shown below. The sub_data relation holds the tuples of emp_sales that
are not present in the relation emp_bonus.
grunt> Dump sub_data;
({})
({(2,BOB,23,30000,sales)})
({})
({(4,Sara,25,40000,sales)})
({})
({(6,Maggy,22,35000,sales)})
In the same way, let us subtract the emp_sales relation from the emp_bonus relation as shown below.
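grunt> sub_data = FOREACH cogroup_data GENERATE SUBTRACT(emp_bonus, emp_sales);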
Verify the contents of the sub_data relation using the Dump operator as shown below.
grunt> Dump sub_data;
({})
({(2,Jaya,23,20000,admin)})
({})
({(4,Alia,25,50000,admin)})
({})
({(6,Omar,30,30000,admin)})
Note −
To get the global sum value, we need to perform a Group All operation, and calculate the sum value using the SUM()
function.
To get the sum value of each group, we need to group the relation using the Group By operator and then calculate the sum of
each group using the SUM() function.
Syntax
Given below is the syntax of the SUM() function.
grunt> SUM(expression)
Example
Assume that we have a file named employee.txt in the HDFS directory /pig_data/ as shown below.
employee.txt
1,John,2007-01-24,250
2,Ram,2007-05-27,220
3,Jack,2007-05-06,170
3,Jack,2007-04-06,100
4,Jill,2007-04-06,220
5,Zara,2007-06-06,300
5,Zara,2007-02-06,350
And we have loaded this file into Pig with the relation name employee_data as shown below.
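A load statement along these lines creates the relation (a sketch; the HDFS path and the schema field names are assumptions):
grunt> employee_data = LOAD 'hdfs://localhost:9000/pig_data/employee.txt' USING PigStorage(',')
   as (id:int, name:chararray, workdate:chararray, daily_typing_pages:int);
Grouping this relation with the Group All operator and dumping it produces the bag shown below (the relation name employee_group_all is an assumption):
grunt> employee_group_all = GROUP employee_data ALL;
grunt> Dump employee_group_all;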
(all,{(5,Zara,2007-02-06,350),
(5,Zara,2007-06-06,300),
(4,Jill,2007-04-06,220),
(3,Jack,2007-04-06,100),
(3,Jack,2007-05-06,170),
(2,Ram,2007-05-27,220),
(1,John,2007-01-24,250)})
Let us now calculate the global sum of the pages typed daily, using the SUM() function as shown below.
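Assuming the grouped relation employee_group_all from above, a statement along these lines computes the sum (the projected fields mirror the output shown in the next step):
grunt> student_workpages_sum = FOREACH employee_group_all GENERATE
   (employee_data.name, employee_data.daily_typing_pages), SUM(employee_data.daily_typing_pages);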
Verification
Verify the relation student_workpages_sum using the DUMP operator as shown below.
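grunt> Dump student_workpages_sum;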
Output
It will produce the following output, displaying the contents of the relation student_workpages_sum as follows.
(({(Zara),(Zara),(Jill),(Jack),(Jack),(Ram),(John)},
{(350),(300),(220),(100),(170),(220),(250)}),1610)
Syntax
Given below is the syntax of the TOKENIZE() function.
grunt> TOKENIZE(expression [, 'field_delimiter'])
If no field delimiter is supplied, the TOKENIZE() function splits the string on its default delimiters: space [ ], double quote ["],
comma [,], parentheses [()], and star [*].
Example
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below. This file contains the
details of a student like id, name, age and city. If we closely observe, the name of the student includes first and last names
separated by space [ ].
student_details.txt
001,Rajiv Reddy,21,Hyderabad
002,siddarth Battacharya,22,Kolkata
003,Rajesh Khanna,22,Delhi
004,Preethi Agarwal,21,Pune
005,Trupthi Mohanthy,23,Bhuwaneshwar
006,Archana Mishra,23,Chennai
007,Komal Nayak,24,trivendram
008,Bharathi Nambiayar,24,Chennai
We have loaded this file into Pig with the relation name student_details as shown below.
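A load statement along these lines creates the relation (a sketch; the HDFS path and the schema field names are assumptions):
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, city:chararray);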
Tokenizing a String
We can use the TOKENIZE() function to split a string. As an example, let us split the name field using this function as shown below.
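A statement along these lines performs the split (the relation name student_name_tokenize is taken from the verification step below):
grunt> student_name_tokenize = FOREACH student_details GENERATE TOKENIZE(name);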
Verification
Verify the relation student_name_tokenize using the DUMP operator as shown below.
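grunt> Dump student_name_tokenize;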
Output
It will produce the following output, displaying the contents of the relation student_name_tokenize as follows.
({(Rajiv),(Reddy)})
({(siddarth),(Battacharya)})
({(Rajesh),(Khanna)})
({(Preethi),(Agarwal)})
({(Trupthi),(Mohanthy)})
({(Archana),(Mishra)})
({(Komal),(Nayak)})
({(Bharathi),(Nambiayar)})
Other Delimiters
In the same way, in addition to space [ ], the TOKENIZE() function accepts double quote ["], comma [,], parentheses [()], and
star [*] as delimiters.
Example
Suppose there is a file named details.txt with student details like id, name, age, and city. Under the name column, this file
contains the first name and the last name of the students separated by various delimiters as shown below.
details.txt
001,"siddarth""Battacharya",22,Kolkata
002,Rajesh*Khanna,22,Delhi
003,(Preethi)(Agarwal),21,Pune
We have loaded this file into Pig with the relation name details as shown below.
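A load statement along these lines creates the relation (a sketch; the HDFS path and the schema field names are assumptions):
grunt> details = LOAD 'hdfs://localhost:9000/pig_data/details.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, city:chararray);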
Now, try to separate the first name and the last name of the students using TOKENIZE() as follows.
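Since these characters are among the default delimiters, TOKENIZE() can be applied without a field delimiter argument (the relation name tokenize_data is taken from the verification step below):
grunt> tokenize_data = FOREACH details GENERATE TOKENIZE(name);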
On verifying the tokenize_data relation using dump operator you will get the following result.
grunt> Dump tokenize_data;
({(siddarth),(Battacharya)})
({(Rajesh),(Khanna)})
({(Preethi),(Agarwal)})