Big Data Analytics - Project
Group Project
Submitted to
Prof. R K Jena
Submitted by
Introduction
The practice of gathering and storing large amounts of information for eventual analysis is ages old. However, a newer term with a similar meaning has come about: Big Data. In simple terms, big data is data that cannot be handled by a traditional RDBMS. Big data comes in large volumes, often measured in petabytes or zettabytes and beyond, and it may be in structured or unstructured formats. This makes such data complicated to manage. Yet the data has to be managed and analyzed to make predictions, analyze consumer behavior, and support better choices, among many other uses. Big data analytics is the method of analyzing this data, in which different tools are used to extract the desired results. Such tools include Hadoop and other vendor-specific products. Big data analytics is making life easier. Big data refers to larger, more complex data sets, especially from new data sources. These data sets are so voluminous that traditional data processing software simply cannot manage them. But these massive volumes of data can be used to address business problems that could not have been tackled before.
On a broad scale, data analytics technologies and techniques provide a means to analyze data sets and draw conclusions from them, helping organizations make informed business decisions. Business intelligence (BI) queries answer basic questions about business operations and performance. Big data analytics is a form of advanced analytics, which involves complex applications with elements such as predictive models, statistical algorithms and what-if analysis powered by high-performance analytics systems.
In the age of Facebook, Instagram and Twitter we cannot simply ignore these platforms. People post praise and negative criticism on social media without a second thought, so it becomes crucial to give it equal, if not more, importance. There is a lot of software available in the market for data analytics, with many services embedded in it. This has created a scare for the independent service providers who charge big firms a fortune for every service. For example, if a firm wants to extract data from a particular website and also use social media analytics, providers charge separately for each service. There are times when one service provider may not even offer the other analytics software; in that case the firm has to approach a completely different software company to get the job done. This creates multiple software clients, costs a fortune, and makes it troublesome to manage so many providers. Some companies spend over a billion dollars annually on such services. In 2010 the data analytics industry earned billions of dollars providing these services as separate offerings. Big data will continue to grow, and introducing more and more servers is not the best solution, as it only adds to a company's expenses. If only there were a single, compact solution for every need of every industry, the world would be a better place.
Sentiment Analysis:
A large airline company started monitoring tweets about its flights to see how customers felt about upgrades, new planes, entertainment, and so on. Nothing special there, except that the airline began feeding this information into its customer support platform and resolving issues in real time.
One memorable instance occurred when a customer tweeted negatively about lost luggage before boarding his connecting flight. The airline picked up the tweet and offered him a free first-class upgrade on the way back. It also tracked the luggage and told him where it was and where it would be delivered.
Needless to say, he was pretty shocked and tweeted like a happy camper throughout the rest of his trip.
Sentiment analysis is the analysis of the sentiment behind a piece of data. A basic task in sentiment analysis is classifying the polarity of a given text at the document, sentence, or feature/aspect level, that is, whether the opinion expressed in a document, a sentence or an entity feature/aspect is positive, negative, or neutral. Advanced, "beyond polarity" sentiment classification looks, for instance, at emotional states such as "angry," "sad," and "happy."
HDFS and Map-Reduce
Introduction
HDFS and MapReduce are Hadoop's two main parts: HDFS covers the 'infrastructural' perspective and MapReduce the 'programming' perspective. Although HDFS is currently an Apache Hadoop sub-project, it was originally created as web search engine infrastructure for the Apache Nutch project.
The primary data storage system used by Hadoop applications is the Hadoop Distributed File System (HDFS). It uses a NameNode and DataNode architecture to implement a distributed file system that provides high-performance data access across highly scalable Hadoop clusters.
HDFS is a main component of many Hadoop ecosystem technologies, as it offers a reliable way to manage pools of big data and to support the associated big data analytics applications.
HDFS replicates each data block across several nodes; this ensures that processing can continue while data is recovered. HDFS uses a master/slave architecture. In its original incarnation, each Hadoop cluster consisted of a single NameNode, which managed file system operations, and supporting DataNodes, which managed data storage on individual compute nodes. Together, the HDFS components support applications that work with very large data sets.
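As a quick illustration of how applications interact with this storage layer, the basic HDFS shell commands below create a directory, copy a file in, and read it back. The paths and file name are placeholders used only for this sketch, not locations used elsewhere in this project.

hadoop fs -mkdir -p /user/demo             # create a directory in HDFS (placeholder path)
hadoop fs -put localfile.csv /user/demo    # copy a local file into HDFS
hadoop fs -ls /user/demo                   # list the directory contents
hadoop fs -cat /user/demo/localfile.csv    # read the file back from HDFS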
Features of HDFS
Goals of HDFS
Fault detection and recovery - Because HDFS runs on a large amount of commodity hardware, component failure is common. HDFS should therefore have mechanisms for rapid, automatic fault detection and recovery.
Huge datasets - To handle applications with enormous datasets, HDFS should scale to hundreds of nodes per cluster.
Hardware at data - A requested task can be performed efficiently when the computation takes place close to the data. Especially where large data sets are involved, this reduces network traffic and increases throughput.
MapReduce
MapReduce is a processing technique and a programming model for Java-based distributed computing. The MapReduce algorithm contains two significant tasks, namely Map and Reduce. The map task takes a set of data and converts it into another set of data, in which individual elements are broken down into tuples (key/value pairs). The reduce task takes the output of a map as its input and combines those data tuples into a smaller set of tuples. As the name MapReduce implies, the reduce task is always performed after the map job. MapReduce's main benefit is that data processing is simple to scale across many computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers. Decomposing a data processing application into mappers and reducers is sometimes non-trivial. However, once we write an application in the MapReduce form, scaling the application to run over hundreds, thousands, or even tens of thousands of machines in a cluster is simply a change in configuration. This easy scalability has drawn many programmers to the MapReduce model.
The whole computation is broken down into the phases of mapping, shuffling and reducing.
Mapping Stage: This is MapReduce's first phase and involves reading data from the Hadoop Distributed File System (HDFS). The input may be a file or a directory. The input data file is fed to the mapper function one line at a time. The mapper then processes the data and reduces it into smaller blocks of data.
Reducing Stage: The reducer stage can consist of multiple steps. During the shuffling phase, the data is transferred from the mapper to the reducer. Without successful shuffling of the data there would be no input to the reducer phase; the shuffling can, however, begin even before the mapping is complete. Next, the data is sorted to reduce the time taken by the reduce step. Sorting effectively helps the reduce process by giving a cue when the next key in the sorted input differs from the previous key. The reduce task takes a specific key-value pair as input and calls the reduce function on it. The reducer's output can be stored directly in HDFS.
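To make the map, shuffle and reduce stages concrete, the classic word-count job is sketched below in Java, the language the MapReduce model is normally programmed in. This is only an illustrative sketch based on the standard Hadoop MapReduce API; the class names WordCount, TokenizerMapper and IntSumReducer are placeholders and are not part of this project's code.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapping stage: each input line is split into words and emitted as (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducing stage: after shuffling and sorting, each word arrives together with all of
  // its counts, which are summed into a single (word, total) pair.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}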
Hive
Introduction
Hive is a data warehouse infrastructure tool for processing structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and it makes querying and analyzing the data simple.
Hive is open-source software that lets programmers analyze Hadoop's big data sets. In the business intelligence sector, the volume of data being gathered and analyzed is increasing, making traditional data warehousing solutions more costly. Hadoop with the MapReduce framework is used as an alternative for analyzing data sets of enormous size.
Although Hadoop has proven helpful for working on enormous data sets, its MapReduce framework is very low level, requiring programmers to write custom programs that are hard to maintain and reuse. This is where Hive comes to the programmers' rescue.
Hive offers a declarative language similar to SQL, called HiveQL, which is used to express queries. Using the SQL-like HiveQL, users can readily conduct data analysis. These queries are compiled by the Hive engine into MapReduce jobs to be executed on Hadoop. Additionally, it is also possible to plug custom MapReduce scripts into queries.
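For example, a single HiveQL statement such as the one below is compiled by the Hive engine into one or more MapReduce jobs behind the scenes. The table and column names (sales, region, amount) are hypothetical and used only for illustration.

-- total sales per region; the GROUP BY becomes the shuffle/reduce phase of a MapReduce job
select region, sum(amount) as total_sales
from sales
group by region;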
Hive has three primary tasks: summarizing, querying and analyzing data. It supports queries expressed in a language called HiveQL, or HQL, a declarative SQL-like language that, in its first incarnation, automatically translated SQL-style queries into MapReduce tasks executed on the Hadoop platform. Additionally, HiveQL supported plugging custom MapReduce scripts into queries.
When SQL queries are submitted via Hive, they are first received by a driver component that creates session handles and forwards the request, via Java Database Connectivity / Open Database Connectivity interfaces, to a compiler, which eventually produces jobs for execution. Hive supports data serialization/deserialization and improves the flexibility of schema design by including a system catalog called the Hive Metastore.
Later releases built on HiveQL and the Hive engine by adding support for distributed query execution via Apache Tez and Apache Spark.
Early Hive file support consisted of text files (also known as flat files), SequenceFiles (flat files composed of binary key/value pairs) and Record Columnar Files (RCFiles), which store table rows in a columnar fashion. Hive's columnar storage support has since grown to include Optimized Row Columnar (ORC) files and Parquet files.
Since its beginnings, Hive's execution speed and interactivity have been a subject of attention, because query results lagged behind those of more familiar SQL engines. Apache Hive committers started work on the Stinger project in 2013 to increase efficiency, bringing Apache Tez and its directed acyclic graph processing to the warehouse system.
Uses of Hive:
1. Apache Hive provides distributed storage for large datasets.
2. Using Hive, we can access files stored in the Hadoop Distributed File System (HDFS, used to query and manage large datasets residing in distributed storage) or in other data storage systems such as Apache HBase.
Limitations of Hive
1. Hive is not intended for online transaction processing (OLTP); it is used only for online analytical processing.
2. Hive supports overwriting or appending data, but it does not support updating and deleting data.
3. Sub-queries are not supported in Hive.
Hive-QL is a declarative language like SQL, whereas Pig Latin is a data flow language. Pig is a language and environment for exploring very large datasets through data flows; Hive is a distributed data warehouse.
Hive Commands:
Data Definition Language (DDL)
DDL statements are used to build and modify the tables and other objects in the database.
Example:
CREATE, DROP, TRUNCATE, ALTER, SHOW, DESCRIBE Statements.
Go to the Hive shell by giving the command sudo hive and enter the command 'create database <database name>;' to create a new database in Hive.
To list the databases in the Hive warehouse, enter the command 'show databases;'.
The command to use a database is 'use <database name>;'.
The describe command provides information about the schema of a table.
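Put together, the DDL statements described above might look like the following in the Hive shell; the database and table names here are placeholders, not objects used in the case study below.

-- create a database and switch to it
create database studentdb;
show databases;
use studentdb;

-- create a table, inspect its schema, then drop it
create table student (roll_no int, name string, score int)
row format delimited fields terminated by ',' stored as textfile;
describe student;
drop table student;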
Data Manipulation Language (DML)
DML statements are used to retrieve, store, modify, delete, insert and update data in the
database.
Example :
LOAD, INSERT Statements.
Syntax:
LOAD DATA [LOCAL] INPATH '<file path>' INTO TABLE <table name>;
Insert Command:
The insert command is used to load data into a Hive table. Inserts can be done into a table or a partition.
• INSERT OVERWRITE is used to overwrite the existing data in the table or partition.
• INSERT INTO is used to append data to the existing data in a table. (Note: the INSERT INTO syntax is available from version 0.8.)
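A minimal sketch of the LOAD and INSERT statements discussed above, assuming the placeholder tables student and top_students from the DDL sketch exist and have compatible schemas:

-- load a local file into a Hive table
load data local inpath '/home/user/students.csv' into table student;

-- replace the contents of another table with a query result
insert overwrite table top_students
select roll_no, name, score from student where score > 90;

-- append rows to the existing data (INSERT INTO is available from Hive 0.8)
insert into table top_students
select roll_no, name, score from student where score = 90;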
Hive Case study
Questions:
1. Find the performance matrix of all the players based on ID.
2. Find the sum of the total wages of all players.
3. Find the scope for improvement to the potential score for each player.
4. Find the players with 5-star skill moves.
5. Find the body mass index for each player.
6. Find the count of players with nationality Brazil.
7. Find the total value of players belonging to nationality France.
8. How many distinct countries have players playing football?
9. What is the average wage of a football player?
10. Find 10 distinct clubs for the top-value players.
List of Queries:
Create table Fifa1 (id int, name string, foot string, position string, age int, overall int, potential int, rep int, skills int, height double, weight double) row format delimited fields terminated by ',' lines terminated by '\n' stored as textfile;
Create table Fifa2 (id int, name string, age int, nationality string, club string, value double, wage int, contract string, clause double) row format delimited fields terminated by ',' lines terminated by '\n' stored as textfile;
set hive.cli.print.header=true;
hadoop fs -put Fifa1 /user/Faizan
hadoop fs -ls /user/Faizan
hadoop fs -put Fifa2 /user/Faizan
hadoop fs -ls /user/Faizan
Load data local inpath '/home/cloudera/Faizaqn/Hive/Fifa19 Bigdata Dataset.csv' overwrite into table Fifa1;
Load data local inpath '/home/cloudera/Faizaqn/Hive/Fifa19 Bigdata Dataset2.csv' overwrite into table Fifa2;
Select * from Fifa1;
Select * from Fifa2;
3. Scope for improvement (Potential - Overall) for each player:
ID Overall Potential Difference
158023 94 94 0
20801 94 94 0
190871 92 93 1
193080 91 93 2
192985 91 92 1
183277 91 91 0
177003 91 91 0
176580 91 91 0
155862 91 91 0
200389 90 93 3
188545 90 90 0
182521 90 90 0
182493 90 90 0
168542 90 90 0
215914 89 90 1
211110 89 94 5
202126 89 91 2
194765 89 90 1
192448 89 92 3
192119 89 90 1
189511 89 89 0
179813 89 89 0
167495 89 89 0
153079 89 89 0
138956 89 89 0
231747 88 95 7
209331 88 89 1
200145 88 90 2
4. Players with 5-star skill moves:
ID      Name             Foot   Skill Moves
204485  R. Mahrez        Left   5
41236   Z. Ibrahimović   Right  5
202556  M. Depay         Right  5
193082  J. Cuadrado      Right  5
183898  A. Di María      Left   5
20775   Quaresma         Right  5
213345  K. Coman         Right  5
208808  Q. Promes        Right  5
156616  F. Ribéry        Right  5
227055  Gelson Martins   Right  5
212404  F. Bernardeschi  Left   5
198717  W. Zaha          Right  5
5. Select id, name, (weight*0.453592)/((height*0.3048)*(height*0.3048)) as BMI, height*0.3048 as h1, weight*0.453592 as w1 from Fifa1;
ID Name BMI h1 (m) w1 (kg)
158023 L. Messi 24.66438 1.71 72.12
20801 Cristiano Ronaldo 23.99333 1.86 83.01
190871 Neymar Jr 21.71751 1.77 68.04
193080 De Gea 20.67151 1.92 76.20
192985 K. De Bruyne 29.72363 1.53 69.85
183277 E. Hazard 24.4205 1.74 73.94
177003 L. Modrić 21.87357 1.74 66.22
176580 L. Suárez 26.59953 1.80 86.18
155862 Sergio Ramos 25.33955 1.80 82.10
200389 J. Oblak 25.17333 1.86 87.09
188545 R. Lewandowski 24.63957 1.80 79.83
182521 T. Kroos 23.51959 1.80 76.20
182493 D. Godín 22.55111 1.86 78.02
168542 David Silva 22.17321 1.74 67.13
215914 N. Kanté 25.55312 1.68 72.12
211110 P. Dybala 31.97175 1.53 74.84
202126 H. Kane 25.69778 1.86 88.90
194765 A. Griezmann 23.31013 1.77 73.03
192448 M. ter Stegen 24.51778 1.86 84.82
192119 T. Courtois 24.52849 1.98 96.16
189511 Sergio Busquets 22.02667 1.86 76.20
6. Count of players with nationality Brazil:
Brazil 738
7. select Fifa2.id, nationality, value from Fifa1 join Fifa2 on Fifa1.id = Fifa2.id where nationality = 'France';
ID Nationality Value
235456 France 600
231103 France 600
184763 France 600
240057 France 600
232117 France 600
244402 France 600
240050 France 600
200876 France 600
243627 France 1.1
177568 France 600
172952 France 600
228759 France 600
244117 France 600
228240 France 600
215914 France 63
225168 France 600
237198 France 600
194765 France 78
237708 France 1.1
244350 France 600
220030 France 600
225149 France 600
213368 France 600
209784 France 600
Select nationality, sum(value) from Fifa1 join Fifa2 on Fifa1.id = Fifa2.id where nationality = 'France' group by nationality;
France 100940.4
8. select nationality, count(id) from Fifa2 group by nationality;
Nationality   Count of Players
Albania 25
Algeria 54
Angola 11
Antigua & Barbuda 1
Argentina 681
Armenia 8
Australia 89
Austria 146
Azerbaijan 3
Barbados 1
Belarus 4
Belgium 184
Benin 11
Bermuda 1
Bolivia 17
Bosnia Herzegovina 44
Brazil 738
Bulgaria 17
Burkina Faso 13
Burundi 1
Cameroon 62
Canada 27
Cape Verde 19
Central African Rep. 3
Chad 2
Chile 222
China PR 84
Colombia 351
Comoros 4
Congo 10
Costa Rica 24
Croatia 93
10. select distinct club, value from Fifa1 join Fifa2 on Fifa1.id = Fifa2.id order by value desc limit 10;
Introduction to PIG
Apache Pig is a platform for analyzing large data sets as data flows. It is intended to provide an abstraction over MapReduce, reducing the complexity of writing a MapReduce program. With Apache Pig, we can very readily conduct data manipulation operations in Hadoop.
The features of Apache Pig are:
Pig allows programmers who do not know Java to write complex data transformations.
Apache Pig has two primary parts: the Pig Latin language and the Pig run-time environment in which Pig Latin programs are executed.
Pig provides a simple data flow language known as Pig Latin for Big Data analytics, with SQL-like functionality such as join, filter, limit, etc.
Developers who work with scripting languages and SQL leverage Pig Latin. This gives developers ease of programming with Apache Pig. Pig Latin provides a variety of built-in operators to read, write, and process large data sets, such as join, sort, filter, etc. (a short illustrative script follows this list). Thus it is evident that Pig has a rich set of operators.
Programmers write scripts using Pig Latin to analyze data, and these scripts are internally converted to Map and Reduce tasks by the Pig MapReduce engine. Before Pig, writing MapReduce tasks was the only way to process the data stored in HDFS.
If a programmer wants to write custom functions which are unavailable in Pig, Pig allows
them to write User Defined Functions (UDF) in any language of their choice like Java,
Python, Ruby, Jython, JRuby etc. and embed them in Pig script. This provides
extensibility to Apache Pig.
Pig can process any kind of data, i.e. structured, semi-structured or unstructured data,
coming from various sources.
Approximately 10 lines of Pig code are equivalent to about 200 lines of MapReduce code. Pig can also handle inconsistent schemas (in the case of unstructured data). Apache Pig extracts the data, performs operations on that data and dumps the data in the required format in HDFS, i.e. ETL (Extract, Transform, Load).
Apache Pig automatically optimizes the tasks before execution, i.e. automatic optimization. It allows programmers and developers to concentrate on the whole operation rather than creating mapper and reducer functions separately.
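As mentioned in the feature list above, the short Pig Latin script below illustrates the kind of data-flow operators Pig provides. The file path, relation names and fields are placeholders invented for this sketch.

-- load a hypothetical CSV file of orders
orders = load '/user/demo/orders.csv' using PigStorage(',') as (order_id:long, customer:chararray, amount:double);

-- filter, group and aggregate, then keep the five customers with the highest spend
big_orders = filter orders by amount > 100.0;
by_customer = group big_orders by customer;
totals = foreach by_customer generate group as customer, SUM(big_orders.amount) as total_spent;
ranked = order totals by total_spent desc;
top5 = limit ranked 5;

dump top5;                                    -- print the result to the screen
store top5 into '/user/demo/top_customers';   -- or save it to HDFS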
Apache Pig is used for analyzing and performing tasks involving ad-hoc processing. Apache
Pig is used:
Where we need to process huge data sets such as web logs, streaming online data, etc.
Where we need data processing for search platforms (different types of data need to be processed); for example, Yahoo uses Pig for 40 percent of its jobs, including news feeds and its search engine.
Where we need to process time-sensitive data loads. Here, data needs to be extracted and analyzed quickly; e.g. machine learning algorithms require time-sensitive data loads, as when Twitter needs to quickly extract data about user activities (i.e. tweets, re-tweets and likes), analyze the data to find patterns in user behavior, and then make recommendations immediately, such as trending tweets.
Apache Pig Tutorial: Architecture
For writing a Pig script we need the Pig Latin language, and to execute it we need an execution environment. The components of the Apache Pig architecture are described below.
We submit Pig scripts to the Apache Pig execution environment; the scripts can be written in Pig Latin using built-in operators.
There are three ways to execute a Pig script:
Grunt Shell: Pig's interactive shell, provided to execute all Pig scripts.
Script File: Write all the Pig commands in a script file and execute the Pig script file. This is executed by the Pig Server.
Embedded Script: If some functions are unavailable in the built-in operators, we can programmatically create User Defined Functions (UDFs) in other languages such as Java, Python or Ruby, embed them in the Pig Latin script file, and then execute that script file.
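For example, a script saved to a file (the file name below is a placeholder) can be run in local or MapReduce mode from the command line, while the same statements could be typed interactively in the Grunt shell:

pig -x local myscript.pig        # run the script against the local file system
pig -x mapreduce myscript.pig    # run the same script on the Hadoop cluster
pig                              # with no arguments, open the interactive Grunt shell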
Parser
After passing through the Grunt shell or Pig Server, Pig scripts are handed to the Parser. The Parser performs type checking and checks the syntax of the script, and outputs a DAG (directed acyclic graph). The DAG represents the Pig Latin statements and logical operators; the logical operators are represented as nodes and the data flows as edges.
Optimizer
Then the DAG is submitted to the optimizer. The optimizer performs optimization activities such as splitting, merging, transforming and reordering operators. This optimizer provides the automatic optimization feature to Apache Pig. The optimizer basically aims to reduce the amount of data in the pipeline at any instant of time while processing the extracted data, and for that it applies rules such as:
PushUpFilter: If there are multiple conditions in a filter and the filter can be split, Pig splits the conditions and pushes each condition up separately. Applying these conditions earlier helps reduce the number of records remaining in the pipeline.
PushDownForEachFlatten: Applies FLATTEN, which produces a cross product between a complex type such as a tuple or a bag and the other fields in the record, as late as possible in the plan. This keeps the number of records in the pipeline low.
ColumnPruner: Omits columns that are never used or not needed, reducing the size of the record. This can be applied after each operator, so that fields are pruned as aggressively as possible.
MapKeyPruner: Omits map keys that are never used, reducing the size of the record.
LimitOptimizer: If the LIMIT operator is applied immediately after a LOAD or ORDER operator, Pig converts the LOAD or ORDER operator into a limit-sensitive implementation, which does not require processing the whole data set. Applying the limit earlier reduces the amount of data.
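To give a feel for what these rules do, the fragment below (again with placeholder files and fields) contains a FILTER that the PushUpFilter rule can push above the join, and a LIMIT that makes the ORDER limit-sensitive via the LimitOptimizer rule:

players = load '/user/demo/players.csv' using PigStorage(',') as (id:long, name:chararray, country:chararray);
wages = load '/user/demo/wages.csv' using PigStorage(',') as (id:long, wage:double);
joined = join players by id, wages by id;
-- this condition only involves the players relation, so it can be applied before the join
french = filter joined by players::country == 'France';
-- because of the limit, the sort does not need to order the whole data set
ranked = order french by wages::wage desc;
top10 = limit ranked 10;
dump top10;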
Compiler
After the optimization process, the compiler compiles the optimized code into a series of MapReduce jobs. The compiler is responsible for automatically converting Pig jobs into MapReduce jobs.
Execution engine
Finally, these MapReduce jobs are submitted to the execution engine for execution. The MapReduce jobs are run and produce the required result. The result can be displayed on the screen using the DUMP statement and can be stored in HDFS using the STORE statement.
PIG Case study
Questions:
1. Display the Total Score and Roll No. of all students.
2. List Math_Score in descending order.
3. List the Roll No. of students who are male.
4. Display the parental education level of all students whose test preparation course is completed.
5. List the roll_no, writing_score and reading_score of all students who have a Total_Score of more than 180.
6. List the Total_Score of the students who take standard lunch.
7. Display the race, number of students, and maximum Total_Score of each race.
List of Queries:
A = load '/user/bapna/StudentScore_Pig.csv' using PigStorage(',') as (Roll_no:long, math_score:int, reading_score:int, writing_score:int, Total_Score:int);
Dump A;
B = load '/user/bapna/Students_Pig.csv' using PigStorage(',') as (Roll_no:long, gender:chararray, race:chararray, parental_level_of_education:chararray, lunch:chararray, test_preparation_course:chararray);
Dump B;
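The scripts for questions 1 to 3 are not reproduced in the original listing, only their outputs below. Based on those outputs, a plausible reconstruction would look roughly like the following sketch; the relation names C, D and E are our own assumption.

-- Q1) Total Score and Roll No. of all students
C = foreach A generate Roll_no, Total_Score;
Dump C;

-- Q2) Math_Score in descending order
D = order A by math_score desc;
D = foreach D generate math_score;
Dump D;

-- Q3) Roll No. of students who are male
E = filter B by gender == 'male';
E = foreach E generate Roll_no;
Dump E;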
Roll No  Total Score
11002 247
11003 278
11004 148
11005 229
11006 232
11007 275
11008 122
11009 195
11010 148
11011 164
Math Score
100
100
100
100
100
100
100
99
99
99
99
Roll No gender
11004 male
11005 male
11008 male
11009 male
11011 male
11012 male
11014 male
11017 male
11019 male
11021 male
11023 male
Q4) Display the parental education level of all the students whose test preparation course is completed?
F = foreach B generate Roll_no, parental_level_of_education, test_preparation_course;
F = filter F by test_preparation_course == 'completed';
DUMP F;
11047 associate's degree completed
11049 associate's degree completed
11050 high school completed
11052 associate's degree completed
Q5) List the roll_no, writing_score and reading_score of all the students who have a Total_Score of more than 180?
G = filter A by Total_Score > 180;
G = foreach G generate Roll_no, writing_score, reading_score;
DUMP G;
Q6) List the Total_Score of the students who take standard lunch?
H = JOIN A by Roll_no, B by Roll_no;
H = filter H by B::lunch == 'standard';
H = foreach H generate A::Roll_no, A::Total_Score, B::lunch;
DUMP H;
Roll No  Total Score  lunch
11001 218 standard
11002 247 standard
11003 278 standard
11005 229 standard
11006 232 standard
11007 275 standard
11011 164 standard
11012 135 standard
11013 219 standard
11014 220 standard
11015 161 standard
Q7) Display the race, number of students, and maximum Total_Score of each race?
I = JOIN A by Roll_no, B by Roll_no;
I = group I by B::race;
I = foreach I generate group, MAX(I.A::Total_Score) as Score, COUNT(I) as count;
DUMP I;
Conclusion & Learning:
The availability of Big Data, low-cost commodity hardware, and the latest information management and analytics software have created a unique moment in the history of data analysis. The convergence of these trends means that, for the first time, we have the capability to analyze enormous data sets rapidly and cost-effectively. These are neither theoretical nor trivial capabilities. They represent a genuine step forward and a clear opportunity to realize huge gains in efficiency, productivity, revenue, and profitability.
The era of Big Data analytics is here, and these are genuinely revolutionary times if business and technology experts keep working together and deliver on the promise.
Key learnings from this project:
The need for and importance of Big Data analytics in various business contexts.
Understanding the challenges of managing Big Data.
Use of Hive and Pig for finding key elements of a dataset.
Differences between Hive and Pig coding used to infer useful elements.
Finding relationships across different datasets at the same time.