Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
SlideShare a Scribd company logo
Hadoop (HiveQL)
&
Data Analysis using Pandas
Purna Chander Rao.Kathula
Agenda
● Introduction to Bigdata and Hadoop
● Understanding Hive and its components.
● Hive Architecture
● Use case of stackoverflow ( datascience.stackexchange.com).
● Reporting with Pandas
Data
Volumes ( KB, MB, GB, TB, PB …… )
● Structured
○ Tabular rows and columns ( Database) ( Supports GB’s ...)
○ DWH ( Tera Data systems) and BI ( Supports TB’s )
● Semi- structured
○ Excel, XML, Json, Logs and etc...
● Un Structured
○ Audio, Video, Image and etc...
BigData
Big Data - ( Problem ) ---------------> Hadoop - ( Solution )
BIG DATA
Storage
Processing
HDFS
MapReduce
Hadoop Architecture
Master - Slave Architecture
HDFS
Hadoop Distributed File System.
1. Data Replication. ( 3 times by default)
2. 64 mb Block size. ( Current windows 8 system is 4kb)
3. Unix Like commands but use - (hyphen) before the command.
Rack Awareness
Hadoop Services
Hadoop Services
1. NameNode
2. DataNode
3. Secondary Name Node
4. Job Tracker.
5. Task Tracker
Hive
● What is Hive?
● Hive is a data warehouse infrastructure built on top of hadoop that can compile SQL queries as
MapReduce Jobs
Hive is not
● A relational database
● A design for OnLine Transaction Processing (OLTP)
● A language for real-time queries and row-level updates
Features of Hive
● It stores schema in a database and processed data into HDFS.
● It is designed for OLAP.
● It provides SQL type language for querying called HiveQL or HQL.
● It is familiar, fast, scalable, and extensible.
How does Hive Work
● Hive is built on top of Hadoop
● Hive stores data in HDFS
● Hive is Schema on Read not on Write
● Hive compile SQL Queries into Mapreduce jobs and run the jobs in
Hadoop cluster
Hive Architecture
Create table Hive
Storing Schema in MySQL
Working with XML Files
Here i am taking the example of StackOverflow dataset called
Datascience.stackexchange.com
Create Table Users
Create Table Posts
Show Tables and load Data
Show Tables and load Data ( Contd..)
Joining 2 Tables (Users and Posts)
Output of Hive MR
Copy the output to local directory and rename it as results.csv , Now we load
the csv to Pandas for Data Analysis
Python Pandas
Python pandas is an open source library providing high-performance, easy-to-
use data structures and data analysis tools for the Python programming
language
Problem : The problem here is to find the top 10 users on
data.stackexchange .com
Loading data into iPython Notebook
Contd ..
The loaded data does not have any headers , so we include the headers by
using the names parameter as below.
Contd ..
Here df is the dataframe object which controls the entire dataset. We can
control each column by using the Series
Contd ..
Finding the top 10 Scores and their Users
Plotting Graph (“Bar”)
%matplotlib inline
ax = df.groupby(by = ['displayname', 'id'])['score'].sum().sort_values(ascending = False).head(10).
plot(kind='bar',figsize=(10,5), ylim = (50,500), title = "Top 10 users", grid = True, colormap='jet' )
ax.set_xlabel ( "DisplayNames and UsersID")
ax.set_ylabel ("Scores")
THANK YOU

More Related Content

Hive and data analysis using pandas

  • 1. Hadoop (HiveQL) & Data Analysis using Pandas Purna Chander Rao.Kathula
  • 2. Agenda ● Introduction to Bigdata and Hadoop ● Understanding Hive and its components. ● Hive Architecture ● Use case of stackoverflow ( datascience.stackexchange.com). ● Reporting with Pandas
  • 3. Data Volumes ( KB, MB, GB, TB, PB …… ) ● Structured ○ Tabular rows and columns ( Database) ( Supports GB’s ...) ○ DWH ( Tera Data systems) and BI ( Supports TB’s ) ● Semi- structured ○ Excel, XML, Json, Logs and etc... ● Un Structured ○ Audio, Video, Image and etc...
  • 5. Big Data - ( Problem ) ---------------> Hadoop - ( Solution ) BIG DATA Storage Processing HDFS MapReduce
  • 6. Hadoop Architecture Master - Slave Architecture
  • 7. HDFS Hadoop Distributed File System. 1. Data Replication. ( 3 times by default) 2. 64 mb Block size. ( Current windows 8 system is 4kb) 3. Unix Like commands but use - (hyphen) before the command.
  • 9. Hadoop Services Hadoop Services 1. NameNode 2. DataNode 3. Secondary Name Node 4. Job Tracker. 5. Task Tracker
  • 10. Hive ● What is Hive? ● Hive is a data warehouse infrastructure built on top of hadoop that can compile SQL queries as MapReduce Jobs Hive is not ● A relational database ● A design for OnLine Transaction Processing (OLTP) ● A language for real-time queries and row-level updates Features of Hive ● It stores schema in a database and processed data into HDFS. ● It is designed for OLAP. ● It provides SQL type language for querying called HiveQL or HQL. ● It is familiar, fast, scalable, and extensible.
  • 11. How does Hive Work ● Hive is built on top of Hadoop ● Hive stores data in HDFS ● Hive is Schema on Read not on Write ● Hive compile SQL Queries into Mapreduce jobs and run the jobs in Hadoop cluster
  • 15. Working with XML Files Here i am taking the example of StackOverflow dataset called Datascience.stackexchange.com
  • 18. Show Tables and load Data
  • 19. Show Tables and load Data ( Contd..)
  • 20. Joining 2 Tables (Users and Posts)
  • 21. Output of Hive MR Copy the output to local directory and rename it as results.csv , Now we load the csv to Pandas for Data Analysis
  • 22. Python Pandas Python pandas is an open source library providing high-performance, easy-to- use data structures and data analysis tools for the Python programming language Problem : The problem here is to find the top 10 users on data.stackexchange .com
  • 23. Loading data into iPython Notebook
  • 24. Contd .. The loaded data does not have any headers , so we include the headers by using the names parameter as below.
  • 25. Contd .. Here df is the dataframe object which controls the entire dataset. We can control each column by using the Series
  • 26. Contd .. Finding the top 10 Scores and their Users
  • 27. Plotting Graph (“Bar”) %matplotlib inline ax = df.groupby(by = ['displayname', 'id'])['score'].sum().sort_values(ascending = False).head(10). plot(kind='bar',figsize=(10,5), ylim = (50,500), title = "Top 10 users", grid = True, colormap='jet' ) ax.set_xlabel ( "DisplayNames and UsersID") ax.set_ylabel ("Scores")