Hive and data analysis using pandas

Hadoop (HiveQL)
&
Data Analysis using Pandas
Purna Chander Rao.Kathula

Agenda
● Introduction to Bigdata and Hadoop
● Understanding Hive and its components.
● Hive Architecture
● Use case of stackoverflow ( datascience.stackexchange.com).
● Reporting with Pandas

Data
Volumes ( KB, MB, GB, TB, PB …… )
● Structured
○ Tabular rows and columns ( Database) ( Supports GB’s ...)
○ DWH ( Tera Data systems) and BI ( Supports TB’s )
● Semi- structured
○ Excel, XML, Json, Logs and etc...
● Un Structured
○ Audio, Video, Image and etc...

Big Data - ( Problem ) ---------------> Hadoop - ( Solution )
BIG DATA
Storage
Processing
HDFS
MapReduce

Hadoop Architecture
Master - Slave Architecture

HDFS
Hadoop Distributed File System.
1. Data Replication. ( 3 times by default)
2. 64 mb Block size. ( Current windows 8 system is 4kb)
3. Unix Like commands but use - (hyphen) before the command.

Hadoop Services
Hadoop Services
1. NameNode
2. DataNode
3. Secondary Name Node
4. Job Tracker.
5. Task Tracker

Hive
● What is Hive?
● Hive is a data warehouse infrastructure built on top of hadoop that can compile SQL queries as
MapReduce Jobs
Hive is not
● A relational database
● A design for OnLine Transaction Processing (OLTP)
● A language for real-time queries and row-level updates
Features of Hive
● It stores schema in a database and processed data into HDFS.
● It is designed for OLAP.
● It provides SQL type language for querying called HiveQL or HQL.
● It is familiar, fast, scalable, and extensible.

How does Hive Work
● Hive is built on top of Hadoop
● Hive stores data in HDFS
● Hive is Schema on Read not on Write
● Hive compile SQL Queries into Mapreduce jobs and run the jobs in
Hadoop cluster

Working with XML Files
Here i am taking the example of StackOverflow dataset called
Datascience.stackexchange.com

Show Tables and load Data ( Contd..)

Joining 2 Tables (Users and Posts)

Output of Hive MR
Copy the output to local directory and rename it as results.csv , Now we load
the csv to Pandas for Data Analysis

Python Pandas
Python pandas is an open source library providing high-performance, easy-to-
use data structures and data analysis tools for the Python programming
language
Problem : The problem here is to find the top 10 users on
data.stackexchange .com

Loading data into iPython Notebook

Contd ..
The loaded data does not have any headers , so we include the headers by
using the names parameter as below.

Contd ..
Here df is the dataframe object which controls the entire dataset. We can
control each column by using the Series

Contd ..
Finding the top 10 Scores and their Users

Plotting Graph (“Bar”)
%matplotlib inline
ax = df.groupby(by = ['displayname', 'id'])['score'].sum().sort_values(ascending = False).head(10).
plot(kind='bar',figsize=(10,5), ylim = (50,500), title = "Top 10 users", grid = True, colormap='jet' )
ax.set_xlabel ( "DisplayNames and UsersID")
ax.set_ylabel ("Scores")

Hive and data analysis using pandas

More Related Content

Hive and data analysis using pandas