2. Agenda
● Introduction to Big Data and Hadoop
● Understanding Hive and its components
● Hive architecture
● Stack Overflow use case (datascience.stackexchange.com)
● Reporting with Pandas
3. Data
Volumes (KB, MB, GB, TB, PB, ...)
● Structured
○ Tabular rows and columns (databases; handle GBs of data)
○ DWH (Teradata systems) and BI (handle TBs of data)
● Semi-structured
○ Excel, XML, JSON, logs, etc.
● Unstructured
○ Audio, video, images, etc.
7. HDFS
Hadoop Distributed File System.
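1. Data replication (3 copies by default).
2. 64 MB block size, the Hadoop 1.x default (Hadoop 2.x and later default to 128 MB). For comparison, a Windows 8 system typically uses 4 KB blocks.
3. Unix-like commands, with a hyphen (-) before the command, e.g. hadoop fs -ls.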
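To make the replication and block-size figures concrete, here is a rough sketch of how much a single file costs in HDFS under those defaults. The 1 GB file size and the function name `hdfs_footprint` are illustrative assumptions, not part of the original slides.

```python
import math

BLOCK_SIZE_MB = 64   # block size quoted on the slide
REPLICATION = 3      # default replication factor quoted on the slide


def hdfs_footprint(file_size_mb):
    """Return (blocks, block replicas, raw storage in MB) for one file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    replicas = blocks * REPLICATION
    # The last block is not padded, so raw storage is just size x replication.
    raw_storage_mb = file_size_mb * REPLICATION
    return blocks, replicas, raw_storage_mb


# Hypothetical 1 GB (1024 MB) file.
blocks, replicas, raw_mb = hdfs_footprint(1024)
print(blocks, replicas, raw_mb)  # 16 48 3072
```

So a 1 GB file is split into 16 blocks, stored as 48 block replicas across the cluster, and occupies about 3 GB of raw disk.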
10. Hive
● What is Hive?
● Hive is a data warehouse infrastructure built on top of Hadoop that compiles SQL queries into MapReduce jobs
Hive is not
● A relational database
● A design for online transaction processing (OLTP)
● A language for real-time queries and row-level updates
Features of Hive
● It stores the schema in a database and the processed data in HDFS.
● It is designed for OLAP.
● It provides an SQL-like query language called HiveQL (HQL).
● It is familiar, fast, scalable, and extensible.
11. How does Hive Work
● Hive is built on top of Hadoop
● Hive stores data in HDFS
● Hive is schema-on-read, not schema-on-write
● Hive compiles SQL queries into MapReduce jobs and runs them on the Hadoop cluster
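As a rough illustration of what "compiling SQL into MapReduce" means, the sketch below mimics in plain Python how a query such as SELECT owner_user_id, COUNT(*) FROM posts GROUP BY owner_user_id could break down into map, shuffle, and reduce steps. The table, column names, and rows are hypothetical, and this is not Hive's actual execution plan, only the general shape of the idea.

```python
from collections import defaultdict

# Hypothetical rows of a "posts" table: (post_id, owner_user_id).
rows = [(1, "u1"), (2, "u2"), (3, "u1"), (4, "u3"), (5, "u1")]


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for _post_id, owner in rows:
        yield owner, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Aggregate each key's values; summing the 1s implements COUNT(*)."""
    return {key: sum(values) for key, values in groups.items()}


counts = reduce_phase(shuffle(map_phase(rows)))
print(counts)  # {'u1': 3, 'u2': 1, 'u3': 1}
```

On a real cluster the map and reduce steps run in parallel on different nodes, each working on the HDFS blocks stored closest to it.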
21. Output of Hive MR
Copy the output to a local directory and rename it to results.csv. We then load
the CSV into pandas for data analysis.
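A minimal sketch of the loading step follows. Hive output files normally carry no header row, so column names are supplied by hand. The names user_id, display_name, and reputation and the sample rows are assumptions made for illustration. An in-memory buffer stands in for results.csv so the snippet runs on its own.

```python
import io

import pandas as pd

# Stand-in for results.csv; the real file would come from the Hive job.
# Columns and values here are made up for illustration.
sample = io.StringIO("101,alice,5400\n102,bob,3100\n103,carol,7250\n")

# header=None because the file has no header row; names are assumed.
df = pd.read_csv(
    sample,
    header=None,
    names=["user_id", "display_name", "reputation"],
)
print(df.shape)  # (3, 3)
```

With the real file, pass the path "results.csv" instead of the buffer. If the Hive job wrote its output with the default field delimiter (Ctrl-A) rather than commas, the `sep` argument would need to be set accordingly.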
22. Python Pandas
Python pandas is an open-source library providing high-performance, easy-to-use
data structures and data analysis tools for the Python programming language.
Problem: find the top 10 users on data.stackexchange.com
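One way to approach this in pandas is sketched below. The slides do not say how "top" is measured, so ranking by a reputation column is an assumption here, as are the column names and the made-up user data.

```python
import pandas as pd

# Made-up users; in practice this frame would be loaded from results.csv.
users = pd.DataFrame({
    "display_name": [f"user{i}" for i in range(1, 16)],
    "reputation": [120, 980, 45, 3300, 760, 15, 2210, 640,
                   87, 1500, 5, 410, 2999, 330, 71],
})

# Keep the 10 rows with the highest reputation, sorted in descending order.
top10 = users.nlargest(10, "reputation").reset_index(drop=True)
print(top10)
```

`nlargest` avoids sorting the whole frame when only the first few rows are needed. `sort_values("reputation", ascending=False).head(10)` would give the same rows.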