PROJECT 1: Analyzing clickstream data
On a website, clickstream analysis (sometimes called clickstream analytics) is the process of collecting, analyzing, and
reporting aggregate data about which pages visitors view and in what order. That order is the result of the succession of
mouse clicks each visitor makes (that is, the clickstream).
Download Link
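As a minimal sketch of the idea (not the Hive workflow used in this project), the following Python groups click events by visitor and orders them by timestamp to recover each visitor's page sequence. The event layout, the sample values, and the function name `build_clickstreams` are illustrative assumptions.

```python
from collections import defaultdict

def build_clickstreams(events):
    """Group (visitor_id, timestamp, page) events into ordered page paths."""
    by_visitor = defaultdict(list)
    for visitor_id, timestamp, page in events:
        by_visitor[visitor_id].append((timestamp, page))
    # Sort each visitor's clicks by timestamp to recover the visit order.
    return {
        visitor: [page for _, page in sorted(clicks)]
        for visitor, clicks in by_visitor.items()
    }

# Hypothetical sample events: (visitor_id, timestamp, page)
events = [
    ("v1", 3, "/checkout"),
    ("v1", 1, "/home"),
    ("v2", 2, "/home"),
    ("v1", 2, "/product"),
]
paths = build_clickstreams(events)
```

Here `paths` maps each visitor to the list of pages in the order they were clicked, which is the raw material for any aggregate path report.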
1. Loading the data files into HDFS
2. Starting the new Beeline shell (hive-server 2)
3. Creating new database – alabs_db
4. Creating and loading HIVE table – users
Sagnik_AnalytixLabs_Projects
5. All 3 HIVE base tables – omniturelogs, products and users created
6. Content of HIVE script – webanalytics.sql
7. Using webanalytics.sql, omniture and webanalytics tables are created
8. Creating omniture2 view
PROJECT 2: Sentiment Analysis/Opinion Mining
Sentiment analysis (also known as opinion mining) refers to the use of natural language processing, text analysis and
computational linguistics to identify and extract subjective information in source materials. Sentiment analysis is widely
applied to reviews and social media for a variety of applications, ranging from marketing to customer service.
Data Download Link
Tableau Link
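One simple way to extract subjective information from text is to count words from positive and negative word lists. The Python below is a minimal sketch of that approach, not necessarily the method used in tweets.sql; the word lists and the function name `score_sentiment` are hypothetical.

```python
import re

# Hypothetical word lists; a real lexicon would be far larger.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}

def score_sentiment(text):
    """Return 'positive', 'negative', or 'neutral' from word-list counts."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

A tweet with more positive than negative words is labeled positive, and a tweet with no listed words falls back to neutral.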
1. Loading the data files into HDFS
2. Content of twitter_conf.conf file
3. Executing the TwitterAgent flume agent using twitter_conf.conf file
4. Twitter data moved to HDFS
5. Content of tweets.sql file
6. Executing tweets.sql to create tables and views for analysis
7. Tables and views for analysis are created
Tweets ID sentiment
PROJECT 3: Lending Club Loan Analysis
Lending Club is a US peer-to-peer lending company. Lending Club operates an online lending platform that enables
borrowers to obtain a loan, and investors to purchase notes backed by payments made on loans. Lending Club is the
world's largest peer-to-peer lending platform.
Data Download Link
Tableau Link
1. Loading the data files into HDFS
2. Content of loan_analysis.sql file
3. Tables and view created using loan_analysis.sql
PROJECT 4: HVAC Temperature Analysis
HVAC (Heating, Ventilation and Air Conditioning) equipment needs a control system to regulate the operation of
a heating and/or air conditioning system. Usually a sensing device compares the actual state (e.g. temperature)
with a target state, and the control system then determines what action has to be taken.
Data Download Link
Tableau Link
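The compare-then-act loop described above can be sketched in a few lines of Python. This is an illustration only; the function name `control_action`, the action labels, and the 1-degree tolerance are assumptions, not values taken from the project data.

```python
def control_action(actual_temp, target_temp, tolerance=1.0):
    """Pick an action by comparing the sensed temperature with the target."""
    if actual_temp < target_temp - tolerance:
        return "heat"   # too cold: below the tolerated band
    if actual_temp > target_temp + tolerance:
        return "cool"   # too hot: above the tolerated band
    return "hold"       # within tolerance: no action needed
```

The tolerance band keeps the system from switching on and off for small deviations around the target.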
1. Loading the data files into HDFS
2. Content of sensor_analysis.sql file
3. Tables and view created using sensor_analysis.sql
PROJECT 5: Upsell Analysis
Upselling is a sales technique whereby a seller induces the customer to purchase more expensive items, upgrades or other
add-ons in an attempt to make a more profitable sale.
Data Download Link
1. Sample data
2. Content of upsell_analysis.sql file
A
B
C
3. A
What is A doing?
• Concatenates first name and last name to a single field – name
• Assigns each customer a category
• Calculates the total amount spent by the customer in each category
• Orders customers by the total amount spent in descending order
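The steps listed for A can be sketched in Python as follows. This is an illustrative equivalent, not the contents of upsell_analysis.sql; the input layout, the sample rows, and the function name `spend_per_category` are assumptions.

```python
from collections import defaultdict

def spend_per_category(purchases):
    """Total amount per (name, category), largest totals first."""
    totals = defaultdict(float)
    for first, last, category, amount in purchases:
        name = f"{first} {last}"  # concatenate first and last name
        totals[(name, category)] += amount
    rows = [(name, cat, total) for (name, cat), total in totals.items()]
    # Order by total amount spent, descending.
    return sorted(rows, key=lambda row: row[2], reverse=True)

# Hypothetical sample purchases: (first_name, last_name, category, amount)
purchases = [
    ("Ann", "Lee", "books", 20.0),
    ("Ann", "Lee", "toys", 5.0),
    ("Bob", "Ray", "books", 12.5),
    ("Ann", "Lee", "books", 10.0),
]
step_a = spend_per_category(purchases)
```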
4. B
4.1 What is B doing?
• Extracts name from A
• Assigns each customer their respective categories using the COLLECT_LIST() function, which converts
multiple rows to a single row of array datatype
• Assigns each customer the respective amounts spent on those categories
• Calculates the overall total amount spent by each customer on all categories
• Evaluates the recommended category for each customer as per the amount spent per category
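A rough Python equivalent of the steps listed for B is shown below. The source does not say how the recommended category is chosen, so this sketch assumes it is the category with the highest spend; that rule, the input rows, and the function name `summarize_customers` are all assumptions.

```python
def summarize_customers(rows):
    """Collapse (name, category, total) rows into one record per customer."""
    summary = {}
    for name, category, total in rows:
        rec = summary.setdefault(name, {"categories": [], "amounts": []})
        rec["categories"].append(category)  # analogous to COLLECT_LIST(category)
        rec["amounts"].append(total)        # amounts stay aligned with categories
    for rec in summary.values():
        rec["overall_total"] = sum(rec["amounts"])
        # Assumption: recommend the category with the highest spend.
        top = rec["amounts"].index(max(rec["amounts"]))
        rec["recommended"] = rec["categories"][top]
    return summary

# Hypothetical rows in the shape produced by A: (name, category, total)
rows = [
    ("Ann Lee", "books", 30.0),
    ("Bob Ray", "books", 12.5),
    ("Ann Lee", "toys", 5.0),
]
step_b = summarize_customers(rows)
```

Keeping the category and amount lists in the same order mirrors the paired arrays that result from collecting both columns per customer.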
4.2 Sample data of B
5. Sample data after C
PROJECT 6: Web Logs’ Analysis
An access log is a list of all the requests that people have made for individual files on a Web site. These files
include the HTML files, their embedded graphic images, and any other associated files that get transmitted. The access
log (sometimes referred to as the "raw data") can be analysed and summarized by another program.
Data Download Link
Tableau Link
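To illustrate how another program can summarize an access log, the Python sketch below parses lines and counts requests per file. It assumes the lines follow the Apache Common Log Format, which may differ from this project's log layout; the sample lines and the function name `summarize` are hypothetical.

```python
import re
from collections import Counter

# Pattern for the Apache Common Log Format (assumed layout of the log lines).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

def summarize(lines):
    """Count requests per path, skipping lines that do not match."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:
            counts[match.group("path")] += 1
    return counts

# Hypothetical sample lines, including one malformed entry.
sample = [
    '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '127.0.0.1 - - [10/Oct/2000:13:55:40 -0700] "GET /logo.gif HTTP/1.0" 200 512',
    '127.0.0.1 - - [10/Oct/2000:13:56:01 -0700] "GET /index.html HTTP/1.0" 304 -',
    'not a log line',
]
hits = summarize(sample)
```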
1. Accessing apache access logs using flume
1.1 flume.conf
1.2 Extract web logs’ data using the following command:
/usr/lib/flume-ng/bin/flume-ng agent -n source_agent -c conf -f /usr/lib/flume-ng/conf/flume.conf
2. Sample log data
3. Moving log file to HDFS
4. PIG script – log_processing.pig
4.1 Content
4.2 Execution
5. Creating HIVE table on the processed log data