Predicting Tag of Questions & Data
Science Job Required Skill Analysis
STEVENS INSTITUTE OF TECHNOLOGY, SPRING 2017, CS - 513
1
Projects By:
Priya Parmar
10412380
Ruchika Sutariya
10418975
Harsh Kevadia
10420312
2
Agenda
 Problem Statement
 Objective
 Project Flow
 Data Scraping
 DataSet After Scraping
 Cleaning Data
 Classification Models
 Conclusion
3
What is CareerCup?
 CareerCup helps people prepare for jobs at tech companies.
 Unlike other types of interviews, technical interviews are intensely skill based.
 What CareerCup does is offer ways of studying for an interview.
 You can ask questions as part of your interview prep.
 You can post questions that you were asked in your interview to help others.
4
Problem Statement
 Some users don’t put tags in their questions.
 This leads to questions with ambiguous categories.
 Deciding manually which category each question belongs to is cumbersome and increases the human workload.
 There has to be a way to categorize these questions that don’t have
tags/categories.
5
Objective
 We focus on predicting the category of a question based on its properties.
 We use previously asked questions, along with their votes and tags, to categorize new questions.
 The main aim of this project is to predict the category of questions from previously categorized questions, reducing the manual workload.
6
An Example .. 7
Project Flow 8
Understanding the Problem → Data Scraping → Data Cleaning → Algorithm Selection → Test Different Classification Models → Compare Model Accuracy → Best Model Selection
Data Scraping
 Web sites are written using HTML, which means that each web page is a
structured document.
 Data scraping is the practice of using a computer program to sift through
a web page and gather the data that you need in a format most useful to
you while at the same time preserving the structure of the data.
 We used a Python program to scrape the CareerCup website and obtain a real-time dataset.
9
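The scraper itself isn't shown in the slides. Below is a minimal sketch of what it might look like, assuming the requests and BeautifulSoup libraries; the listing URL, query parameter, and CSS selectors are hypothetical placeholders, not CareerCup's real page structure.

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.careercup.com/page"  # hypothetical listing URL

def scrape_page(page_number):
    """Fetch one listing page and return (tag, votes, question) tuples."""
    response = requests.get(BASE_URL, params={"n": page_number}, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    rows = []
    # The selectors below are placeholders; inspect the live HTML for the real ones.
    for item in soup.select("li.question"):
        tag = item.select_one("span.tag").get_text(strip=True)
        votes = int(item.select_one("span.votes").get_text(strip=True))
        question = item.select_one("span.qtext").get_text(" ", strip=True)
        rows.append((tag, votes, question))
    return rows
```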
After Scraping Data
 The format that we get after scraping is tab-separated:
TAG <tab> VOTE <tab> Question
 Example (TAG = algorithm, VOTE = 3, followed by the Question text):
 algorithm 3 Given the root of a Binary Tree along with two integer values. Assume that both integers are present in the tree. Find the LCA (Least Common Ancestor) of the two nodes with values of the given integers. 2 pass solution is easy. You must solve this in a single pass.
10
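Assuming the scraper writes one such tab-separated row per question, the file can be loaded into a pandas DataFrame as in this sketch (the file name is hypothetical):

```python
import csv
import pandas as pd

# Three tab-separated columns, as in the format above.
df = pd.read_csv("careercup_questions.tsv", sep="\t",
                 names=["tag", "vote", "question"], quoting=csv.QUOTE_NONE)
print(df.head())
```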
Cleaning Data
 Removing the questions with negative votes to improve the quality of the
dataset.
 For each question, we use text mining to remove stop words (e.g., "is", "the") from the question text.
11
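A minimal sketch of both cleaning steps, reusing the df from the previous snippet and scikit-learn's built-in English stop-word list (the slides don't say which stop-word list the project actually used):

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# 1. Drop questions with negative votes.
df = df[df["vote"] >= 0].copy()

# 2. Remove stop words ("is", "the", ...) from each question.
def remove_stop_words(text):
    words = text.lower().split()
    return " ".join(w for w in words if w not in ENGLISH_STOP_WORDS)

df["question"] = df["question"].apply(remove_stop_words)
```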
Process: Converting Raw Questions to a Matrix
12
Extracting words
 Words are extracted in two ways:
1. Question is split by spaces to get individual words.
2. A fixed length of 5 characters is used to define a word.
13
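Both word-extraction strategies can be plugged into scikit-learn's CountVectorizer to build the question-by-term matrix from the previous slide. This is a sketch under assumed settings; whether spaces are stripped before taking the 5-character pieces isn't stated in the slides.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Strategy 1: split each question on spaces.
word_vectorizer = CountVectorizer(analyzer=str.split)

# Strategy 2: fixed-length 5-character pieces of each question.
def five_char_chunks(text):
    text = text.replace(" ", "")  # assumption: spaces removed first
    return [text[i:i + 5] for i in range(0, len(text), 5)]

char_vectorizer = CountVectorizer(analyzer=five_char_chunks)

X_words = word_vectorizer.fit_transform(df["question"])  # question-by-term count matrix
X_chars = char_vectorizer.fit_transform(df["question"])
y = df["tag"]
```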
Classification Models Used
1. KNN
2. Decision Tree
3. Logistic Regression
4. Random Forest
5. Naive Bayesian
6. ANN
14
KNN
 Supervised method: a target variable is specified, and the algorithm “learns” from the examples by determining which values of the predictor variables are associated with different values of the target variable.
 K-nearest neighbors is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure.
 It has been used in statistical estimation.
15
KNN Parameters 16
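The parameter names in the output below (n_neighbors, weights) match scikit-learn's KNeighborsClassifier, so the search was presumably something like this sketch; the cross-validation setup and the use of the word matrix X_words/y from the earlier sketches are assumptions.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {
    "n_neighbors": [1, 3, 5, 7, 9, 11, 13, 15, 17],
    "weights": ["uniform", "distance"],
}
knn_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
knn_search.fit(X_words, y)

# Print one line per parameter combination, as on the output slide.
for params, score in zip(knn_search.cv_results_["params"],
                         knn_search.cv_results_["mean_test_score"]):
    print(params, score)
print("Best accuracy:", knn_search.best_score_)
```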
Output
 {'n_neighbors': 1, 'weights': 'uniform'} 0.7670329670329671
 {'n_neighbors': 1, 'weights': 'distance'} 0.7670329670329671
 {'n_neighbors': 3, 'weights': 'uniform'} 0.7772893772893773
 {'n_neighbors': 3, 'weights': 'distance'} 0.7714285714285715
 {'n_neighbors': 5, 'weights': 'uniform'} 0.8131868131868132
 {'n_neighbors': 5, 'weights': 'distance'} 0.8043956043956044
 {'n_neighbors': 7, 'weights': 'uniform'} 0.8278388278388278
 {'n_neighbors': 7, 'weights': 'distance'} 0.819047619047619
 {'n_neighbors': 9, 'weights': 'uniform'} 0.8263736263736263
 {'n_neighbors': 11, 'weights': 'uniform'} 0.8293040293040294
 {'n_neighbors': 13, 'weights': 'distance'} 0.8197802197802198
 {'n_neighbors': 15, 'weights': 'uniform'} 0.8336996336996337
 {'n_neighbors': 17, 'weights': 'distance'} 0.8219780219780219
17
Accuracy: 0.8336996336996337
Decision Tree 18
Decision Tree 19
DT Parameters 20
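The decision-tree parameters (criterion, max_depth) likewise match scikit-learn's DecisionTreeClassifier; a brief sketch of the same grid-search pattern under the same assumptions:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

dt_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"criterion": ["gini", "entropy"], "max_depth": [3, 4, 5, 6, 7, 8]},
    cv=5,
)
dt_search.fit(X_words, y)
print(dt_search.best_params_, dt_search.best_score_)
```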
Output
 {'criterion': 'gini', 'max_depth': 3} 0.8586080586080586
 {'criterion': 'gini', 'max_depth': 4} 0.8622710622710623
 {'criterion': 'gini', 'max_depth': 5} 0.8666666666666667
 {'criterion': 'gini', 'max_depth': 6} 0.8593406593406593
 {'criterion': 'gini', 'max_depth': 7} 0.8578754578754578
 {'criterion': 'gini', 'max_depth': 8} 0.8556776556776556
 {'criterion': 'entropy', 'max_depth': 3} 0.8549450549450549
 {'criterion': 'entropy', 'max_depth': 4} 0.863003663003663
 {'criterion': 'entropy', 'max_depth': 5} 0.8695970695970696
 {'criterion': 'entropy', 'max_depth': 6} 0.8681318681318682
 {'criterion': 'entropy', 'max_depth': 7} 0.8652014652014652
 {'criterion': 'entropy', 'max_depth': 8} 0.8593406593406593
21
Accuracy: 0.8695970695970696
Logistic Regression
 It is a classification method that generalizes logistic regression to multiclass problems, i.e., problems with more than two possible discrete outcomes.
 It is used to predict the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables.
22
LR Parameters 23
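For logistic regression, C and penalty correspond to scikit-learn's LogisticRegression. An l1 penalty needs a compatible solver such as liblinear, which this sketch assumes; the rest follows the same pattern as above.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

lr_search = GridSearchCV(
    LogisticRegression(solver="liblinear", max_iter=1000),
    {"C": [0.5, 1, 1.5, 2], "penalty": ["l1", "l2"]},
    cv=5,
)
lr_search.fit(X_words, y)
print(lr_search.best_params_, lr_search.best_score_)
```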
Output
 {'C': 0.5, 'penalty': 'l1'} 0.8293040293040294
 {'C': 0.5, 'penalty': 'l2'} 0.8307692307692308
 {'C': 1, 'penalty': 'l1'} 0.8315018315018315
 {'C': 1, 'penalty': 'l2'} 0.8315018315018315
 {'C': 1.5, 'penalty': 'l1'} 0.8300366300366301
 {'C': 1.5, 'penalty': 'l2'} 0.832967032967033
 {'C': 2, 'penalty': 'l1'} 0.8278388278388278
 {'C': 2, 'penalty': 'l2'} 0.8322344322344323
24
Accuracy: 0.832967032967033
Random Forest
 Random forest is an ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes predicted by the individual trees.
 It operates by constructing many decision trees at training time.
25
RF Parameters 26
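Similarly, n_estimators and criterion are RandomForestClassifier parameters in scikit-learn; a brief sketch under the same assumptions:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rf_search = GridSearchCV(
    RandomForestClassifier(random_state=0, n_jobs=-1),
    {"n_estimators": [500, 1000, 1500, 2000, 2500, 3000],
     "criterion": ["gini", "entropy"]},
    cv=5,
)
rf_search.fit(X_words, y)
print(rf_search.best_params_, rf_search.best_score_)
```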
Output
 {'n_estimators': 500, 'criterion': 'gini'} 0.8468864468864469
 {'n_estimators': 1000, 'criterion': 'gini'} 0.8498168498168498
 {'n_estimators': 1500, 'criterion': 'gini'} 0.8468864468864469
 {'n_estimators': 2000, 'criterion': 'gini'} 0.8461538461538461
 {'n_estimators': 2500, 'criterion': 'gini'} 0.8490842490842491
 {'n_estimators': 3000, 'criterion': 'gini'} 0.8476190476190476
 {'n_estimators': 500, 'criterion': 'entropy'} 0.8446886446886447
 {'n_estimators': 1000, 'criterion': 'entropy'} 0.8454212454212454
 {'n_estimators': 1500, 'criterion': 'entropy'} 0.8424908424908425
 {'n_estimators': 2000, 'criterion': 'entropy'} 0.8454212454212454
 {'n_estimators': 2500, 'criterion': 'entropy'} 0.8454212454212454
 {'n_estimators': 3000, 'criterion': 'entropy'} 0.8424908424908425
27
Accuracy: 0.8498168498168498
Naive Bayesian
 It is a classification technique based on Bayes' Theorem with an
assumption of independence among predictors.
 In simple terms, a Naive Bayes classifier assumes that the presence of a
particular feature in a class is unrelated to the presence of any other
feature.
 A Naive Bayesian model is easy to build, with no complicated iterative parameter estimation, which makes it particularly useful for very large datasets.
28
NB Parameters 29
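alpha and fit_prior are the smoothing and prior options of scikit-learn's MultinomialNB, which works directly on the count matrix built earlier; a brief sketch:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

nb_search = GridSearchCV(
    MultinomialNB(),
    {"alpha": [0, 1.0, 5.0, 10.0, 20.0, 30.0, 40.0],
     "fit_prior": [True, False]},
    cv=5,
)
nb_search.fit(X_words, y)
print(nb_search.best_params_, nb_search.best_score_)
```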
Output
{'alpha': 0, 'fit_prior': True} 0.7970695970695971
{'alpha': 0, 'fit_prior': False} 0.7919413919413919
{'alpha': 1.0, 'fit_prior': True} 0.7934065934065934
{'alpha': 1.0, 'fit_prior': False} 0.7736263736263737
{'alpha': 5.0, 'fit_prior': True} 0.802930402930403
{'alpha': 5.0, 'fit_prior': False} 0.8131868131868132
{'alpha': 10.0, 'fit_prior': True} 0.780952380952381
{'alpha': 10.0, 'fit_prior': False} 0.7992673992673993
{'alpha': 20.0, 'fit_prior': True} 0.7736263736263737
{'alpha': 20.0, 'fit_prior': False} 0.7846153846153846
{'alpha': 30.0, 'fit_prior': True} 0.7743589743589744
{'alpha': 30.0, 'fit_prior': False} 0.7846153846153846
{'alpha': 40.0, 'fit_prior': True} 0.7743589743589744
{'alpha': 40.0, 'fit_prior': False} 0.7831501831501831
30
Accuracy: 0.8131868131868132
ANN - Artificial Neural Networks 31
Accuracy: 0.916256157635 (highest accuracy among all models)
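The slides don't show which library or architecture was used for the ANN. One possible minimal sketch uses scikit-learn's MLPClassifier; the hidden-layer size, iteration limit, and cross-validation setup here are assumptions.

```python
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

ann = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=0)
scores = cross_val_score(ann, X_words, y, cv=5)
print("Mean accuracy:", scores.mean())
```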
Comparison of Classification models 32
Conclusion
 We have explored different prediction models. By measuring the performance of the models on real data, we have seen interesting results on how well the category of a question can be predicted.
 We found that ANN achieves the highest accuracy for predicting tags on the CareerCup dataset.
33
Future Scope
 Currently we use 4 categories for prediction. In the future, this can be extended to hundreds of categories while keeping satisfactory accuracy.
 In the future, we can also categorize questions by company. For example, the kinds of questions asked at Amazon or Google could be predicted.
34
References
 http://www.careercup.com
 https://www.wikipedia.org/
35
Data Science Job Required Skill Analysis
36
Data Science Job Analysis - Monster.com
 Total Jobs: 450
 Total Python Skill Jobs: 200
 Python Percentage: 44.44%
 Total Big Data Skill Jobs: 153
 Big Data Percentage: 34.00%
 Total SAS Skill Jobs: 61
 SAS Percentage: 13.56%
 Total R Skill Jobs: 128
 R Percentage: 28.44%
 Total Machine Learning Skill Jobs: 153
 Machine Learning Percentage: 34.00%
37
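The percentages above are simply the share of scraped postings whose description mentions each skill. A minimal sketch of that counting step follows; the job_descriptions list, keyword lists, and matching rules are illustrative assumptions (in the project this list would hold the 450 Monster.com postings).

```python
# Hypothetical input: in the project this would be the 450 posting texts from Monster.com.
job_descriptions = [
    "Data scientist with Python and machine learning experience ...",
    "Analyst role requiring SAS and big data tooling ...",
]

skills = {
    "Python": ["python"],
    "Big Data": ["big data"],
    "SAS": ["sas"],
    "R": [" r ", " r,"],                 # crude matching for a one-letter skill name
    "Machine Learning": ["machine learning"],
}

def jobs_mentioning(keywords):
    return sum(any(k in desc.lower() for k in keywords) for desc in job_descriptions)

total = len(job_descriptions)
for skill, keywords in skills.items():
    n = jobs_mentioning(keywords)
    print(f"{skill}: {n} jobs ({n / total:.2%})")
```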
Result: 38
Thank You
Question & Answer
39