Module 1 - Introduction To Data Science

This document outlines a module on introduction to data science. It discusses key topics that will be covered, including an overview of the data science process and introduction to machine learning techniques like regression, classification, and clustering. It also provides examples of the data science process steps and tools used in data science, highlighting case studies on using data science in healthcare and to analyze the Titanic disaster.

Uploaded by

Raushan Kashyap
Copyright
© All Rights Reserved

Data Visualization and Interpretation

B. Tech
Department of Information Technology
Walchand College of Engineering, Sangli

Module 1
Introduction to Data Science
Course Instructor: Prof. Tahseen A. Mulla
Module Overview

• Introduction to Data Science
• Overview of the Data Science Process
• Introduction to Data Science Technologies
• Introduction to Machine Learning
  • Regression
  • Classification
  • Clustering
  • Recommendation
PROF. TAHSEEN A. MULLA MODULE_1 - INTRODUCTION_TO_DATA_SCIENCE WALCHAND_COLLEGE_OF_ENGINEERING, SANGLI 2
[Figure: Venn diagram with Data Science at the centre. Labels: Mathematics, Computer Science, Domain Expertise, Machine Learning, Statistical Research, Data Processing.]


Introduction to Data Science

• Data Science involves extracting insights from vast amounts of data using various scientific methods, algorithms and processes

• The term “Data Science” emerged from the evolution of mathematical statistics, data analysis and big data

• Data Science is an interdisciplinary field that extracts knowledge from structured and unstructured data

• Data Science makes it possible to translate a business problem into a research project and then translate the findings back into a practical solution


Revolution of Data Science

[Figure: Data Flow, Unstructured Data, Data Storage, Lack of Predictive Analysis and Lack of Scientific Insights lead into Data Science, which leads to Decision Making, Prediction and Pattern Discovery.]


Significant advantages of using “Data Analytics Technology”

• Data is the oil of today’s world. With the right tools, technologies and algorithms, we can use data and convert it into a distinctive business advantage

• It can help detect fraud using advanced machine learning algorithms

• It makes it possible to build intelligent capabilities into machines

• It enables better and faster decisions


Data Science Components

Data Science Process

[Figure: cycle of six stages: Discovery, Data Preparation, Model Planning, Model Building, Operationalize, Communicate Results.]


Data Science Process – Discovery

• Discovery involves acquiring data from all identified internal and external sources that help answer the business question

• The data can be:
  • Logs from web servers
  • Data gathered from social media
  • Census datasets
  • Data streamed from online sources using APIs
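
To make the first of these sources concrete, here is a minimal sketch in R that turns raw web server log lines into a table. It assumes the Common Log Format, which the slides do not specify; the two sample lines, the pattern and the column names are made up for illustration.

```r
# Hypothetical sample log lines in Common Log Format (illustrative only)
log_lines <- c(
  '192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
  '192.0.2.11 - - [10/Oct/2023:13:56:01 +0000] "GET /missing.html HTTP/1.1" 404 153'
)

# One capture group per field: client IP, timestamp, request line, status, size
pattern <- '^([^ ]+) [^ ]+ [^ ]+ \\[([^]]+)\\] "([^"]*)" ([0-9]{3}) ([0-9]+|-)$'

# regexec() + regmatches() return the full match followed by each capture group
fields <- regmatches(log_lines, regexec(pattern, log_lines))

# Drop the full match (first element) and stack the groups into a data frame
logs <- as.data.frame(do.call(rbind, lapply(fields, function(x) x[-1])),
                      stringsAsFactors = FALSE)
names(logs) <- c("ip", "timestamp", "request", "status", "bytes")
logs$status <- as.integer(logs$status)

print(logs)
```

Real logs often carry extra fields such as the referrer and user agent, so the pattern would need adjusting to the format the server is configured to write.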


Logs from Web Servers

Data gathered from social media

Census datasets

Data streamed from online sources using APIs


Data Science Process – Data Preparation

• Data can contain many inconsistencies, such as missing values, blank columns and incorrect data formats, which need to be cleaned

• Such data has to be processed, explored and conditioned before modeling

The cleaner your data, the better your predictions are
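
The three problems named on this slide can be sketched in a few lines of base R. The data frame below is invented for illustration, and dropping incomplete rows is only one possible policy; the slides do not prescribe a cleaning method.

```r
# Hypothetical raw data showing the three problems named above
raw <- data.frame(
  id     = c(1, 2, 3, 4),
  age    = c("34", "n/a", "29", "41"),   # numbers stored as text, one unusable
  income = c(52000, NA, 61000, 48000),   # a missing value
  notes  = c(NA, NA, NA, NA),            # an entirely blank column
  stringsAsFactors = FALSE
)

clean <- raw

# 1. Drop columns that contain no data at all
blank_cols <- vapply(clean, function(col) all(is.na(col)), logical(1))
clean <- clean[, !blank_cols, drop = FALSE]

# 2. Fix the data format: convert age from text to numeric
#    (values that cannot be parsed, such as "n/a", become NA)
clean$age <- suppressWarnings(as.numeric(clean$age))

# 3. Handle missing values: here rows with any NA are simply removed;
#    imputation (e.g. filling with the median) is a common alternative
clean <- clean[complete.cases(clean), ]

print(clean)
```

Whether to drop or impute missing values depends on how much data is lost and on why the values are missing, so this step usually deserves a deliberate decision per column.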


Data Science Process – Model Planning

• Model planning determines the methods and techniques used to draw out the relationships between input variables

• Planning for a model is performed using different statistical formulas and visualization tools

• SQL Analysis Services, R and SAS/ACCESS are some of the tools used for this purpose
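
As a rough illustration of exploring relationships between variables before choosing a model, the sketch below combines a statistical formula (Pearson correlation) with a visualization (a scatterplot matrix). The use of the built-in mtcars data and the choice of mpg as the outcome are assumptions made only for this example.

```r
# mtcars ships with base R; mpg is treated as the outcome for illustration
vars <- mtcars[, c("mpg", "wt", "hp", "disp")]

# Pairwise Pearson correlations between the candidate variables
cor_matrix <- round(cor(vars), 2)
print(cor_matrix)

# Rank the input variables by the strength of their linear relation with mpg
strength <- sort(abs(cor_matrix["mpg", -1]), decreasing = TRUE)
print(strength)

# Visual check: a scatterplot matrix of the same variables
pairs(vars, main = "Candidate variables (mtcars)")
```

Strong correlations among the inputs themselves are also worth noting at this stage, since they affect which variables are worth keeping in the model.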


Data Science Process – Model Building

• Here the actual model building process starts

• Data scientists split the dataset into training and testing sets

• Techniques like association, classification and clustering are applied to the training dataset

• Once prepared, the model is tested against the testing dataset
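
The split, train and test sequence described above can be sketched as follows. This is one possible setup, not the course's prescribed one: the built-in iris data, the 70/30 split, the seed and the use of logistic regression as the classifier are all assumptions.

```r
set.seed(42)  # fixed seed so the random split is reproducible

# Two-class problem from the built-in iris data (versicolor vs. virginica)
data <- droplevels(iris[iris$Species != "setosa", ])

# Randomly assign 70% of the rows to training and keep the rest for testing
train_idx <- sample(nrow(data), size = round(0.7 * nrow(data)))
train <- data[train_idx, ]
test  <- data[-train_idx, ]

# Build a classifier on the training set only (logistic regression here)
model <- glm(Species ~ Petal.Length, data = train, family = binomial)

# Test the model against the held-out testing set;
# glm() models the probability of the second factor level, "virginica"
prob <- predict(model, newdata = test, type = "response")
pred <- ifelse(prob > 0.5, "virginica", "versicolor")
accuracy <- mean(pred == test$Species)

cat("Test rows:", nrow(test), " Accuracy:", round(accuracy, 3), "\n")
```

Keeping the testing rows out of the fitting step is the point of the split: accuracy measured on rows the model has already seen would overstate how well it generalizes.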


Data Science Process – Operationalize

• This stage delivers the final baselined model with reports, code and technical documents

• The model is deployed into a real-time production environment after thorough testing


Data Science Process – Communicate Results

• Here the key findings are communicated to all the stakeholders

• This helps decide whether the results of the project are a success or a failure, based on the inputs from the model


Data Science Job Roles

• Most prominent data science job titles:
  • Data Scientist – manages enormous amounts of data to drive the business
  • Data Engineer – develops, constructs, tests and maintains architectures for large-scale data processing
  • Data Analyst – responsible for mining vast amounts of data
  • Statistician – collects, analyses and interprets quantitative and qualitative data
  • Data Architect – designs the flow of the architecture
  • Data Admin – ensures that the database is accessible to all relevant users
  • Business Analyst – improves business processes
  • Data/Analytics Manager

Tools for Data Science


What to do with these data?

• Aggregation and Statistics
  • Data warehousing
  • OLAP
• Indexing, Searching and Querying
  • Keyword-based search
  • Pattern matching
• Knowledge Discovery
  • Data mining
  • Statistical modeling
• Data Driven
  • Predictive analysis
  • Deep learning


Data Science: Case Study for Cancer patients

Data Science: Case Study for Titanic (Machine Learning from Disaster)

The thought that disaster is impossible often leads to an unthinkable disaster


Case Study Description: Titanic

• The sinking of the Titanic is one of the most significant shipwrecks in history

• On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew

• This sensational tragedy shocked the international community and led to better safety regulations for ships


Case Study: Titanic

Goal
• The goal is to predict whether a passenger survived the sinking of the Titanic or not
• For each passenger in the test set, we predict a 0 or 1 value for the survival variable

Metric
• The score is the percentage of passengers correctly predicted
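
The metric can be written down in a couple of lines of R. The two vectors below are invented for illustration and are not taken from the Titanic data; six of the eight hypothetical predictions match, so the score here is 75%.

```r
# Hypothetical ground truth and predictions (1 = survived, 0 = did not survive)
actual    <- c(0, 1, 1, 0, 1, 0, 0, 1)
predicted <- c(0, 1, 0, 0, 1, 0, 1, 1)

# Score = share of passengers whose outcome was predicted correctly
accuracy <- mean(predicted == actual)
cat(sprintf("Accuracy: %.1f%%\n", 100 * accuracy))

# A confusion table shows where the errors fall
print(table(Predicted = predicted, Actual = actual))
```

The confusion table is not part of the score, but it helps distinguish a model that misses survivors from one that wrongly predicts survival.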


Case Study: Titanic

Gender Submission Data
File name: gender_submission.csv

Case Study: Titanic

Contains data to check for accuracy on unseen data
File name: test.csv


Case Study: Titanic

Contains data to build Machine Learning models
File name: train.csv

C = Cherbourg
Q = Queenstown
S = Southampton
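
Single-letter codes like these are usually easier to work with once recoded as a labelled factor. The sketch below uses a made-up vector of codes rather than the real file, since the slide does not say which column of train.csv holds them.

```r
# Hypothetical excerpt: a few port codes as they might appear in train.csv
embarked_codes <- c("S", "C", "Q", "S", "S", "C")

# Lookup from code to port name, taken from the legend above
ports <- c(C = "Cherbourg", Q = "Queenstown", S = "Southampton")

# Recode into a labelled factor so summaries and plots show port names
embarked <- factor(embarked_codes, levels = names(ports), labels = unname(ports))

print(table(embarked))
```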


Introduction to R programming language


R programming language

• R was developed by Ross Ihaka and Robert Gentleman in 1993

• R possesses an extensive catalog of statistical and graphical methods

• R includes machine learning algorithms, linear regression, time series and statistical inference

• Most R libraries are written in R itself

• For heavy computational tasks, C, C++ and Fortran code is preferred


RStudio

RStudio – Source Editor

RStudio – Console Window

RStudio – Environment/History/Connections Window

RStudio – File/Plots/…

Installing a package in R
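
Besides the RStudio dialog, a package can be installed from the console with `install.packages()` and attached with `library()`. The small helper below is a sketch of a common pattern that installs only when needed; the name `ensure_package` is hypothetical and not part of R.

```r
# Install a package from CRAN only if it is not already available,
# then attach it. The helper name ensure_package is hypothetical.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # downloads from the configured CRAN mirror
  }
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# "stats" ships with R, so this call never needs the network;
# replace it with e.g. "ggplot2" to install a contributed package
ensure_package("stats")
```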


R entrusted by…

Visits to R language – based on visits to Stack Overflow questions from the US/UK over the last two years


Data Analysis with R

• Series of steps for data analysis:
  • Programming (clear and accessible)
  • Transform (collection of libraries for data science)
  • Discover (investigate the data, refine the hypothesis and analyze it)
  • Model (capture the right model for the data)
  • Communicate (integrate code, graphs and outputs into a report)


What is R used for?

• R is mainly used for:
  • Statistical inference
  • Data analysis
  • Machine learning algorithms


R package

• The primary uses of R are statistics, visualization and machine learning

• R libraries, almost 12K of them, are stored on CRAN (Comprehensive R Archive Network)

• CRAN is open source and free


Most mentioned R packages in the last two years

ggplot2
dplyr
data.table
shiny
plyr
tidyr


Different types of plots in R

• Scatter plot
• Box-and-whisker plot
• Histogram
• Matrix of scatterplots
• Density plot
• Dot plot
• Bar chart
• Line chart
• Pie chart
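
Each of the plot types listed above has a direct counterpart in base R graphics. The calls below are a minimal sketch using the built-in mtcars, iris and pressure data sets; the choice of data and titles is arbitrary, and each call draws on the active graphics device.

```r
# Cars per cylinder count, reused for the bar chart and the pie chart
cyl_counts <- table(mtcars$cyl)

plot(mtcars$wt, mtcars$mpg, main = "Scatter plot")            # scatter plot
boxplot(mpg ~ cyl, data = mtcars, main = "Box-and-whisker")   # box-and-whisker plot
hist(mtcars$mpg, main = "Histogram")                          # histogram
pairs(iris[, 1:4], main = "Matrix of scatterplots")           # scatterplot matrix
plot(density(mtcars$mpg), main = "Density plot")              # density plot
dotchart(mtcars$mpg[1:10], labels = rownames(mtcars)[1:10],
         main = "Dot plot")                                   # dot plot
barplot(cyl_counts, main = "Bar chart")                       # bar chart
plot(pressure$temperature, pressure$pressure, type = "l",
     main = "Line chart")                                     # line chart
pie(cyl_counts, main = "Pie chart")                           # pie chart
```

Packages such as ggplot2 offer the same plot types with a different grammar; base graphics are used here only because they need no installation.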


Tools available for R

• R Tools for Visual Studio
• StatET
• R Analytic Flow
• R Commander (Rcmdr)
• Tinn-R
• RKWard


R Tools for Visual Studio – Interactive Window

R Tools for Visual Studio – Debug Window

R Tools for Visual Studio – History Window

R Tools for Visual Studio – Plot Window

R Tools for Visual Studio – Variable Explorer Window

R Tools for Visual Studio – Enhanced Interpreters Window


StatET

[Figure: StatET screenshot with the Debug Window, Source Editor and Variable Explorer labelled.]


R Analytic Flow

• It is a data analysis tool that utilizes the R environment for statistical computing

• It works on Windows, Mac and Linux and is free for any use


R Analytic Flow

[Figure: R Analytic Flow screenshot with the Analysis Flow Window, Plot Window and Console Window labelled.]

R Commander (Rcmdr)

[Figure: R Commander screenshot with the Script Window and Output Window labelled.]


Tinn-R

• Features:
  • Can communicate with R
  • LaTeX support
  • Handles big files
  • Uses an MDI (Multiple Document Interface)
  • Unicode support
  • Portable
  • Offers different templates for R


RKWard

• Features:
  • Spreadsheet-like data editor
  • Syntax highlighting
  • Data import (e.g., CSV)
  • Plot preview
  • R package management
  • GUI dialogs for statistics and plots


Example for R on plots

• Type the following in RStudio:

# draw 1000 values from a normal distribution (mean 500, sd 100), rounded down
n <- floor(rnorm(1000, 500, 100))
# count how often each value occurs
counts <- table(n)
# plot the counts as a bar chart
barplot(counts)