Module 1 - Introduction To Data Science
B. Tech
Department of Information Technology
Walchand College of Engineering, Sangli
Module 1
Introduction to Data Science
Course Instructor: Prof. Tahseen A. Mulla
Module Overview
[Figure: Statistical Research, Data Processing, Domain Expertise]
Data Science involves extracting insights from vast amounts of data using various scientific methods, algorithms and processes
The term “Data Science” emerged from the evolution of mathematical statistics, data analysis and big data
Data Science is an interdisciplinary field that makes it possible to extract knowledge from structured and unstructured data
Data Science makes it possible to translate a business problem into a research project and then translate the findings back into a practical solution
Data Flow
[Figure: Data flow. Left: Unstructured Data, Data Storage, Lack of Predictive Analysis. Centre: Data Science. Right: Decision Making, Prediction, Pattern Discovery.]
Data is the oil of today’s world. With the right tools, technologies and algorithms, we can
turn data into a distinctive business advantage
[Figure: Process cycle with the stages Data Preparation, Model Planning, Model Building, Operation, Communicate Results]
The first step involves acquiring data from all the identified internal and external sources
that help to answer the business question
Data preparation: data can have many inconsistencies, such as missing values, blank columns and
incorrect data formats, which need to be cleaned
Such data has to be prepared by processing, exploring and conditioning it before modeling
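The cleaning described above can be sketched in base R. This is only a minimal illustration on a made-up data frame: the column names (age, fare, notes) are hypothetical, and filling a missing value with the median is just one possible choice.

```r
# Hypothetical raw data: a missing value, a blank column, numbers stored as text
raw <- data.frame(
  age   = c("22", "38", NA, "35"),
  fare  = c("7.25", "71.28", "7.92", "53.10"),
  notes = c(NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Drop columns that are entirely blank (all NA)
clean <- raw[, colSums(!is.na(raw)) > 0, drop = FALSE]

# Fix the incorrect format: convert text columns to numeric
clean$age  <- as.numeric(clean$age)
clean$fare <- as.numeric(clean$fare)

# Fill the missing age with the median of the observed ages (an assumption)
clean$age[is.na(clean$age)] <- median(clean$age, na.rm = TRUE)

str(clean)
```

After these steps the data frame has only the age and fare columns, both numeric, with no missing values.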
Model planning: determine the methods and techniques used to draw out the relationships between
input variables
SQL Analysis Services, R, and SAS/ACCESS are some of the tools used for this
purpose
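One simple way to look at the relationships between variables in R is a correlation matrix. The sketch below assumes the built-in mtcars example dataset, which is not part of this module, and is only one of many possible exploratory techniques.

```r
# mtcars is a small example dataset that ships with R
vars <- mtcars[, c("mpg", "wt", "hp")]

# Pairwise correlations between the candidate variables
corr <- cor(vars)
print(round(corr, 2))

# Scatter-plot matrix for a visual check of the same relationships
pairs(vars)
```

A strongly negative or positive entry in the matrix suggests a relationship worth modeling; the scatter-plot matrix helps to spot non-linear patterns that a single correlation value would hide.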
Operation: deliver the final baselined model with reports, code and technical
documents
Communicate results: decide whether the results of the project are a success or a failure, based
on the inputs from the model
Goal
• The goal is to predict whether a passenger survived the sinking of the Titanic
• For each passenger in the test set, we predict a 0 or 1 value for the variable
Metric
• The score is the percentage of passengers correctly predicted
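The metric can be computed in a few lines of R. This is a minimal sketch: the vectors actual and predicted are made-up values for illustration, not results from any real model.

```r
# Hypothetical true outcomes and predictions (1 = survived, 0 = did not survive)
actual    <- c(0, 1, 1, 0, 1, 0, 0, 1)
predicted <- c(0, 1, 0, 0, 1, 1, 0, 1)

# Score: percentage of passengers whose outcome was predicted correctly
accuracy <- mean(actual == predicted) * 100
accuracy
```

Here 6 of the 8 made-up predictions match the true outcome, so the score is 75.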
C = Cherbourg
Q = Queenstown
S = Southampton
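Single-letter codes like these are easier to read in tables and plots once they carry labels. A minimal sketch using an R factor follows; the vector name embarked and its sample values are hypothetical.

```r
# Hypothetical sample of port codes as they might appear in a data column
embarked <- c("S", "C", "Q", "S", "S")

# Attach readable labels to the single-letter codes
port <- factor(embarked,
               levels = c("C", "Q", "S"),
               labels = c("Cherbourg", "Queenstown", "Southampton"))

# Count passengers per port using the readable labels
table(port)
```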
For most heavy computational tasks, code written in C, C++ or Fortran is
preferred
[Figure: Visits to R language questions on Stack Overflow from the US/UK over the last two years]
Statistical inference
Data analysis
R is used primarily for statistics, visualization and machine
learning
Almost 12K R packages are hosted on CRAN (the Comprehensive R Archive
Network)
ggplot2
dplyr
data.table
shiny
plyr
tidyr
[Figure: IDE layout showing the Source Editor, Variable Explorer, Plot Window and Console Window]
Rcommander (Rcmdr)
[Figure: Interface showing the Script Window and Output Window]
Features
Can communicate with R
Supports LaTeX
Handles big files
Works on MDI (Multiple Document Interface)
Unicode support
Portable
Provides different templates for R
Features –
Spreadsheet-like data editor
Syntax highlighting
Data import (e.g. CSV)
Plot preview
R package management
GUI dialogs for statistics and plots
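Type in RStudio:
# Draw 1000 values from a normal distribution (mean 500, sd 100) and round down
n <- floor(rnorm(1000, 500, 100))
# Count how often each integer value occurs
t <- table(n)
# Plot the frequency of each value as a bar chart
barplot(t)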