Data Science Using Python - Day 1-2
Today’s Agenda
Data science employs techniques and theories drawn from a wide range of disciplines such as Mathematics,
Statistics, Information Science, and Computer Science, in particular from the subdomains of Machine Learning,
Classification, Association, Cluster Analysis, Data Mining, Forecasting, and Visualization, in order to understand
and analyze actual phenomena with data.
▪ Regression Analysis – Finding the relationship between a dependent variable and one or
more independent variables - Predicting Diamond price based on Carat, Cut & Clarity
▪ Classification Analysis – Dividing objects into 2 or more known classes - Distinguishing cancer and
normal cells
▪ Anomaly Detection (Outlier Analysis) – Finding unusual observations - Credit card transactions
▪ Association Analysis – Finding links - Shopping cart analysis
▪ Cluster Analysis (Segmentation) – Grouping similar objects together - Grouping customers into
different clusters based on their previous shopping data/transactions.
▪ Time Series Analysis – Time dependent Data - Stock prediction
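To make the regression item above concrete, here is a minimal sketch of a one-variable least-squares fit in plain Python. The carat and price figures are made-up illustrative values (not real diamond data), and the helper name fit_line is hypothetical. Cut and Clarity are left out to keep the example to a single independent variable.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic, illustrative data only: price = 1000 + 5000 * carat
carat = [0.3, 0.5, 0.7, 1.0, 1.2]
price = [2500, 3500, 4500, 6000, 7000]

slope, intercept = fit_line(carat, price)

# Predict the price of a hypothetical 0.9 carat diamond
predicted = slope * 0.9 + intercept
print(f"price = {slope:.1f} * carat + {intercept:.1f}")
print(f"predicted price at 0.9 carat: {predicted:.1f}")
```

Here carat is the independent variable and price is the dependent variable. A real model would also need to encode categorical attributes such as Cut and Clarity.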
Minimum Criteria :
✓ Basic understanding of programming and database concepts (e.g. PL/SQL, C, C++, Java, RDBMS).
✓ Basic knowledge of Mathematics and Statistics concepts.
✓ Basic knowledge of reporting or visualization tools such as Tableau, Spotfire, SAP Business Objects, or any other
reporting tool.
Homework
▪ Do some more research on what Data Science and Machine Learning are
https://www.the-modeling-agency.com/crisp-dm.pdf
What is CRISP-DM?
CRISP-DM was conceived in late 1996 by three veterans of the young and
immature data mining market. CRISP-DM stands for “CRoss-Industry Standard
Process for Data Mining”.
This process model for data mining provides an overview of the life cycle of a
data mining project. It contains all the phases of a project, their respective tasks,
and the relationships between these tasks. Relationships could exist between any
data mining tasks depending on the goals, the background, and the interest of the
user and, most importantly, on the data.
CRISP-DM Phases
The life cycle of a data mining project consists of six phases:
2. Data understanding - The data understanding phase starts with initial data collection and
proceeds with activities that enable you to become familiar with the data.
3. Data preparation - The data preparation phase covers all activities needed to construct
the final dataset or data that will be fed into the modeling tool(s) from the initial raw data. Tasks
include table, record, and attribute selection, as well as transformation and cleaning of data for
modeling tools.
4. Modeling - In this phase, various modeling techniques are selected and applied,
and their parameters are calibrated to optimal values.
5. Evaluation - At this stage you have built a model (or models) that appears to
have high quality from a data analysis perspective. Before proceeding to final
deployment of the model, it is important to thoroughly evaluate it and review the
steps executed to create it, to be certain the model properly achieves the business
objectives. A key objective is to determine if there is some important business issue
that has not been sufficiently considered.
Basics Of Statistics
What is Statistics?
Statistics is a branch of Mathematics dealing with the collection, analysis,
interpretation, presentation, and organization of data.
Random variable
A random variable (RV) is a variable that stores the value corresponding to each outcome of a random experiment, event, or activity. Example: for a coin flip, RV = {H, T}.
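As a small illustration of the idea, the sketch below extends the coin example to two flips and defines a random variable X as the number of heads in each outcome. The two-flip setup and the names sample_space, number_of_heads, and X are assumptions made for this example.

```python
from itertools import product

# Sample space of the random experiment "flip a coin twice"
sample_space = list(product("HT", repeat=2))


def number_of_heads(outcome):
    """Random variable: maps an outcome such as ('H', 'T') to a number."""
    return outcome.count("H")


# X stores the value corresponding to each outcome
X = {outcome: number_of_heads(outcome) for outcome in sample_space}

for outcome, value in X.items():
    print("".join(outcome), "->", value)

# Distinct values the random variable can take
possible_values = sorted(set(X.values()))
print("Possible values of X:", possible_values)
```

The random variable is the mapping from outcomes to values, so several outcomes (HT and TH here) can share the same value.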
https://en.wikipedia.org/wiki/Mode_(statistics)
Basic Probability
Probability is the measure of the likelihood that an event will occur. For a
random variable, we are interested in the probabilities of it taking each of its
possible values.
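One way to get a feel for these probabilities is to estimate them by simulation. The sketch below flips a simulated fair coin many times and uses the relative frequency of each value as an estimate of its probability. The fair coin, the seed, and the number of flips are assumptions chosen for the example.

```python
import random

# Fixed seed so the run is repeatable (arbitrary choice)
rng = random.Random(42)

n_flips = 10_000
flips = [rng.choice("HT") for _ in range(n_flips)]

# Relative frequency of each value approximates its probability
p_heads = flips.count("H") / n_flips
p_tails = flips.count("T") / n_flips

print(f"Estimated P(H) = {p_heads:.3f}")
print(f"Estimated P(T) = {p_tails:.3f}")
```

With a fair coin both estimates should land close to 0.5, and they always add up to 1 because every flip is either H or T.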
Hemant Rathore
18008338228 +65 31586636 +1(973) 598-3969 44 203-808-4216 www.cognixia.com • bdg@cognixia.com
Data Science with Python
Probability Distribution of RV
A table, chart, or formula that shows the relationship between the values of a random variable and their
corresponding probabilities, i.e. how the total probability is distributed across the values.
Values(X) 1 2 3 4 5 6
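As a sketch of what such a table can look like, the code below builds the probability distribution for the values 1 to 6 under the assumption that they are the faces of a fair die, so every value is equally likely. That assumption is introduced here for illustration and is not taken from the table above.

```python
from fractions import Fraction

# Assumption: X is the face shown by one roll of a fair six-sided die
values = [1, 2, 3, 4, 5, 6]

# Each value is equally likely, so P(X = x) = 1/6
distribution = {x: Fraction(1, len(values)) for x in values}

print("X   P(X)   decimal")
for x, p in distribution.items():
    print(f"{x}   {p}    {float(p):.3f}")

# The probabilities of a valid distribution always sum to 1
total = sum(distribution.values())
print("Total:", total)
```

Using Fraction keeps the probabilities exact, so the check that they sum to 1 is not affected by floating-point rounding.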
50-60 16 0.152 1
Total 105 1
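The two rows above look like the tail of a frequency table in which a count is converted to a relative frequency (16 / 105 ≈ 0.152) and a running cumulative total that ends at 1. The sketch below shows that calculation. Only the 50-60 class with a count of 16 and the total of 105 come from the rows above. The other class intervals and counts are hypothetical placeholders chosen so that the total is 105.

```python
from itertools import accumulate

# Hypothetical class intervals and counts, except the last row (50-60, 16)
classes = ["10-20", "20-30", "30-40", "40-50", "50-60"]
frequencies = [12, 25, 30, 22, 16]

total = sum(frequencies)

# Relative frequency = class count / total count
relative = [f / total for f in frequencies]

# Cumulative relative frequency = running sum of relative frequencies
cumulative = list(accumulate(relative))

for label, f, r, c in zip(classes, frequencies, relative, cumulative):
    print(f"{label}  {f:>3}  {r:.3f}  {c:.3f}")
print(f"Total  {total}  {sum(relative):.0f}")
```

The relative frequencies play the role of estimated probabilities for each class, which is why they sum to 1.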