Data Science 3

This document provides an introduction and syllabus for a course on Introduction to Data Science. It includes 5 units that will cover topics such as data collection and management, data analysis, data visualization, and case studies. Data analysis concepts that will be discussed include measures of central tendency, distributions, and basic machine learning algorithms. Students will learn how to interpret and visualize data, and apply coding techniques to handle data.

Uploaded by

Akhil Reddy

DEPARTMENT OF ARTIFICIAL INTELLIGENCE &

MACHINE LEARNING

INTRODUCTION TO DATA SCIENCE

LECTURE NOTES – UNIT 3

B. TECH
II YEAR – II SEM (Sec-A & B)
Academic Year 2022-23

Prepared & compiled by

DR.G. ARUN SAMPAUL THOMAS,


ASSOCIATE PROFESSOR & HOD, DEPARTMENT OF AI&ML
J.B.I.E.T
Bhaskar Nagar, Yenkapally(V), Moinabad(M),

Ranga Reddy(D), Hyderabad – 500 075, Telangana, India.


J. B. Institute of Engineering and Technology (UGC Autonomous)
AY 2020-21 onwards | B. Tech: AI & ML | II Year – II Sem
Course Code: J22D3
Course Title: INTRODUCTION TO DATA SCIENCE
L T P D: 2 0 0 0 | Credits: 2

Pre-requisite:
Database Management Systems, Data Structures

Course Objectives:
This course will enable students to:
• Know about the fundamental concepts and technologies of Data Science.
• Explore the various Data collection and storage methods.
• Understand the Data Analysis, statistics, and various machine learning algorithms.
• Investigate the visualization of data and apply coding techniques to data for
securing the data.
• Study the applications of Data Science, technologies for visualization, and handling of
variables using Python.

UNIT-I - Introduction to Data Science


Introduction to core concepts and technologies: Introduction, Terminology, Data science
Process, data science toolkit, Types of data, Example applications

UNIT-II - Data collection and management:


Introduction, Sources of data, Data collection and APIs, Exploring and fixing data. Data storage
and management, using multiple data sources.

UNIT-III - Data analysis:


Introduction, Terminology and concepts, Introduction to statistics, Central tendencies and
distributions, Variance, Distribution properties and arithmetic, Samples/CLT. Basic machine
learning algorithms, Linear regression, SVM, Naive Bayes.

UNIT-IV - Data visualization:


Introduction, Types of data visualization, Data for visualization:
Data types, Data encodings, Retinal variables, mapping variables to encodings, Visual
encodings.

UNIT-V - Practices and Case Studies in Data Science:


Applications of Data Science, Technologies for visualization, Recent trends in various data
collection and analysis techniques, various visualization techniques, application development
methods used in data science. Demonstrate some case studies such as Marketing, Finance, HR,
Manufacturing, Healthcare, etc.

Textbooks:
1. Cathy O’Neil, Rachel Schutt, Doing Data Science: Straight Talk from the Frontline. O’Reilly,
2013.
2. Jure Leskovec, Anand Rajaraman, Jeffrey Ullman, Mining of Massive Datasets, v2.1,
Cambridge University Press, 2014.
Reference Books:
1. Joel Grus, “Data Science from Scratch”, O'Reilly, 2015.
2. Gupta, S.C. and Kapoor, V.K., “Fundamentals of Mathematical Statistics”, Sultan
Chand & Sons, New Delhi, 11th Ed., 2002.
3. Hastie, Trevor, et al., “The Elements of Statistical Learning”, Springer, 2009.
4. Wes McKinney, “Python for Data Analysis”, O'Reilly Media, 2012.

Course Outcomes:
The student will be able to
• Identify the basic concepts of data science and the types of data.
• Analyse how to collect, manage, explore, and store data.
• Implement the basic measures of central tendency and classify data using SVM and
Naive Bayes.
• Interpret the visualization of data and apply coding techniques to data for securing the
data.
• Analyse the various concepts of data science and handle simple
applications of data science using Python.

WEBSITE REFERENCES FOR SELF LEARNING


1. https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-learn-data-science-python-scratch-2/
2. https://www.rstudio.com/online-learning/
INTRODUCTION TO DATA SCIENCE

UNIT– III
• Intro to Data Analysis
• Basics of Terminology and concepts

DR. G. ARUN SAMPAUL THOMAS


Associate Professor & HOD – Department of AI&ML
J.B. Institute of Engineering and Technology
Hyderabad, Telangana
arunsam.infotech@gmail.com arunthomas.ai_ml@jbiet.edu.in
Topics Covered
• Descriptive analysis
o Ratio, proportion, percentage, and rate
o Median, mean, and trend
• Selection of the appropriate chart
Data Analysis: Key Concepts
Data Analysis

• Analysis: Turning raw data into useful information

• Purpose: To provide answers to questions being asked by a health program

• Even the greatest amount and best quality of data mean nothing if data are not properly analyzed, or analyzed at all.
Data Analysis

• Analysis does not mean using a computer software package.

• Analysis is looking at the data in light of the questions you need to answer:
  o How would you analyze data to determine: “Is my program meeting its objectives?”
Answering Program Questions

Question: Is my program meeting its objectives?

Analysis: Compare program targets and actual program performance to learn how far you are from the targets.

Interpretation: Why have you achieved or not achieved a target, and what does this mean for your program?

Answering these questions may require more information.
Descriptive Analysis

• Describes the sample/target population (demographic and clinical characteristics)

• Does not define causality; tells you what, not why

• Example: Average number of clients seen per month
Basic Terminology and Concepts

Statistical terms
• Ratio
• Proportion
• Percentage
• Rate
• Mean
• Median
• Trend
Central Tendency

Measures of the location of the middle or the center of a distribution of data
• Mean
• Median
Mean

• The average of your dataset

• The value obtained by dividing the sum of a set of quantities by the number of quantities in the set

Example:
(22+18+30+19+37+33) ÷ 6 = 159 ÷ 6 = 26.5

• The mean is sensitive to extreme values

Calculating the Mean

Average number of clients counseled per month

Month       Clients
January     30
February    45
March       38
April       41
May         37
June        40

30+45+38+41+37+40 = 231 clients
231 clients ÷ 6 months = 38.5
Mean = 38.5 clients/month
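
The calculation above can be reproduced in a few lines of Python. This is a minimal sketch using only the standard library; the variable names are illustrative and not part of the course material.

```python
from statistics import mean

# Monthly client counts taken from the slide (January through June).
clients_per_month = {
    "January": 30, "February": 45, "March": 38,
    "April": 41, "May": 37, "June": 40,
}

counts = list(clients_per_month.values())

# Manual calculation: sum of the quantities divided by how many there are.
total_clients = sum(counts)
mean_clients = total_clients / len(counts)

# statistics.mean performs the same division for us.
library_mean = mean(counts)

print(f"Total clients: {total_clients}")
print(f"Mean (manual): {mean_clients} clients/month")
print(f"Mean (statistics.mean): {library_mean} clients/month")
```

Both approaches give the same value; the manual version simply makes the "sum ÷ count" definition explicit.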
Median

• The middle of a distribution when the numbers are in order; that is, half of the numbers are above the median and half are below the median

• The median is not as sensitive to extreme values as the mean.

• Odd number of numbers: median = the middle number
  Median of 2, 4, 7 = 4

• Even number of numbers: median = mean of the two middle numbers
  Median of 2, 4, 7, 12 = (4+7)/2 = 5.5
Calculating the Median

Client 1 – 2
Client 2 – 134
Client 3 – 67
Client 4 – 10
Client 5 – 221

Median of clients 1–5 = 67
(in order: 2, 10, 67, 134, 221; the middle value is 67)
Median of clients 1–4 = 38.5
(in order: 2, 10, 67, 134; (10+67)/2 = 38.5)
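
Because the median is defined on values placed in order, the client values must be sorted before the middle value (or the two middle values) is picked. The sketch below spells out that rule and checks it against the standard library; the helper name `manual_median` is hypothetical.

```python
from statistics import median

def manual_median(values):
    """Return the median by sorting first, then taking the middle value(s)."""
    ordered = sorted(values)          # the definition assumes ordered data
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd count: the single middle value
        return ordered[mid]
    # even count: mean of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

clients_1_to_5 = [2, 134, 67, 10, 221]   # values as listed on the slide
clients_1_to_4 = clients_1_to_5[:4]      # 2, 134, 67, 10

print(sorted(clients_1_to_5), "->", manual_median(clients_1_to_5))
print(sorted(clients_1_to_4), "->", manual_median(clients_1_to_4))

# The standard-library function sorts internally and agrees.
assert manual_median(clients_1_to_5) == median(clients_1_to_5)
assert manual_median(clients_1_to_4) == median(clients_1_to_4)
```

Picking the middle positions of the unsorted list would pair 134 with 67, which is why the sort step matters.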
Mean vs. Median: When to Use One or the Other?

EXAMPLE 1

Facility      # patients / dr.
Facility 1    20
Facility 2    22
Facility 3    26
Facility 4    29
Facility 5    34
Facility 6    38
Facility 7    39

Mean = 29.7   Median = 29
Mean vs. Median: When to Use One or the Other?

EXAMPLE 2

Facility      # patients / dr.
Facility 1    8
Facility 2    38
Facility 3    39
Facility 4    40
Facility 5    45
Facility 6    46
Facility 7    140

Mean = 50.9 (356 ÷ 7)   Median = 40
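
The two facility tables can be compared side by side in code. This is a small sketch with illustrative variable names; it shows that when the values are spread fairly evenly the mean and median stay close, while a single extreme value (140 patients per doctor) pulls the mean well away from the median.

```python
from statistics import mean, median

# Patients per doctor for the seven facilities in each table.
evenly_spread = [20, 22, 26, 29, 34, 38, 39]
with_extremes = [8, 38, 39, 40, 45, 46, 140]

summary = {}
for label, data in [("evenly spread", evenly_spread),
                    ("with extremes", with_extremes)]:
    m = round(mean(data), 1)
    med = median(data)
    summary[label] = (m, med)
    print(f"{label}: mean = {m}, median = {med}, gap = {round(m - med, 1)}")
```

A large gap between the two measures is a practical hint that the median is the safer summary of a "typical" facility.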
Use the Mean or the Median?

Client       CD4 count
Client 1     9
Client 2     11
Client 3     92
Client 4     92
Client 5     95
Client 6     100
Client 7     100
Client 8     101
Client 9     104
Client 10    206
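
One way to approach the question is to compute both measures and then see how much each one moves when the extreme counts (9, 11, and 206) are set aside. The sketch below does that; treating those three values as the extremes is an assumption made for illustration, not a rule from the slides.

```python
from statistics import mean, median

cd4_counts = [9, 11, 92, 92, 95, 100, 100, 101, 104, 206]

# Assumption for illustration: the two lowest and the highest value are the extremes.
without_extremes = [c for c in cd4_counts if c not in (9, 11, 206)]

full_mean, full_median = mean(cd4_counts), median(cd4_counts)
trim_mean, trim_median = mean(without_extremes), median(without_extremes)

print(f"All clients:      mean = {full_mean:.1f}, median = {full_median}")
print(f"Without extremes: mean = {trim_mean:.1f}, median = {trim_median}")
print(f"Shift in mean:   {trim_mean - full_mean:.1f}")
print(f"Shift in median: {trim_median - full_median}")
```

The mean shifts more than the median when the extremes are dropped, which illustrates why the median is often preferred for data with a few very low or very high values.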
Trend

• A trend is a pattern of gradual change in a condition, output, or process, or an average or general tendency of a series of data points to move in a certain direction over time, represented by a line or curve on a graph.

• To follow a trend you must not only be aware of what is currently happening but also be astute enough to predict what is going to happen in the future.
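
A simple way to put a number on a trend is to fit a straight line through the data points and read off its slope. The sketch below uses an ordinary least-squares line as one possible technique; the yearly figures are hypothetical values invented for illustration and are not taken from the course charts, and the function name `linear_trend` is likewise an assumption.

```python
def linear_trend(xs, ys):
    """Least-squares straight line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# HYPOTHETICAL data: people on treatment, in thousands, per year.
years = [2008, 2009, 2010, 2011]
people_thousands = [60, 95, 130, 170]

slope, intercept = linear_trend(years, people_thousands)
projection_2012 = slope * 2012 + intercept

print(f"Average change per year: {slope:.1f} thousand")
print(f"Straight-line projection for 2012: {projection_2012:.1f} thousand")
```

The slope summarizes the direction and speed of change; the projection simply extends the same line one step forward and should be read as a rough guide, not a forecast.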
Calculating Trends

[Figure: Adults and children on antiretroviral therapy (ART), 2008–2011. Y-axis: # of people (in thousands), 0 to 200; X-axis: year, 2008 to 2011; series: # adults on ART, # children on ART.]
Calculating Trends

[Figure: Adults on ART and children on ART, 2011. Y-axis: # of people (hundreds), 0 to 200; X-axis: month, Jan to Dec; series: # adults on ART, # children on ART.]
Key Messages

• Purpose of analysis: Provide answers to programmatic questions

• Descriptive analyses describe the sample or target population.

• Descriptive analyses do not define causality. That is, they tell you what, not why.
SELECT THE RIGHT CHART
Types of Charts
5 Questions to Ask Yourself When Choosing a Chart

1. Want to compare values?

Charts are perfect for comparing one or many value sets, and they can easily show the low and high values in the data sets.

Use these charts to show comparisons:
• Column/bar
• Circular area
• Line
• Scatter plot
• Bullet

2. Want to show the composition of something?

Composition charts show how individual parts make up the whole of something (such as the devices used by mobile visitors to your website, or total sales broken down by sales rep).

Use these charts to show composition:
• Pie
• Stacked bar
• Stacked column
• Area

3. Want to understand the distribution of your data?

Distribution charts help you to understand outliers, the normal tendency, and the range of information in your values.

Use these charts to show distribution:
• Scatter plot
• Line
• Column
• Bar

4. Interested in analyzing trends in your data set?

If you want more information about how a data set performed during a specific period, there are specific chart types that do this extremely well.

Use these charts to analyze trends:
• Line
• Dual-axis line
• Column

5. Want to better understand the relationships among value sets?

Relationship charts are designed to show how one variable relates to one or many different variables. You could show how something positively affects (or has no effect, or negatively affects) another variable.

Use these charts to show relationships:
• Scatter plot
• Bubble
• Line
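
The five questions and their chart lists can be collected into a small lookup table. The sketch below is only a convenience for review; the goal keywords and the function names `suggest_charts` and `goals_for_chart` are hypothetical, while the chart lists are copied from the five questions.

```python
# Chart suggestions per analysis goal, as listed under the five questions.
CHARTS_BY_GOAL = {
    "comparison":   ["Column/bar", "Circular area", "Line", "Scatter plot", "Bullet"],
    "composition":  ["Pie", "Stacked bar", "Stacked column", "Area"],
    "distribution": ["Scatter plot", "Line", "Column", "Bar"],
    "trend":        ["Line", "Dual-axis line", "Column"],
    "relationship": ["Scatter plot", "Bubble", "Line"],
}

def suggest_charts(goal):
    """Return the chart types suggested for a goal keyword."""
    key = goal.strip().lower()
    if key not in CHARTS_BY_GOAL:
        raise ValueError(f"Unknown goal {goal!r}; choose from {sorted(CHARTS_BY_GOAL)}")
    return CHARTS_BY_GOAL[key]

def goals_for_chart(chart):
    """Return every goal whose list includes the given chart type."""
    return [goal for goal, charts in CHARTS_BY_GOAL.items() if chart in charts]

print(suggest_charts("trend"))
print(goals_for_chart("Line"))
```

Looking a chart up in reverse shows that the line chart appears under four of the five questions, which is why the question being asked, not the chart type alone, should drive the choice.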
Examples of Charts to Choose When Analyzing Data

Column
• To show a comparison among different items
• To show a comparison of items over time
[Figure: % of HIV-positive women per region]

Bar
• Should be used to avoid clutter when one data label is long or if you have more than 10 items to compare
• Can also be used to display negative numbers
[Figure: Enrollment of HIV clients in ART in 3 regions]

Line
• A line chart reveals trends or progress over time.
• Can be used to show many different categories of data
• Use a line chart to show a continuous data set.
[Figure: Number of clinicians working in each clinic in Years 1–4]

Dual axis
• Used with 2–3 data sets, at least one of which is based on a continuous set of data, and another of which is better suited to being grouped by category
• Should be used to visualize a correlation, or the lack thereof, between these data sets

Area
• Useful for showing part-to-whole relationships, such as individual data’s contribution to the total for a given period
• Helps you analyze both overall and individual trend information
[Figure: Enrollment of HIV clients in ART in 3 regions]

Stacked bar
• Should be used to compare many items and show the composition of each one
• Represents components of a whole and compares wholes
[Figure: Number of months female and male patients have been enrolled in HIV care, by age group]

Pie
• Represents percentages, with the segments totaling 100%

Scatter plot
• Can show the relationship between two variables, or reveal distribution trends
• Should be used when there are many data points, and you want to highlight similarities in the data set
• Useful when you are looking for outliers or want to understand the distribution of your data
[Figure: Customer happiness, by response time]
INTRODUCTION TO DATA SCIENCE

UNIT– III
• Descriptive Statistics
• Central tendency, Variance, Mean, Median etc., Concepts
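
As a minimal sketch of the variance named in the topic list above, the following reuses the monthly client counts from the earlier mean calculation. Whether the course intends the population formula (divide by n) or the sample formula (divide by n − 1) is not stated here, so both are shown; that choice is an assumption.

```python
from statistics import mean, pvariance, variance, stdev

# Monthly client counts from the "Calculating the Mean" example.
clients = [30, 45, 38, 41, 37, 40]

centre = mean(clients)
squared_deviations = [(x - centre) ** 2 for x in clients]

# Population variance: average squared deviation from the mean.
population_var = sum(squared_deviations) / len(clients)
# Sample variance: divide by n - 1 instead of n.
sample_var = sum(squared_deviations) / (len(clients) - 1)

# The standard library provides both forms directly.
assert abs(population_var - pvariance(clients)) < 1e-9
assert abs(sample_var - variance(clients)) < 1e-9

print(f"Mean: {centre}")
print(f"Population variance: {population_var:.2f}")
print(f"Sample variance: {sample_var:.2f}")
print(f"Sample standard deviation: {stdev(clients):.2f}")
```

The standard deviation is the square root of the variance and is expressed in the same unit as the data (clients per month), which usually makes it easier to interpret.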

INTRODUCTION TO DATA SCIENCE

UNIT– III
• Distribution properties and arithmetic
• Samples / Central Limit Theorem (CLT)

Example – Die Distribution - CLT
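
A die example of the Central Limit Theorem can be simulated directly. The sketch below is one possible illustration, not necessarily the example used in class: a single fair die is uniform on 1 to 6, yet the means of repeated samples of 30 rolls cluster around 3.5 in a bell-like shape. The sample size, number of samples, seed, and function name are all assumptions chosen for illustration.

```python
import random
from statistics import mean, pstdev

def die_sample_means(rolls_per_sample, num_samples, seed=0):
    """Roll a fair die rolls_per_sample times, num_samples times over; return each sample's mean."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        mean([rng.randint(1, 6) for _ in range(rolls_per_sample)])
        for _ in range(num_samples)
    ]

means_of_30 = die_sample_means(rolls_per_sample=30, num_samples=2000)

# Theory: a fair die has mean 3.5 and variance 35/12, so the mean of
# 30 rolls has standard deviation sqrt((35/12) / 30).
theoretical_sd = (35 / 12 / 30) ** 0.5

print(f"Mean of sample means: {mean(means_of_30):.3f} (theory: 3.5)")
print(f"SD of sample means:   {pstdev(means_of_30):.3f} (theory: {theoretical_sd:.3f})")

# Rough text histogram of the sample means, in buckets of width 0.25.
buckets = {}
for m in means_of_30:
    key = round(m * 4) / 4
    buckets[key] = buckets.get(key, 0) + 1
for key in sorted(buckets):
    print(f"{key:4.2f} {'#' * (buckets[key] // 20)}")
```

Increasing `rolls_per_sample` narrows the histogram around 3.5, which is the behaviour the theorem describes.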
INTRODUCTION TO DATA SCIENCE

UNIT– III
• Intro to ML
• Basic ML Algorithms

Supervised vs. Unsupervised
INTRODUCTION TO DATA SCIENCE

UNIT– III
• Linear Regression
• SVM & its Apps
