DS Lab Manual Final

Uploaded by

Vivek Panchal


Data Science (3151608)

Data Science Lab Manual

(3151608)

L.D. College of Engineering, Ahmedabad
(Affiliated to Gujarat Technological University)

Enrollment No.: 200280723001 1

Index

Sr. No. | Practical | Date | Page No. | Sign
1 | Perform descriptive analysis and identify the data type. | | |
2 | Implement a method to find out variation in data. For example, the difference between highest and lowest marks in each subject semester wise. | | |
3 | Plot the graph showing the result of students in each semester. | | |
4 | Plot the graph showing the geographical location (City Wise) of students. | | |
5 | Plot the graph showing the number of male and female students. | | |
6 | Implement a method to treat missing value for gender and missing value for marks. | | |
7 | To predict the price based on total orders placed in a particular area. Use appropriate regression method. | | |
8 | Classify the student as average or clever. Use appropriate classification technique based on data set. | | |
9 | Use Titanic Dataset from www.kaggle.com and perform Titanic Survivor Analysis. 1) Count the minimum and maximum age of passengers. 2) How many % of passengers had survived? (Male and Female) 3) Plot the graph based on gender and survival. | | |

Consider a dataset containing each student's name, gender, enrollment number, 4th-semester result
with marks in each subject, mobile number, and city. Implement the following in Python or R.


Practical-1
Aim: Perform descriptive analysis and identify the data type.

Problem/ Description:

Descriptive Analysis of data:

Descriptive statistics are the building blocks of data science; advanced analytics is often
incomplete without a descriptive analysis of the key metrics. In simple terms, descriptive
statistics are measures that summarize a given dataset. They fall into two groups: measures of
central tendency and measures of dispersion (variability). Measures of central tendency include
the mean, median, and mode, while measures of dispersion include the standard deviation,
variance, and interquartile range. In this practical, we will learn how to compute these measures
and use them to interpret the data.
1. Mean
2. Median
3. Mode
4. Standard Deviation
5. Variance
6. Interquartile Range
7. Skewness
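As a minimal sketch, the seven measures listed above can be computed with pandas. The marks below are made-up illustrative values, not taken from the lab's csv file, and the variable names are assumptions.

```python
import pandas as pd

# Illustrative marks only; not taken from the lab's csv file.
marks = pd.Series([45, 50, 50, 60, 70, 85], name="marks")

mean = marks.mean()                # arithmetic average
median = marks.median()            # middle value of the sorted data
mode = marks.mode().tolist()       # most frequent value(s); there can be several
std = marks.std()                  # sample standard deviation (pandas default ddof=1)
var = marks.var()                  # sample variance (pandas default ddof=1)
iqr = marks.quantile(0.75) - marks.quantile(0.25)  # interquartile range
skew = marks.skew()                # positive when the right tail is longer

print(mean, median, mode, std, var, iqr, skew)
```

Note that pandas divides by n - 1 by default in std() and var(); pass ddof=0 if the population versions are wanted instead.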
Data and its types:
Categorical Data
Categorical data represents characteristics, such as a person’s gender or language.
1. Nominal Data
Nominal values represent discrete units and are used to label variables that have no quantitative
value. Nominal data has no order, so changing the order of its values does not change the
meaning.
2. Ordinal Data
Ordinal values represent discrete and ordered units. Ordinal data is therefore nearly the same as
nominal data, except that its ordering matters. Education level (Elementary, High School, College)
is an example. Note that the difference between Elementary and High School is not the same as the
difference between High School and College. This is the main limitation of ordinal data: the
differences between the values are not really known.
Numerical Data
1. Discrete Data
We speak of discrete data if the data can only take on certain values. This type of data can’t be
measured but it can be counted. It basically represents information that can be grouped into
distinct, countable values. An example is the number of heads in 100 coin flips. To check whether
data is discrete, ask two questions: can you count it, and can it be divided into smaller and
smaller parts? Discrete data can be counted but not subdivided. If, on the contrary, the data can
be measured but not counted, we speak of continuous data.


2. Continuous Data
Continuous data represents measurements, so its values can’t be counted but they can be measured.
An example is the height of a person, which can only be described using intervals on the real
number line.
i. Interval Data
Interval values represent ordered units that have the same difference. We speak of interval data
when a variable contains numeric values that are ordered and the exact differences between the
values are known. A good example is a feature that contains the temperature of a given place in
degrees Celsius. The problem with interval data is that it has no true zero. As a result, ratios
are not meaningful (20 °C is not twice as warm as 10 °C), and descriptive and inferential
statistics that rely on ratios can’t be applied.
ii. Ratio Data
Ratio values are also ordered units that have the same difference. They are the same as interval
values, except that they do have an absolute zero. Good examples are height, weight, and length.
Hints:
Import the data and apply various statistical calculations to the columns. Then identify the data
type of each attribute, i.e., whether it is categorical or numerical.
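One possible way to follow the hint is to let pandas report each column's storage type and then map it to a statistical type. The sketch below uses made-up records and assumed column names; the helper describe_types is hypothetical, and its dtype-based rule is only a heuristic.

```python
import pandas as pd

# Made-up student records; the column names are assumptions.
df = pd.DataFrame({
    "name": ["S1", "S2", "S3"],
    "gender": ["F", "M", "F"],
    "city": ["Ahmedabad", "Surat", "Ahmedabad"],
    "maths": [78, 64, 91],
    "cgpa": [8.1, 6.9, 9.3],
})

def describe_types(frame):
    """Label each column as numerical (discrete/continuous) or categorical."""
    kinds = {}
    for col in frame.columns:
        if pd.api.types.is_integer_dtype(frame[col]):
            kinds[col] = "numerical (discrete)"
        elif pd.api.types.is_float_dtype(frame[col]):
            kinds[col] = "numerical (continuous)"
        else:
            kinds[col] = "categorical"
    return kinds

print(df.dtypes)            # storage type of each column
print(describe_types(df))   # heuristic statistical type of each column
```

The heuristic needs a manual check: an enrollment number or mobile number is stored as an integer but is really a nominal label, so it should be treated as categorical.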


Code:
#code here


Output:
#output here

Conclusion:
#Conclusion


Practical-2
Aim: Implement a method to find out variation in data. For example, the difference
between highest and lowest marks in each subject semester wise. Reference: csv file

Problem Statement:

The main purpose of studying variation is quality assurance: measuring the dispersion of the
population data of a probability or frequency distribution, or determining the content or quality
of the sample data of substances.

Hints:
Measures of Variability: Variance
1. Find the mean of the data set.
2. Subtract the mean from each value in the data set.
3. Square each of these differences so that all values are positive.
4. Finally, divide the sum of the squares by the total number of values in the set to find the
(population) variance.
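The four steps above can be sketched in plain Python as follows. The function name population_variance and the marks are illustrative assumptions. Because the last step divides by the total number of values, the result is the population variance; dividing by one less than that number would give the sample variance.

```python
import statistics

def population_variance(values):
    """Follow the four steps: mean, deviations, squares, average of squares."""
    mean = sum(values) / len(values)            # step 1: mean
    deviations = [v - mean for v in values]     # step 2: subtract the mean
    squares = [d ** 2 for d in deviations]      # step 3: square each difference
    return sum(squares) / len(values)           # step 4: divide by the count

marks = [45, 50, 50, 60, 70, 85]  # illustrative marks only
print(population_variance(marks))
print(statistics.pvariance(marks))  # standard-library cross-check
```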

Description:
Types of Variation:
There are two basic types of variation that can occur in a process:
● common cause
● special cause
Common Cause:
Common cause variation happens under standard operating conditions. Consider, for example, a
factory that makes metal parts. Fluctuations might occur due to:
● temperature
● humidity
● metal quality
● machine wear and tear.
Common cause variation has a trend that you can chart. In the example factory, product
differences might be caused by air humidity. You can chart those differences over time and then
compare that chart to weather bureau humidity data.
Special Cause:


Conversely, special cause variation occurs under non-standard operating conditions. Let’s go back
to the example factory. Disparities could occur if:
● a substandard metal was delivered.
● one of the machines broke down.
● a worker forgot the process and made a lot of unusual mistakes.
Variance is the square of the standard deviation (SD):
Variance = SD²
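For the aim of this practical, the difference between the highest and lowest marks in each subject per semester can be sketched with a pandas groupby. The marks and the column names "semester", "maths", and "physics" are assumptions made for illustration; the real csv file may be laid out differently.

```python
import pandas as pd

# Made-up marks; the column names are assumptions.
df = pd.DataFrame({
    "semester": [1, 1, 1, 2, 2, 2],
    "maths":    [55, 72, 90, 61, 48, 83],
    "physics":  [40, 65, 58, 77, 69, 52],
})

subjects = ["maths", "physics"]
grouped = df.groupby("semester")[subjects]

# Range = highest mark - lowest mark, per subject within each semester.
mark_range = grouped.max() - grouped.min()

# Sample variance and standard deviation (pandas divides by n - 1 by default).
mark_var = grouped.var()
mark_std = grouped.std()

print(mark_range)
print(mark_var)
```

Squaring mark_std reproduces mark_var, which is the relationship between variance and standard deviation stated above.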


Code:
#code here


Output:
#output here

Conclusion:
#Conclusion


Practical-3

Aim: Plot the graph showing the result of students in each semester. Reference: csv file

Problem/Description:
Introduction to pyplot:
matplotlib.pyplot is a collection of functions that make matplotlib work like MATLAB.
Each pyplot function makes some change to a figure: e.g., creates a figure, creates a plotting area
in a figure, plots some lines in a plotting area, or decorates the plot with labels.
In matplotlib.pyplot, various states are preserved across function calls, so that it keeps track
of things like the current figure and plotting area, and the plotting functions are directed to the
current axes. Use a histogram for this practical.

Hints:
Import the data and extract the columns required for plotting. Then plot the graph that fulfills
the aim.
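A minimal sketch of such a histogram is shown below. The results, the column names "semester" and "percentage", the bin edges, and the helper plot_results are all assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up results; "semester" and "percentage" are assumed column names.
df = pd.DataFrame({
    "semester":   [1, 1, 1, 1, 2, 2, 2, 2],
    "percentage": [52, 61, 68, 74, 58, 66, 79, 85],
})

def plot_results(frame, bin_edges=(50, 60, 70, 80, 90)):
    """Overlay one histogram of percentages per semester on shared bins."""
    fig, ax = plt.subplots()
    for sem, group in frame.groupby("semester"):
        ax.hist(group["percentage"], bins=list(bin_edges), alpha=0.5,
                label=f"Semester {sem}")
    ax.set_xlabel("Percentage")
    ax.set_ylabel("Number of students")
    ax.set_title("Result distribution by semester")
    ax.legend()
    return ax

ax = plot_results(df)
# plt.show()  # uncomment to open the plot window
```

Using the same bin edges for every semester keeps the overlaid histograms comparable.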


Code:
#code here


Output:
#output here

Conclusion:
#conclusion


Practical-4

Aim: Plot the graph showing the geographical location (City Wise) of students.

Problem Statement:
Generate a bar graph of Cities vs Students such that the city name is on the X axis and
the number of students in that city is on the Y axis. Reference: csv file
Description:
A bar graph plots two variables using the X and Y axes. It is useful when one
variable is numerical and the other is categorical.
It is common practice to keep the numerical values on the Y axis and the categorical
ones on the X axis.
Hints:
Prepare a dataset with the columns ‘City Name’ and ‘Number of Students’ and
store it as a csv file. Read the file and plot it using the required libraries, for
example matplotlib.
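As a rough sketch, the city-wise bar graph could be produced as follows. The records and the column name "city" are made up for illustration; with a real file, the DataFrame would come from pd.read_csv instead. Here the count per city is derived from one row per student; if the csv file already stores a count per city, that column can be plotted directly.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up records, one row per student; "city" is an assumed column name.
df = pd.DataFrame({"city": ["Ahmedabad", "Surat", "Ahmedabad",
                            "Vadodara", "Surat", "Ahmedabad"]})

# Number of students per city, sorted by city name.
counts = df["city"].value_counts().sort_index()

fig, ax = plt.subplots()
ax.bar(counts.index, counts.values)     # categorical X axis, numerical Y axis
ax.set_xlabel("City")
ax.set_ylabel("Number of students")
ax.set_title("Students by city")
# plt.show()  # uncomment to open the plot window
```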


Code:
#code here


Output:
#output here

Conclusion:
#conclusion


Practical-5
Aim: Plot the graph showing the number of male and female students.

Problem Statement:
Generate a bar graph of Male/Female vs Total Count such that the count values are on
the X axis and the Male/Female categories are on the Y axis. Reference: data.csv file
Description:
In common practice, we keep numeric values on the Y axis, but in this
problem we have to put them on the X axis and the categorical values (Male/Female)
on the Y axis. The result is a horizontal bar graph.
Hints:
Set the values of the variables Male and Female and plot them on the axes mentioned
above using the required libraries, for example matplotlib.
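A horizontal bar graph with the counts on the X axis can be sketched with matplotlib's barh. The records, the column name "gender", and the codes "M" and "F" are assumptions for illustration; data.csv may encode gender differently.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up records; "gender" and the codes "M"/"F" are assumptions.
df = pd.DataFrame({"gender": ["M", "F", "M", "M", "F"]})

# Count students per gender and fix the display order.
counts = df["gender"].map({"M": "Male", "F": "Female"}).value_counts()
counts = counts.reindex(["Male", "Female"], fill_value=0)

fig, ax = plt.subplots()
ax.barh(counts.index, counts.values)    # barh puts the categories on the Y axis
ax.set_xlabel("Number of students")
ax.set_ylabel("Gender")
ax.set_title("Male and female students")
# plt.show()  # uncomment to open the plot window
```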


Code:
#code here


Output:
#output here

Conclusion:
#conclusion


Practical-6
Aim: Implement a method to treat missing value for gender and missing value for
marks.
Problem:

Find all the null values for gender and for all subject marks, and replace them with
0. Reference: data.csv file
Description:
Ways to treat missing values:
1) Ignore the tuple (record/row):
● Usually done when the class label is missing.
Example:
● The task is to distinguish between two types of emails, "spam" and "non-spam" (ham).
● "Spam" and "non-spam" are called class labels.
● If an email arrives with its class label missing, it is discarded.
2) Fill in the missing value manually:
● Use the attribute mean (average) to fill in the missing value, or use the attribute mean
(average) of all samples belonging to the same class.
3) Use a global constant to fill in the missing value:
● Replace all missing attribute values with the same constant, such as the label “Unknown”.

Hints:
Import the data and check for missing values. Then, using a few lines of code for each, drop the
missing data, replace it with the mean, and replace it with the label “Unknown”.
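The treatments described above might look like this in pandas. The records and the column names "gender" and "maths" are assumptions for illustration. In this sketch the marks are filled with 0 or with the mean, while the categorical gender column is filled with the label "Unknown" rather than 0, which is one possible reading of the task.

```python
import numpy as np
import pandas as pd

# Made-up records with gaps; the column names are assumptions.
df = pd.DataFrame({
    "gender": ["M", None, "F", "M"],
    "maths":  [70.0, np.nan, 55.0, 85.0],
})

print(df.isnull().sum())      # number of missing values per column

# 1) Ignore the tuple: drop every row that has a missing value.
dropped = df.dropna()

# Problem statement: replace missing marks with 0.
zero_filled = df.copy()
zero_filled["maths"] = zero_filled["maths"].fillna(0)

# 2) Attribute mean for marks, 3) global constant for gender.
filled = df.copy()
filled["maths"] = filled["maths"].fillna(filled["maths"].mean())
filled["gender"] = filled["gender"].fillna("Unknown")

print(filled)
```

Filling marks with 0 lowers the column mean, so the mean-based fill is usually the gentler choice when the marks feed later statistics.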


Code:
#code here


Output:
#output here

Conclusion:
#conclusion


Practical-7
Aim: To predict the price based on total orders placed in a particular area. Use
appropriate regression method.

Problem Statement:
Using the areaorders.csv file, predict the price for an area that has an order value of 3300.
Reference: areaorders.csv file
Description:
What is Regression?
Regression is a method to determine the statistical relationship between a dependent variable and
one or more independent variables. A change in the dependent variable is associated with
changes in the independent variables. Regression comes in several types, but it can be broadly
classified into two major types:
1. Linear Regression
The simplest case of linear regression finds a relationship, using a linear model (i.e., a line),
between a single independent input variable (feature) and a dependent output variable. This
is called Simple (Bivariate) Linear Regression. When a linear model represents the relationship
between a dependent output and multiple independent input variables, it is called
Multiple Linear Regression. The dependent variable is continuous; the independent variables
may or may not be continuous. We find the relationship between them with the help of the best-fit
line, which is also known as the regression line.
2. Logistic Regression
It is used when the output is categorical, so it is really a classification method. The output can
be Success/Failure, Yes/No, True/False, or 0/1. There is no need for a linear relationship between
the dependent output variable and the independent input variables. If the output has only two
possibilities, it is called Binary Logistic Regression. If the dependent output has more than two
possibilities and there is no ordering among them, it is called Multinomial Logistic
Regression. If there is an order associated with the output and there are more than two
possibilities, it is called Ordinal Logistic Regression.
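Since price is a continuous output, a simple linear regression is one reasonable choice for this practical. The sketch below fits the best-fit line by ordinary least squares with numpy.polyfit; scikit-learn's LinearRegression would serve equally well. The order and price values are made up (the contents of areaorders.csv are not shown here), and predict_price is a hypothetical helper.

```python
import numpy as np

# Made-up training data; the real values would come from areaorders.csv.
orders = np.array([1000, 2000, 3000, 4000], dtype=float)
price = np.array([600, 1100, 1600, 2100], dtype=float)

# Fit price = slope * orders + intercept by least squares (degree-1 polynomial).
slope, intercept = np.polyfit(orders, price, deg=1)

def predict_price(order_value):
    """Evaluate the fitted regression line at a given order value."""
    return slope * order_value + intercept

predicted = predict_price(3300)
print(f"slope={slope:.4f}, intercept={intercept:.2f}")
print(f"Predicted price for an order value of 3300: {predicted:.2f}")
```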


Code:
#code here


Output:
#output here

Conclusion:
#conclusion


Practical-8
Aim: Classify the student as average or clever. Use appropriate classification
technique based on data set. Reference: csv file

Problem Statement:
Based on the dataset, use appropriate classification techniques in order to
determine whether a student is average or clever.
Description:
Logistic Regression is generally used for classification. Unlike Linear Regression, the
dependent variable can take only a limited number of values, i.e., the dependent variable is
categorical. When there are only two possible outcomes, it is called Binary Logistic
Regression.
A decision tree is a supervised learning algorithm (one with a predefined target variable) that
is mostly used in classification problems. It works for both categorical and continuous input and
output variables. In this technique, we split the population or sample into two or more
homogeneous sets (or sub-populations) based on the most significant differentiator among the
input variables.

Hints:
Step 1: Importing the libraries.
Step 2: Importing the dataset.
Step 3: Splitting the dataset into the Training set and Test set.
Step 4: Training the model on the training set.
Step 5: Predicting the Results.
Step 6: Comparing the Real Values with Predicted Values.
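The six steps above can be sketched with scikit-learn, here using a decision tree (logistic regression would follow the same pattern). The marks, the column names, and the "average"/"clever" labels are made up for illustration; the real dataset and labelling rule may differ.

```python
# Step 1: import the libraries.
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Step 2: the dataset (made-up marks and labels; column names are assumptions).
data = pd.DataFrame({
    "maths":   [45, 50, 55, 60, 65, 75, 80, 85, 90, 92],
    "science": [50, 55, 52, 68, 60, 72, 78, 90, 85, 88],
    "label":   ["average"] * 5 + ["clever"] * 5,
})
X = data[["maths", "science"]]
y = data["label"]

# Step 3: split into training and test sets, keeping both classes in each.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Step 4: train the model on the training set.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Step 5: predict the results for the test set.
y_pred = model.predict(X_test)

# Step 6: compare the real values with the predicted values.
comparison = pd.DataFrame({"actual": y_test.to_numpy(), "predicted": y_pred})
print(comparison)
print("accuracy:", accuracy_score(y_test, y_pred))
```

With only ten rows the accuracy figure means little; the sketch is meant to show the workflow rather than a reliable model.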


Code:
#code here


Output:
#output

Conclusion:


Practical-9
Aim: Use Titanic Dataset from www.kaggle.com and perform Titanic Survivor
Analysis.

Problem Statement:
1) Count the minimum and maximum age of passengers.
2) What percentage of passengers survived (male and female)?
3) Plot the graph based on gender and survival.
Description:
Make use of the various Python libraries to solve the problems given above.
Pandas is especially useful for datasets like this one: it lets us read and
manipulate the data as needed. We can then use matplotlib or similar libraries
to plot the results we obtain.
Hints:
Read the csv file using pandas, find out the values mentioned above. Plot the
values using matplotlib or other similar libraries.
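A minimal sketch of the three tasks follows. It uses the Kaggle column names "Sex", "Age", and "Survived", but the six rows are made up so that the example runs without the file; with the real data the DataFrame would come from pd.read_csv on the downloaded csv file.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows that reuse the Kaggle column names; not real passenger data.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Age": [22.0, 38.0, 26.0, 35.0, None, 4.0],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# 1) Minimum and maximum age (missing ages are skipped by default).
min_age, max_age = df["Age"].min(), df["Age"].max()

# 2) Survival percentage per gender: the mean of a 0/1 column is a fraction.
survival_pct = df.groupby("Sex")["Survived"].mean() * 100

# 3) Bar graph of gender vs survival, built from a count table.
counts = pd.crosstab(df["Sex"], df["Survived"])   # rows: Sex, columns: 0 and 1
ax = counts.plot(kind="bar")
ax.set_xlabel("Gender")
ax.set_ylabel("Number of passengers")
ax.legend(["Did not survive", "Survived"])
# plt.show()  # uncomment to open the plot window

print(min_age, max_age)
print(survival_pct)
```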


Code:
#code here


Output:
#output here

Conclusion:
#Conclusion
