
DS Lab Manual Final

Here are the steps to plot a bar graph showing the geographical location (City Wise) of students: 1. Import the dataset and extract the columns containing city names and student counts. 2. Use pandas or numpy to group the data by city and calculate the count of students in each city. This will give you the student counts for each city. 3. Import matplotlib.pyplot and use plt.bar() to plot the bar graph. 4. Set the x-axis labels to the city names and y-axis labels to student count. 5. Add a title and labels to the graph. 6. Use plt.show() to display the bar graph. This will generate a

Uploaded by

Vivek Panchal

Data Science (3151608)

Data Science Lab Manual (3151608)

L.D. College of Engineering, Ahmedabad
(Affiliated to Gujarat Technological University)

Enrollment No.: 200280723001 1



Index

Sr. No.  Practical                                                        Date  Page No.  Sign
1        Perform descriptive analysis and identify the data type.
2        Implement a method to find out variation in data. For example,
         the difference between highest and lowest marks in each subject,
         semester-wise.
3        Plot the graph showing the result of students in each semester.
4        Plot the graph showing the geographical location (City Wise) of
         students.
5        Plot the graph showing the number of male and female students.
6        Implement a method to treat missing value for gender and missing
         value for marks.
7        To predict the price based on total orders placed in a particular
         area. Use appropriate regression method.
8        Classify the student as average or clever. Use appropriate
         classification technique based on data set.
9        Use Titanic Dataset from www.kaggle.com and perform Titanic
         Survivor Analysis.
         1) Count the minimum and maximum age of passengers.
         2) What percentage of passengers survived (male and female)?
         3) Plot the graph based on gender and survival.

Consider a dataset with student name, gender, enrollment number, 4th semester result with marks
of each subject, mobile number, and city. Implement the following in Python or R.


Practical-1
Aim: Perform descriptive analysis and identify the data type.

Problem/ Description:

Descriptive Analysis of Data:

Descriptive statistics are a building block of data science; advanced analytics is often
incomplete without examining the descriptive statistics of the key metrics. In simple terms,
descriptive statistics are measures that summarize a given dataset. They fall into two groups:
measures of central tendency and measures of dispersion (variability).
Measures of central tendency include the mean, median, and mode, while measures of
dispersion include the standard deviation, variance, and interquartile range. Skewness, which
describes the asymmetry of a distribution, is computed as well. In this practical, we will learn
how to compute the following measures and use them to interpret the data:
1. Mean 2. Median 3. Mode
4. Standard Deviation 5. Variance 6. Interquartile Range
7. Skewness
Data and Its Types:
Categorical Data
Categorical data represents characteristics, such as a person’s gender or language.
1. Nominal Data
Nominal values represent discrete units and are used to label variables that have no
quantitative value. Nominal data has no order, so changing the order of its values does not
change the meaning.
2. Ordinal Data
Ordinal values represent discrete and ordered units. Ordinal data is therefore nearly the same
as nominal data, except that its ordering matters. For example, educational background can be
recorded as Elementary, High School, or College. Note that the difference between Elementary
and High School is not the same as the difference between High School and College. This is the
main limitation of ordinal data: the differences between the values are not really known.
Numerical Data
1. Discrete Data
Data is discrete if it can take on only certain distinct values. This type of data cannot be
measured, but it can be counted. An example is the number of heads in 100 coin flips. To check
whether data is discrete, ask two questions: can you count it, and can it be divided into smaller
and smaller parts? If it can be counted but not subdivided, it is discrete. If, on the contrary,
the data can be measured but not counted, it is continuous.


2. Continuous Data
Continuous data represents measurements, so its values cannot be counted but they can
be measured. An example is the height of a person. Continuous values can only be described
using intervals on the real number line.
i. Interval Data
Interval values represent ordered units that have the same difference between them. We
speak of interval data when a variable contains numeric values that are ordered and the
exact differences between the values are known. A good example is a feature that contains
the temperature of a given place in degrees Celsius. The limitation of interval data is that
it has no true zero, so ratios of values are not meaningful (20 °C is not twice as warm as
10 °C) and statistics that rely on ratios cannot be applied.
ii. Ratio Data
Ratio values are also ordered units that have the same difference between them. They are
the same as interval values, except that they do have an absolute zero. Good examples are
height, weight, and length.
Hints:
Import the data and apply various statistical calculations to the columns. Then identify the data
type of each attribute, i.e., whether it is categorical or numerical.
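
The measures listed in this practical can be computed with pandas. The following is a minimal sketch, not a reference solution: the sample rows and the column names (name, gender, city, maths) are assumptions made for illustration, and in the practical the DataFrame would be loaded from the csv file instead.

```python
import pandas as pd

# Hypothetical sample data; in the practical this would come from pd.read_csv(...)
students = pd.DataFrame({
    "name": ["Asha", "Bhavin", "Chirag", "Disha", "Esha"],
    "gender": ["F", "M", "M", "F", "F"],
    "city": ["Ahmedabad", "Surat", "Ahmedabad", "Rajkot", "Surat"],
    "maths": [72, 85, 64, 85, 91],
})

marks = students["maths"]

summary = {
    "mean": marks.mean(),
    "median": marks.median(),
    "mode": marks.mode().tolist(),   # a column can have more than one mode
    "std": marks.std(),              # sample standard deviation (ddof=1)
    "variance": marks.var(),         # sample variance (ddof=1)
    "iqr": marks.quantile(0.75) - marks.quantile(0.25),
    "skewness": marks.skew(),
}

# Label each column as numerical or categorical based on its pandas dtype
column_types = {
    col: "numerical" if pd.api.types.is_numeric_dtype(students[col]) else "categorical"
    for col in students.columns
}

print(summary)
print(column_types)
```

The dtype check only separates numerical columns from non-numerical ones. Deciding whether a categorical column is nominal or ordinal, or whether a numerical column is discrete, interval, or ratio, still requires knowing what the attribute means.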


Code:
#code here


Output:
#output here

Conclusion:
#Conclusion


Practical-2
Aim: Implement a method to find out variation in data. For example, the difference
between highest and lowest marks in each subject semester wise. Reference: csv file

Problem Statement:

The main purpose of studying variation is quality assurance: measuring the dispersion of the
population data of a probability or frequency distribution, or assessing the content or
quality of sample data.

Hints:
Measures of Variability: Variance
1. Find the mean of the data set.
2. Subtract the mean from each value in the data set.
3. Square each of the differences so that all the values are positive.
4. Divide the sum of the squares by the total number of values in the set to find the
(population) variance.
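
The four steps above can be sketched in plain Python as follows. The marks are made-up values for illustration, and the function name population_variance is a hypothetical helper, not part of any library. The result is checked against statistics.pvariance from the standard library.

```python
import statistics

def population_variance(values):
    """Variance computed with the four steps listed above."""
    mean = sum(values) / len(values)            # step 1: mean of the data set
    deviations = [v - mean for v in values]     # step 2: subtract the mean
    squared = [d ** 2 for d in deviations]      # step 3: square each difference
    return sum(squared) / len(values)           # step 4: divide by the count

marks = [60, 70, 80, 90]   # made-up marks for illustration
variance = population_variance(marks)

print(variance)
print(statistics.pvariance(marks))   # standard-library equivalent
```

Dividing by the number of values gives the population variance. Dividing by one less than that number gives the sample variance, which is what statistics.variance computes.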

Description:
Types of Variation:
Two basic types of variation can occur in a process:
● common cause
● special cause
Common Cause:
Common cause variation happens under standard operating conditions. Consider, for example, a
factory that manufactures metal parts. Fluctuations might occur due to:
● temperature
● humidity
● metal quality
● machine wear and tear.
Common cause variation follows a pattern that you can chart. In the example factory, product
differences might be caused by air humidity. You can chart those differences over time and then
compare that chart with weather bureau humidity data.
Special Cause:


Conversely, special cause variation occurs under non-standard operating conditions. Consider
the factory example again. Disparities could occur if:
● a substandard metal was delivered.
● one of the machines broke down.
● a worker forgot the process and made a lot of unusual mistakes.
Variance is the square of a sample’s standard deviation:
Variance = SD²
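
Returning to the aim of this practical, the difference between the highest and lowest marks in each subject, semester-wise, can be obtained by grouping on the semester. This is a sketch under assumptions: the column names (semester, maths, physics) and the marks are invented for illustration and are not taken from the reference csv file.

```python
import pandas as pd

# Made-up marks; column names are assumptions, not taken from the reference csv file
marks = pd.DataFrame({
    "semester": [1, 1, 1, 2, 2, 2],
    "maths":    [55, 78, 91, 62, 70, 88],
    "physics":  [48, 66, 73, 59, 81, 64],
})

by_semester = marks.groupby("semester")

# Range = highest mark minus lowest mark, per subject within each semester
mark_range = by_semester.max() - by_semester.min()

# Sample variance and standard deviation per subject within each semester
mark_variance = by_semester.var()
mark_std = by_semester.std()

print(mark_range)
print(mark_variance)
```

Squaring mark_std reproduces mark_variance, which matches the relation between variance and standard deviation stated above.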


Code:
#code here


Output:
#output here

Conclusion:
#Conclusion


Practical-3

Aim: Plot the graph showing the result of students in each semester. Reference: csv file

Problem/Description:
Introduction to pyplot:
matplotlib.pyplot is a collection of functions that make matplotlib work like MATLAB.
Each pyplot function makes some change to a figure: e.g., creates a figure, creates a plotting area
in a figure, plots some lines in a plotting area, decorates the plot with labels, etc.
In matplotlib.pyplot various states are preserved across function calls, so that it keeps track
of things like the current figure and plotting area, and the plotting functions are directed to the
current axes. Use a histogram here.

Hints:
Import the data and extract the columns required for plotting. Then plot the graph that fulfills the aim.
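
One possible reading of this practical is a histogram of results for each semester, drawn side by side. The sketch below assumes hypothetical columns named semester and sgpa with made-up values; the actual csv file may store results differently (for example, as marks or percentages).

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up results; "semester" and "sgpa" are assumed column names
results = pd.DataFrame({
    "semester": [1, 1, 1, 1, 2, 2, 2, 2],
    "sgpa":     [6.5, 7.2, 8.1, 7.8, 6.9, 7.5, 8.4, 9.0],
})

semesters = sorted(results["semester"].unique())

# squeeze=False keeps axes two-dimensional even when there is only one semester
fig, axes = plt.subplots(1, len(semesters), figsize=(8, 3), sharey=True, squeeze=False)

# One histogram per semester, each drawn on its own axes
for ax, sem in zip(axes[0], semesters):
    ax.hist(results.loc[results["semester"] == sem, "sgpa"], bins=5, range=(5, 10))
    ax.set_title(f"Semester {sem}")
    ax.set_xlabel("SGPA")
axes[0][0].set_ylabel("Number of students")
fig.tight_layout()
```

Call plt.show() afterwards to display the figure, or fig.savefig(...) to write it to a file.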


Code:
#code here


Output:
#output here

Conclusion:
#conclusion


Practical-4

Aim: Plot the graph showing the geographical location (City Wise) of students.

Problem Statement:
Generate a bar graph of Cities vs Students such that the city name is on the X axis and
the number of students in a particular city is on the Y axis. Reference: csv file
Description:
A bar graph is a way of plotting two variables using the X and Y axes.
It is useful whenever we have to deal with two types of variable, especially
when one is numerical and the other is categorical.
It is common practice to keep the numerical values on the Y-axis and the categorical
ones on the X-axis.
Hints:
Prepare a dataset with the columns ‘City Name’ and ‘Number of Students’ and
store it as a csv file. Read the file and plot it using the required libraries, for example
matplotlib.
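
A minimal sketch of the hint above, assuming a csv file with the two columns named in the hints. The rows are made up, and an in-memory string stands in for the file so that the example is self-contained.

```python
import io

import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the csv file described in the hints; the rows are made up
csv_text = """City Name,Number of Students
Ahmedabad,42
Surat,25
Vadodara,18
Rajkot,11
"""
cities = pd.read_csv(io.StringIO(csv_text))

fig, ax = plt.subplots()
ax.bar(cities["City Name"], cities["Number of Students"])   # categories on X, counts on Y
ax.set_xlabel("City")
ax.set_ylabel("Number of students")
ax.set_title("Students by city")
```

If the dataset has one row per student rather than one row per city, the counts can first be derived with value_counts() on the city column. Call plt.show() to display the graph.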


Code:
#code here


Output:
#output here

Conclusion:
#conclusion


Practical-5
Aim: Plot the graph showing the number of male and female students.

Problem Statement:
Generate a bar graph of Male/Female vs Total Count such that the count values are on
the X axis and the Male/Female categories are on the Y axis. Reference: data.csv file
Description:
In common practice, we generally keep numeric values on the Y-axis, but in this
problem we have to put them on the X-axis and the categorical values (Male/Female)
on the Y-axis.
Hints:
Set the values of the variables Male and Female and plot them on the axes mentioned
above using the required libraries, for example matplotlib.
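
Placing the counts on the X axis and the categories on the Y axis amounts to a horizontal bar graph, which matplotlib draws with barh. The sketch below uses made-up records, and the column name gender is an assumption about data.csv.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up records; the "gender" column name is an assumption about data.csv
students = pd.DataFrame({"gender": ["Male", "Female", "Male", "Male", "Female"]})

# Count the students in each category, sorted alphabetically by label
counts = students["gender"].value_counts().sort_index()

fig, ax = plt.subplots()
ax.barh(counts.index.tolist(), counts.values)   # counts on X, categories on Y
ax.set_xlabel("Number of students")
ax.set_ylabel("Gender")
ax.set_title("Male and female students")
```

Call plt.show() to display the graph.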


Code:
#code here


Output:
#output here

Conclusion:
#conclusion


Practical-6
Aim: Implement a method to treat missing value for gender and missing value for
marks.
Problem:

Find all the null values for gender and for all subject marks, and replace them with
0. Reference: data.csv file
Description:
Ways to treat missing values:
1) Ignore the tuple (record/row):
● Usually done when the class label is missing.
Example:
● The task is to distinguish between two types of email, "spam" and "non-spam" (ham).
● "Spam" and "non-spam" are called class labels.
● If an email arrives with its class label missing, it is discarded.
2) Fill in the missing value manually:
● Use the attribute mean (average) to fill in the missing value, or use the attribute mean
(average) of all samples belonging to the same class.
3) Use a global constant to fill in the missing value:
Replace all the missing attribute values with the same constant, such as the label “Unknown”.

Hints:
Import the data and check for missing values. Then drop the rows with missing data, or replace
the missing values with the mean or with the label “Unknown”, using the appropriate lines of code.
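
The three treatments described above map onto pandas calls roughly as follows. This is a sketch with made-up records, and the column names (name, gender, maths) are assumptions about data.csv.

```python
import numpy as np
import pandas as pd

# Made-up records with gaps; column names are assumptions about data.csv
df = pd.DataFrame({
    "name":   ["Asha", "Bhavin", "Chirag", "Disha"],
    "gender": ["F", None, "M", None],
    "maths":  [72.0, np.nan, 64.0, 80.0],
})

print(df.isna().sum())   # number of missing values in each column

# 1) Ignore the tuple: drop every row that has any missing value
dropped = df.dropna()

# 2) Fill missing marks with the attribute mean
mean_filled = df.copy()
mean_filled["maths"] = mean_filled["maths"].fillna(mean_filled["maths"].mean())

# 3) Fill with a global constant: a label for gender, 0 for marks
constant_filled = df.copy()
constant_filled["gender"] = constant_filled["gender"].fillna("Unknown")
constant_filled["maths"] = constant_filled["maths"].fillna(0)

print(constant_filled)
```

The problem statement asks for 0 as the replacement value. The sketch uses 0 for marks and the label "Unknown" for gender, since a numeric 0 in a text column is easy to misread; passing 0 to fillna for gender works the same way if that is what is required.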


Code:
#code here


Output:
#output here

Conclusion:
#conclusion


Practical-7
Aim: To predict the price based on total orders placed in a particular area. Use
appropriate regression method.

Problem Statement:
Make use of the areaorders.csv file. If a particular area has an order value of 3300, then predict
its price. Reference: areaorders.csv file
Description:
What is Regression?
Regression is a method to determine the statistical relationship between a dependent variable and
one or more independent variables. A change in the dependent variable is associated with a
change in the independent variables. Regression comes in several forms (seven are commonly
listed), but it can be broadly classified into two major types:
1. Linear Regression
The simplest case of linear regression is to find a relationship using a linear model (i.e., a line)
between one independent input variable (a single feature) and a dependent output variable. This
is called Bivariate Linear Regression (also known as simple linear regression). When a linear
model instead represents the relationship between a dependent output and multiple independent
input variables, it is called Multivariate Linear Regression (more commonly, multiple linear
regression). The dependent variable is continuous, and the independent variables may or may not
be continuous. We find the relationship between them with the help of the best-fit line, which is
also known as the regression line.
2. Logistic Regression
It is used when the output is categorical, which makes it closer to a classification method. The
output can be Success/Failure, Yes/No, True/False, or 0/1. There is no need for a linear
relationship between the dependent output variable and the independent input variables. If the
output has only two possibilities, it is called Binary Logistic Regression. If the dependent output
has more than two possibilities and there is no ordering among them, it is called Multinomial
Logistic Regression. If the output has more than two possibilities and they are ordered, it is
called Ordinal Logistic Regression.
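
Since the price is a continuous value predicted from a single feature, bivariate linear regression is a natural fit for this problem. The sketch below fits the best-fit line with numpy.polyfit and predicts the price for 3300 orders. The (orders, price) pairs are made up and merely stand in for the contents of areaorders.csv, whose actual columns are not shown in this manual.

```python
import numpy as np

# Made-up (orders, price) pairs standing in for areaorders.csv
orders = np.array([1000, 2000, 3000, 4000, 5000], dtype=float)
price = np.array([12, 19, 31, 42, 49], dtype=float)

# Fit price = slope * orders + intercept by ordinary least squares
slope, intercept = np.polyfit(orders, price, deg=1)

# Predict the price for an area with 3300 orders
predicted_price = slope * 3300 + intercept
print(f"slope={slope:.4f}, intercept={intercept:.2f}, prediction={predicted_price:.2f}")
```

With the real file, the two arrays would be replaced by the corresponding columns read with pandas. The same fit can also be obtained with scikit-learn's LinearRegression if that library is preferred.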


Code:
#code here


Output:
#output here

Conclusion:
#conclusion


Practical-8
Aim: Classify the student as average or clever. Use appropriate classification
technique based on data set. Reference: csv file

Problem Statement:
Based on the dataset, use appropriate classification techniques in order to
determine whether a student is average or clever.
Description:
Logistic Regression is generally used for classification purposes. Unlike Linear Regression, the
dependent variable can take only a limited number of values, i.e., the dependent variable is
categorical. When the number of possible outcomes is only two, it is called Binary Logistic
Regression.
A decision tree is a type of supervised learning algorithm (having a predefined target variable)
that is mostly used in classification problems. It works for both categorical and continuous input
and output variables. In this technique, we split the population or sample into two or more
homogeneous sets (or sub-populations) based on the most significant differentiator among the
input variables.

Hints:
Step 1: Importing the libraries.
Step 2: Importing the dataset.
Step 3: Splitting the dataset into the Training set and Test set.
Step 4: Training the model on the training set.
Step 5: Predicting the Results.
Step 6: Comparing the Real Values with Predicted Values.
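
The six steps above can be sketched with scikit-learn. The example uses a decision tree, one of the two techniques described; logistic regression could be swapped in at step 4. The dataset is made up, and the choice of two features (marks in two subjects) and the labels 0 for average and 1 for clever are assumptions for illustration.

```python
# Step 1: import the libraries
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Step 2: made-up dataset; each row is [maths, physics] marks,
# label 1 = clever, 0 = average
X = [[35, 40], [42, 38], [48, 52], [55, 50], [58, 61], [62, 57],
     [78, 82], [81, 75], [85, 88], [88, 91], [92, 86], [95, 97]]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

# Step 3: hold out a quarter of the rows for testing,
# keeping both classes represented in each split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Step 4: train the model on the training set
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Step 5: predict the results for the held-out rows
y_pred = model.predict(X_test)

# Step 6: compare the real values with the predicted values
accuracy = accuracy_score(y_test, y_pred)
print(list(zip(y_test, y_pred)), accuracy)
```

With a real dataset, X and y would come from the csv file, and the accuracy on the test set indicates how well the chosen technique separates the two groups.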


Code:
#code here


Output:
#output

Conclusion:


Practical-9
Aim: Use Titanic Dataset from www.kaggle.com and perform Titanic Survivor
Analysis.

Problem Statement:
1) Count the minimum and maximum age of passengers.
2) What percentage of passengers survived (male and female)?
3) Plot the graph based on gender and survival.
Description:
Make use of the various Python libraries to solve the problems given above.
Pandas is particularly useful when dealing with datasets like this one: it lets us
read and manipulate the data as needed. We can then use matplotlib or a similar
library to plot the values we obtained.
Hints:
Read the csv file using pandas and find the values mentioned above. Plot the
values using matplotlib or other similar libraries.
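
A sketch of the three tasks with pandas and matplotlib. The rows below are made up so that the example is self-contained; they use the column names Survived, Sex, and Age, which are assumed to match the Kaggle file. With the real dataset, the DataFrame would be read from the downloaded csv file instead.

```python
import matplotlib.pyplot as plt
import pandas as pd

# A few made-up rows; "Survived", "Sex", and "Age" are assumed to match the
# column names of the Kaggle file, which would normally be read with pd.read_csv
titanic = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0, 0, 1],
    "Sex": ["male", "female", "female", "male", "female", "male", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, None, 54.0, 2.0, 27.0],
})

# 1) Minimum and maximum age (missing ages are skipped by default)
min_age, max_age = titanic["Age"].min(), titanic["Age"].max()

# 2) Percentage of survivors, overall and within each gender
overall_pct = titanic["Survived"].mean() * 100
pct_by_sex = titanic.groupby("Sex")["Survived"].mean() * 100

# 3) Number of non-survivors and survivors for each gender, as grouped bars
counts = titanic.groupby(["Sex", "Survived"]).size().unstack(fill_value=0)
ax = counts.plot(kind="bar")
ax.set_xlabel("Gender")
ax.set_ylabel("Number of passengers")
ax.legend(["Did not survive", "Survived"])

print(min_age, max_age, overall_pct)
print(pct_by_sex)
```

The second task is read here as the survival rate within each gender. If the share of all survivors who were male or female is wanted instead, the survivors can be filtered first and value_counts(normalize=True) applied to the Sex column. Call plt.show() to display the graph.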


Code:
#code here


Output:
#output here

Conclusion:
#Conclusion
