DS Lab Manual Final
Consider a dataset with student name, gender, enrollment number, 4th semester result with marks for
each subject, mobile number, and city. Implement the following in Python or R.
Practical-1
Aim: Perform descriptive analysis and identify the data type.
Problem/ Description:
2. Continuous Data
Continuous data represents measurements; its values cannot be counted, but they can
be measured. An example is the height of a person. Such values can only be described using
intervals on the real number line.
i. Interval Data
Interval values represent ordered units that have the same difference. Therefore, we
speak of interval data when we have a variable that contains numeric values that are
ordered and where we know the exact differences between the values. A good example
would be a feature that contains the temperature of a given place in degrees Celsius. The problem with interval
data is that it has no true zero. Because there is no true zero, ratios of values are not meaningful and a
number of descriptive and inferential statistics can’t be applied.
ii. Ratio Data
Ratio values are ordered units with intermediate values. Ratio values are the same as
interval values, with the difference that they do have an absolute zero. Good examples are
height, weight, length etc.
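To see why the missing true zero matters, the short sketch below compares the same two temperatures on an interval scale (degrees Celsius) and on a ratio scale (kelvin). The two temperature values are made up for illustration.

```python
# Temperatures in degrees Celsius are interval data: the zero point is arbitrary.
temp_a_c = 10.0
temp_b_c = 20.0

# Converting to kelvin gives a ratio scale with an absolute zero.
KELVIN_OFFSET = 273.15
temp_a_k = temp_a_c + KELVIN_OFFSET
temp_b_k = temp_b_c + KELVIN_OFFSET

# Differences are meaningful on both scales and agree with each other.
diff_c = temp_b_c - temp_a_c
diff_k = temp_b_k - temp_a_k

# Ratios are only meaningful on the ratio scale.
ratio_c = temp_b_c / temp_a_c  # 2.0, yet 20 °C is not "twice as hot" as 10 °C
ratio_k = temp_b_k / temp_a_k  # roughly 1.035

print(f"difference: {diff_c:.1f} °C = {diff_k:.1f} K")
print(f"ratio in Celsius: {ratio_c:.3f}, ratio in kelvin: {ratio_k:.3f}")
```

The difference is the same on both scales, but the ratio changes with the choice of zero, which is why ratios should only be computed on ratio data.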
Hints:
Import the data and apply various statistical calculations to the columns. Then identify the data type of each
attribute, i.e. whether it is categorical, numerical, etc.
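One possible starting point for this hint is sketched below with pandas. The column names and the four sample rows are hypothetical; in the practical the data would be loaded from your own csv file (for example with pd.read_csv). Identifiers such as the enrollment number and mobile number are stored as text here, because they are labels rather than quantities.

```python
import pandas as pd

# Hypothetical sample rows; replace with pd.read_csv(...) on your own file.
students = pd.DataFrame(
    {
        "name": ["Asha", "Ravi", "Meera", "Karan"],
        "gender": ["F", "M", "F", "M"],
        "enrollment_no": ["E001", "E002", "E003", "E004"],
        "maths": [78, 64, 91, 55],
        "physics": [82.5, 70.0, 88.0, 61.5],
        "mobile": ["9000000001", "9000000002", "9000000003", "9000000004"],
        "city": ["Surat", "Rajkot", "Surat", "Ahmedabad"],
    }
)

# Descriptive statistics for every column (count, unique, mean, std, min, max, ...).
summary = students.describe(include="all")


def classify(series):
    """Label a column as numerical or categorical based on its dtype."""
    return "numerical" if pd.api.types.is_numeric_dtype(series) else "categorical"


column_kinds = {col: classify(students[col]) for col in students.columns}

print(summary)
for col, kind in column_kinds.items():
    print(f"{col}: dtype={students[col].dtype}, kind={kind}")
```

The dtype only tells you how a column is stored. Deciding whether a numerical column is discrete or continuous, or whether a categorical one is nominal or ordinal, still needs a look at what the values mean.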
Code:
#code here
Output:
#output here
Conclusion:
#Conclusion
Practical-2
Aim: Implement a method to find out variation in data. For example, the difference
between highest and lowest marks in each subject semester wise. Reference: csv file
Problem Statement:
The main purpose of studying variation is quality assurance: measuring the dispersion of
the population data of a probability or frequency distribution, or determining the content or
quality of the sample data of substances.
Hints:
Measures of Variability: Variance
1. Find the mean of the data set.
2. Subtract the mean from each value in the data set.
3. Square each of these differences so that all values are non-negative.
4. Finally, divide the sum of the squares by the total number of values in the set to find the variance
(this gives the population variance; divide by one less than the number of values for the sample variance).
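The steps above can be sketched in plain Python as follows, together with the highest-minus-lowest difference mentioned in the aim. The subject names and marks are hypothetical, and the helper names marks_range and population_variance are introduced only for this illustration.

```python
import math
import statistics

# Hypothetical marks per subject; in the practical these come from the csv file.
marks = {
    "maths": [78, 64, 91, 55],
    "physics": [82, 70, 88, 60],
}


def marks_range(values):
    """Difference between the highest and lowest value."""
    return max(values) - min(values)


def population_variance(values):
    """Variance computed step by step, dividing by the number of values."""
    mean = sum(values) / len(values)             # find the mean
    deviations = [v - mean for v in values]      # subtract the mean from each value
    squared = [d ** 2 for d in deviations]       # square each difference
    return sum(squared) / len(values)            # average the squared differences


for subject, values in marks.items():
    variance = population_variance(values)
    std_dev = math.sqrt(variance)  # standard deviation is the square root of the variance
    print(f"{subject}: range={marks_range(values)}, "
          f"variance={variance:.2f}, SD={std_dev:.2f}")
    # Cross-check against the standard library implementation.
    assert math.isclose(variance, statistics.pvariance(values))
```

With real data the same two helpers can be applied to each subject column, grouped by semester.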
Description:
Types of Variation:
There are two basic types which can occur in a process:
● common cause
● special cause.
Common Cause:
Common cause variation happens in standard operating conditions. Consider a factory, for
example. Fluctuations might occur due to:
● temperature
● humidity
● metal quality
● machine wear and tear.
Common cause variation has a trend that you can chart. In the factory mentioned before, product
differences might be caused by air humidity. You can chart those differences over time. Then you
can compare that chart to weather bureau humidity data.
Special Cause:
Conversely, special cause variation occurs in non-standard operating conditions. Returning to
the example factory, disparities could occur if:
● a substandard metal was delivered.
● one of the machines broke down.
● a worker forgot the process and made a lot of unusual mistakes.
Variance is the square of the standard deviation (SD).
Variance = SD²
Code:
#code here
Output:
#output here
Conclusion:
#Conclusion
Practical-3
Aim: Plot the graph showing result of student in each semester. Reference: csv file
Problem/Description:
Introduction to pyplot:
matplotlib.pyplot is a collection of functions that make matplotlib work like MATLAB.
Each pyplot function makes some change to a figure: e.g., creates a figure, creates a plotting area
in a figure, plots some lines in a plotting area, decorates the plot with labels, etc.
In matplotlib.pyplot various states are preserved across function calls, so that it keeps track
of things like the current figure and plotting area, and the plotting functions are directed to the
current axes. Use histogram here.
Hints:
Import the data and extract the values required for plotting. Then plot the graph that fulfills the aim.
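A minimal sketch of such a histogram with matplotlib.pyplot is shown below. The semester names and result percentages are hypothetical, and the 10-point bin width is an arbitrary choice for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical result percentages per semester; replace with values read from the csv file.
results = {
    "Semester 3": [62, 71, 55, 80, 68, 74, 59, 66],
    "Semester 4": [65, 78, 58, 85, 70, 72, 61, 69],
}

# Shared bin edges (50, 60, ..., 100) so that the semesters are comparable.
bin_edges = list(range(50, 101, 10))

# One subplot per semester, sharing the Y axis.
fig, axes = plt.subplots(1, len(results), figsize=(9, 3), sharey=True)

hist_counts = {}
for ax, (semester, scores) in zip(axes, results.items()):
    counts, _, _ = ax.hist(scores, bins=bin_edges, edgecolor="black")
    hist_counts[semester] = [int(c) for c in counts]  # students per bin
    ax.set_title(semester)
    ax.set_xlabel("Result (%)")

axes[0].set_ylabel("Number of students")
fig.tight_layout()
```

Call plt.show() to display the figure, or fig.savefig with a file name of your choice to store it for the Output section.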
Code:
#code here
Output:
#output here
Conclusion:
#conclusion
Practical-4
Aim: Plot the graph showing the geographical location (City Wise) of students.
Problem Statement:
Generate a bar graph of Cities vs Students such that the city name is on the X axis and
the number of students in a particular city is on the Y axis. Reference: csv file
Description:
A bar graph is a way of plotting two variables using the X and Y axes.
A bar graph is useful whenever we have to deal with two variable types, especially
when one is numerical and the other one is categorical.
It is a common practice to keep the numerical values on Y-axis and the categorical
ones on X-axis.
Hints:
Prepare a dataset with the columns ‘City Name’ and ‘Number of Students’ and
store it as a csv file. Read the file and plot it using the required libraries, for example
matplotlib.
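The hint could be followed along these lines. To keep the sketch self-contained, the csv content is held in a string instead of a file; the city names and counts are hypothetical. With a real file, open it and pass the file object to csv.DictReader instead.

```python
import csv
import io

import matplotlib.pyplot as plt

# Hypothetical csv content with the two columns named in the hint.
csv_text = """City Name,Number of Students
Ahmedabad,12
Rajkot,5
Surat,9
Vadodara,7
"""

# Read the rows as dictionaries keyed by the header names.
rows = list(csv.DictReader(io.StringIO(csv_text)))
cities = [row["City Name"] for row in rows]
student_counts = [int(row["Number of Students"]) for row in rows]

# Categorical values on the X axis, numerical values on the Y axis.
fig, ax = plt.subplots()
bars = ax.bar(cities, student_counts)
ax.set_xlabel("City")
ax.set_ylabel("Number of students")
ax.set_title("Students per city")
fig.tight_layout()
```

Call plt.show() to display the chart. If the dataset has one row per student instead of one row per city, count the occurrences of each city first (for example with collections.Counter) and plot those counts.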
Code:
#code here
Output:
#output here
Conclusion:
#conclusion
Practical-5
Aim: Plot the graph showing the number of male and female students.
Problem Statement:
Generate a bar graph of Male/Female vs Total Count such that the count values are on
the X axis and Male/Female students are on the Y axis. Reference: data.csv file
Description:
In common practice, we generally keep numeric values on Y-axis but in this
problem, we have to put them on X-axis and the categorical values (Male/Female)
on Y-axis.
Hints:
Set the values of the variables Male and Female and plot them on the axes mentioned
above using the required libraries, for example matplotlib.
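One way to get the counts on the X axis is a horizontal bar chart, sketched below with matplotlib's barh. The gender values are hypothetical and stand in for the gender column of data.csv; the "M"/"F" coding is an assumption.

```python
import matplotlib.pyplot as plt

# Hypothetical gender column; replace with the values read from data.csv.
genders = ["M", "F", "F", "M", "F", "M", "M", "F", "F"]

# Set the two variables named in the hint.
male = genders.count("M")
female = genders.count("F")

# barh draws horizontal bars: categories on the Y axis, counts on the X axis.
fig, ax = plt.subplots()
bars = ax.barh(["Male", "Female"], [male, female])
ax.set_xlabel("Total count")
ax.set_ylabel("Gender")
ax.set_title("Number of male and female students")
fig.tight_layout()
```

Call plt.show() to display the chart.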
Code:
#code here
Output:
#output here
Conclusion:
#conclusion
Practical-6
Aim: Implement a method to treat missing value for gender and missing value for
marks.
Problem:
Find all the null values for gender and for all subject marks, and replace them with
0. Reference: data.csv file
Description:
Ways to treat missing values:
1) Ignore the tuple (record/row):
• Usually done when class label is missing.
Example:
● The task is to distinguish between two types of emails, "spam" and "non-spam" (ham).
● "Spam" and "non-spam" are the class labels.
● If an email arrives with its class label missing, it is discarded.
2) Fill missing value manually
● Use the attribute mean (average) to fill in the missing value, or use the attribute mean
(average) of all samples belonging to the same class.
3) Use a global constant to fill in the missing value
Replace all the missing attribute values by the same constant such as a label like “Unknown”.
Hints:
Import the data and check for missing values. Then drop the rows with missing data, or replace the missing
values with the mean or with an “Unknown” label, using the appropriate lines of code.
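The treatments listed above can be sketched with pandas as follows. The column names and sample rows are hypothetical stand-ins for data.csv. The last option also shows the replacement with 0 asked for in the problem statement.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with gaps; replace with pd.read_csv(...) on your own file.
data = pd.DataFrame(
    {
        "gender": ["F", None, "M", "F"],
        "maths": [78.0, np.nan, 91.0, 55.0],
        "physics": [82.0, 70.0, np.nan, 60.0],
    }
)
mark_columns = ["maths", "physics"]

# Count the missing values in each column.
missing_per_column = data.isna().sum()

# Option 1: ignore (drop) every row that has at least one missing value.
dropped = data.dropna()

# Option 2: fill missing marks with the mean of their own column.
mean_filled = data.copy()
mean_filled[mark_columns] = mean_filled[mark_columns].fillna(
    mean_filled[mark_columns].mean()
)

# Option 3: fill with a global constant ("Unknown" for gender, 0 for marks).
constant_filled = data.copy()
constant_filled["gender"] = constant_filled["gender"].fillna("Unknown")
constant_filled[mark_columns] = constant_filled[mark_columns].fillna(0)

print(missing_per_column)
print(dropped)
print(mean_filled)
print(constant_filled)
```

Filling marks with 0 lowers the column mean, so it is worth stating in the conclusion which treatment was used and why.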
Code:
#code here
Output:
#output here
Conclusion:
#conclusion
Practical-7
Aim: To predict the price based on total orders placed in a particular area. Use
appropriate regression method.
Problem Statement:
Using the areaorders.csv file, if a particular area has an order value of 3300, then predict its
price. Reference: areaorders.csv file
Description:
What is Regression?
Regression is a method to determine the statistical relationship between a dependent variable and
one or more independent variables. A change in the dependent variable is associated with a
change in the independent variables. Regression is often described as having seven types, but it can
be broadly classified into two major types:
1. Linear Regression
The simplest case of linear regression is to find a relationship using a linear model (i.e. a line)
between a single independent input variable (feature) and a dependent output variable. This
is called Bivariate (or simple) Linear Regression. On the other hand, a linear model representing
the relationship between a dependent output and multiple independent input variables is called
Multivariate Linear Regression (more precisely, multiple linear regression). The dependent variable
is continuous, and the independent variables may or may not be continuous. We find the relationship
between them with the help of the best-fit line, which is also known as the regression line.
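As an illustration of the single-feature case, the sketch below fits a least-squares line in plain Python and predicts the price for an order value of 3300. The order and price figures are hypothetical, since the contents of areaorders.csv are not shown here, and fit_line is a helper name introduced only for this example. A library such as scikit-learn could be used instead.

```python
# Hypothetical (orders, price) pairs; replace with the columns of areaorders.csv.
orders = [1000, 2000, 3000, 4000]
prices = [250, 430, 660, 840]


def fit_line(x_values, y_values):
    """Least-squares fit of y = slope * x + intercept for a single feature."""
    n = len(x_values)
    mean_x = sum(x_values) / n
    mean_y = sum(y_values) / n
    # Sum of products of deviations, and sum of squared deviations of x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_values, y_values))
    s_xx = sum((x - mean_x) ** 2 for x in x_values)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(orders, prices)

# Predict the price for the order value named in the problem statement.
predicted_price = slope * 3300 + intercept

print(f"price = {slope:.3f} * orders + {intercept:.2f}")
print(f"predicted price for 3300 orders: {predicted_price:.2f}")
```

For this made-up data the fitted line has slope 0.2 and intercept 45, so the prediction comes out at about 705. The real dataset will give different numbers.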
2. Logistic Regression
It is used when the output is categorical. It is more like a classification problem. The output can
be Success / Failure, Yes / No, True/ False or 0/1. There is no need for a linear relationship between
the dependent output variable and independent input variables. If the output has only two
possibilities, then it is called Binary Logistic Regression. If the dependent output has more than two
output possibilities and there is no ordering in them, then it is called Multinomial Logistic
Regression. If there is order associated with the output and there are more than two output
possibilities then it is called Ordinal Logistic Regression.
Code:
#code here
Output:
#output here
Conclusion:
#conclusion
Practical-8
Aim: Classify the student as average or clever. Use appropriate classification
technique based on data set. Reference: csv file
Problem Statement:
Based on the dataset, use appropriate classification techniques in order to
determine whether a student is average or clever.
Description:
Logistic Regression is generally used for classification purposes. Unlike Linear Regression, the
dependent variable can take only a limited number of values, i.e. the dependent variable is
categorical. When the number of possible outcomes is only two, it is called Binary Logistic
Regression.
A decision tree is a type of supervised learning algorithm (having a predefined target variable) that
is mostly used in classification problems. It works for both categorical and continuous input and
output variables. In this technique, we split the population or sample into two or more
homogeneous sets (or sub-populations) based on the most significant differentiator in the input variables.
Hints:
Step 1: Importing the libraries.
Step 2: Importing the dataset.
Step 3: Splitting the dataset into the Training set and Test set.
Step 4: Training the model on the training set.
Step 5: Predicting the Results.
Step 6: Comparing the Real Values with Predicted Values.
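The six steps could look roughly like this with scikit-learn and a decision tree. The dataset is generated in the code and is purely hypothetical, and the rule that labels a student "clever" when the average mark is at least 70 is an assumption made only so that the example has a target column. The real csv file may define the classes differently.

```python
# Step 1: import the libraries.
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Step 2: build (or import) the dataset. Hypothetical marks for 40 students.
students = pd.DataFrame(
    {
        "maths": [40 + (7 * i) % 60 for i in range(40)],
        "physics": [45 + (11 * i) % 55 for i in range(40)],
    }
)
# Assumed labelling rule: average mark of 70 or more means "clever".
average_mark = students[["maths", "physics"]].mean(axis=1)
students["label"] = ["clever" if m >= 70 else "average" for m in average_mark]

# Step 3: split into training and test sets.
X = students[["maths", "physics"]]
y = students["label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Step 4: train the model on the training set.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Step 5: predict the results for the test set.
predicted = model.predict(X_test)

# Step 6: compare the real values with the predicted values.
comparison = pd.DataFrame({"actual": y_test.to_numpy(), "predicted": predicted})
accuracy = accuracy_score(y_test, predicted)

print(comparison)
print(f"accuracy: {accuracy:.2f}")
```

Swapping DecisionTreeClassifier for sklearn.linear_model.LogisticRegression leaves the other steps unchanged, which makes it easy to compare the two techniques described above.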
Code:
#code here
Output:
#output
Conclusion:
Practical-9
Aim: Use Titanic Dataset from www.kaggle.com and perform Titanic Survivor
Analysis.
Problem Statement:
3) Count the minimum and maximum age of passengers.
4) What percentage of passengers survived? (Male and Female)
5) Plot the graph based on gender and survival.
Description:
Make use of the various Python libraries to solve the problems given
above. Pandas is very useful when dealing with datasets like this one:
it lets us read and manipulate the data in the way we like. We
can use matplotlib or similar libraries to plot the details we
obtain.
Hints:
Read the csv file using pandas, find out the values mentioned above. Plot the
values using matplotlib or other similar libraries.
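A sketch of the listed tasks with pandas and matplotlib is given below. To stay self-contained it uses six made-up rows in place of the downloaded file; the column names Sex, Age and Survived follow the Kaggle Titanic training file. With the real data, load the csv with pd.read_csv instead.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical rows; replace with pd.read_csv(...) on the downloaded Kaggle file.
passengers = pd.DataFrame(
    {
        "Sex": ["male", "female", "female", "male", "male", "female"],
        "Age": [22.0, 38.0, 26.0, 35.0, None, 4.0],
        "Survived": [0, 1, 1, 0, 1, 0],
    }
)

# Minimum and maximum age (missing ages are skipped by default).
min_age = passengers["Age"].min()
max_age = passengers["Age"].max()

# Survived is 0 or 1, so its mean per group is the survival rate.
survival_pct = passengers.groupby("Sex")["Survived"].mean() * 100

# Number of passengers for each combination of gender and survival.
counts = pd.crosstab(passengers["Sex"], passengers["Survived"])

# Grouped bar chart: one group per gender, one bar per survival outcome.
fig, ax = plt.subplots()
counts.plot(kind="bar", ax=ax)
ax.set_xlabel("Gender")
ax.set_ylabel("Number of passengers")
ax.legend(title="Survived (0 = no, 1 = yes)")
fig.tight_layout()

print(f"age range: {min_age} to {max_age}")
print(survival_pct.round(2))
```

Call plt.show() to display the chart. The percentages printed for these sample rows say nothing about the real dataset.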
Code:
#code here
Output:
#output here
Conclusion:
#Conclusion