CS3361 Data Science Lab Manual
Information Technology by adapting to the rapid technological advancement, and efficiently mold the
engineers’ knowledge with theory and practice, using teaching skill utilizing the latest practical
M1. To produce successful graduates with personal and professional responsibilities and
M2. To impart professional ethics, social responsibilities, entrepreneurial skills and value-based IT
education.
COURSE OUTCOMES:
At the end of this course, the students will be able to:
COs   Cognitive Level   Course Outcomes
CO1   Create            Make use of the Python libraries for data science.
CO2   Create            Make use of the basic statistical and probability measures for data science.
CO3   Apply             Perform descriptive analytics on the benchmark data sets.
CO4   Apply             Perform correlation and regression analytics on standard data sets.
CO5   Apply             Present and interpret data using visualization packages in Python.
CO6   Apply             Apply data science concepts and methods to solve problems in real-world contexts and communicate these solutions effectively.
Exp. No.   Name of the Experiment   COs   POs   PSOs
INDEX
1. Download, install and explore the features of NumPy, SciPy, Jupyter, Statsmodels and Pandas packages
2. Working with NumPy arrays
3. Working with Pandas data frames
4. Reading data from text files, Excel and the web and exploring various commands for doing descriptive analytics on the Iris data set
5. Use the diabetes data set from UCI and Pima Indians Diabetes data set
6. Apply and explore various plotting functions on UCI data sets
7. Visualizing Geographic Data with Basemap
8. Predict Income with Census Data
Sample Exercise
EX.No:1
Download, install and explore the features of NumPy, SciPy, Jupyter, Statsmodels and Pandas packages
Date:
Related CO & PO: CO1 & PO1,2,3,4,9,10,11,12
Aim:
To download, install and explore the features of the NumPy, SciPy, Jupyter, Statsmodels and Pandas packages using the pip command.
Basic Tools:
a. Python
b. Numpy
c. Scipy
d. Matplotlib
e. Pandas
f. Statsmodels
g. seaborn
h. plotly
i. bokeh
1. Python
Python is an easy to learn, powerful programming language. It has efficient high-level data
structures and a simple but effective approach to object-oriented programming. Python’s elegant
syntax and dynamic typing, together with its interpreted nature, make it an ideal language for
scripting and rapid application development in many areas on most platforms.
The Python interpreter and the extensive standard library are freely available in source or
binary form for all major platforms from the Python web site, https://www.python.org/, and may be
freely distributed.
Installation Commands:
Step 1: Download the Python Installer binaries. Open the official Python website in
your web browser. ...
Step 2: Run the Executable Installer. Once the installer is downloaded, run the
Python installer. ...
Step 3: Add Python to the environment variables. ...
Step 4: Verify the Python Installation.
2. Numpy
NumPy stands for Numerical Python and it is a core scientific computing library
in Python. It provides efficient multi-dimensional array objects and various operations
to work with these array objects.
The package installer for Python (pip) is needed to install these packages on your computer.
Installation Commands:
1. Command Prompt : py -m pip --version
2. Command Prompt : py -m pip install numpy
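A quick check that the installation succeeded (the version number will vary with your setup):
Command Prompt : py -c "import numpy; print(numpy.__version__)"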
3. Scipy
SciPy is a scientific computation library that uses NumPy underneath. SciPy stands for Scientific Python. It provides more utility functions for optimization, stats and signal processing. Like NumPy, SciPy is open source so we can use it freely.
Installation Commands:
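A typical pip command, following the pattern shown for NumPy above:
Command Prompt : py -m pip install scipy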
4. Matplotlib
Matplotlib is a comprehensive library for creating static, animated, and interactive
visualizations in Python. Matplotlib makes easy things easy and hard things possible.
Installation Commands:
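The usual pip installation (assuming pip is on the PATH):
Command Prompt : py -m pip install matplotlib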
5. Pandas
pandas is a fast, powerful, flexible and easy-to-use open source data analysis and manipulation tool, built on top of the Python programming language.
Installation Commands:
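A typical pip command:
Command Prompt : py -m pip install pandas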
6. Jupyter
The Jupyter Notebook is the original web application for creating and sharing
computational documents. It offers a simple, streamlined, document-centric experience.
Installation Commands:
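A typical pip command; once installed, the notebook is started with the jupyter notebook command:
Command Prompt : py -m pip install jupyter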
7. Statsmodels
Statsmodels is a Python package that allows users to explore data, estimate statistical
models, and perform statistical tests.
Installation Commands:
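A typical pip command:
Command Prompt : py -m pip install statsmodels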
8. Seaborn
Seaborn is a Python data visualization library based on matplotlib. It provides a high-level
interface for drawing attractive and informative statistical graphics.
Installation Commands:
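The usual pip installation:
Command Prompt : py -m pip install seaborn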
9. Plotly
Plotly is a technical computing company headquartered in Montreal, Quebec, that
develops online data analytics and visualization tools. Plotly provides online graphing,
analytics, and statistics tools for individuals and collaboration, as well as scientific graphing
libraries for Python, R, MATLAB, Perl, Julia, Arduino, and REST.
Installation Commands:
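A typical pip command for the Python library:
Command Prompt : py -m pip install plotly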
10. Bokeh
Bokeh is a Python library for creating interactive visualizations for modern web browsers. It
helps you build beautiful graphics, ranging from simple plots to complex dashboards with
streaming datasets.
Installation Commands:
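The usual pip installation:
Command Prompt : py -m pip install bokeh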
1. import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
Output
[1 2 3 4 5]
2. import pandas as pd
arr = pd.Series([1, 2, 3, 4, 5])
print(arr)
3. Draw a line in a diagram from position (0,0) to position (6,250) using Matplotlib
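A minimal sketch of this example, assuming the standard Matplotlib plot() call; the scatter code that follows is a separate example that plots two arrays against each other.
import matplotlib.pyplot as plt
import numpy as np
xpoints = np.array([0, 6])
ypoints = np.array([0, 250])
plt.plot(xpoints, ypoints)   # straight line from (0, 0) to (6, 250)
plt.show()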
x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
plt.scatter(x, y)
plt.show()
Example
Specify a new color for each wedge:
import matplotlib.pyplot as plt
import numpy as np
y = np.array([35, 25, 25, 15])
mylabels = ["Apples", "Bananas", "Cherries", "Dates"]
mycolors = ["black", "hotpink", "b", "#4CAF50"]
plt.pie(y, labels = mylabels, colors = mycolors)
plt.show()
Result
Thus the NumPy, SciPy, Jupyter, Statsmodels and Pandas packages were downloaded, installed and explored using the pip command, and the basic programs were executed.
NUMPY
EX. No:2
Working with Numpy Arrays
Date:
Aim:
To write Python programs for creating and manipulating NumPy arrays and performing basic operations on them.
Algorithm:
1. Start the Program.
2. Import Numpy Library.
3. Perform operation with Numpy Array.
4. Display the output.
5. Stop the Program.
Numpy:
NumPy stands for Numerical Python. It is a Python library used for working with arrays. In Python, lists serve the purpose of arrays, but they are slow to process. The NumPy array is a powerful N-dimensional array object used in linear algebra, Fourier transforms, and random number capabilities. It provides an array object that is much faster than traditional Python lists.
1. numpy.array(): The Numpy array object in Numpy is called ndarray. We can create ndarray
using numpy.array() function.
Syntax: numpy.array(parameter)
2. numpy.arange(): This is an inbuilt NumPy function that returns evenly spaced values
within a given interval.
4. numpy.empty(): This function creates a new array of the given shape and type, without initializing values.
5. numpy.ones(): This function is used to get a new array of given shape and type, filled
with ones(1).
6. numpy.zeros(): This function is used to get a new array of given shape and type, filled
with zeros(0).
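A short sketch illustrating the constructors listed above (values printed by empty() will vary, since the memory is not initialized):
import numpy as np
print(np.array([1, 2, 3]))     # ndarray built from a Python list
print(np.arange(0, 10, 2))     # evenly spaced values in [0, 10): [0 2 4 6 8]
print(np.empty((2, 2)))        # 2x2 array, values left uninitialized
print(np.ones((2, 3)))         # 2x3 array filled with ones
print(np.zeros(4))             # 1-D array of four zeros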
NumPy is used to work with arrays. The array object in NumPy is called
ndarray. We can create a NumPy ndarray object by using the array() function.
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
OUTPUT
[1 2 3 4 5]
Dimensions in Arrays
Example
Create a 2-D array containing two arrays with the values 1,2,3 and 4,5,6:
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr)
OUTPUT
[[1 2 3]
 [4 5 6]]
NumPy arrays provide the ndim attribute that returns an integer telling us how many dimensions the array has.
Example
import numpy as np
a = np.array(42)
b = np.array([1, 2, 3, 4, 5])
c = np.array([[1, 2, 3], [4, 5, 6]])
d = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(a.ndim)
print(b.ndim)
print(c.ndim)
print(d.ndim)
When the array is created, you can define the number of dimensions by using
the ndmin argument.
Example
import numpy as np
arr = np.array([1, 2, 3, 4], ndmin=5)
print(arr)
print('number of dimensions :', arr.ndim)
OUTPUT
[[[[[1 2 3 4]]]]]
number of dimensions : 5
The indexes in NumPy arrays start with 0, meaning that the first element has index 0, and the
second has index 1 etc.
Example
import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr[0])
OUTPUT
1
Slicing in Python means taking elements from one given index to another given index.
<slice> = <array>[start:stop]
To slice a one-dimensional array, we provide a start and a stop index separated by a colon (:). The range starts at the start index and ends one before the stop index.
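A small sketch of slicing a one-dimensional array:
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[1:5])   # start at index 1, stop before index 5 -> [2 3 4 5]
print(arr[:3])    # from the beginning up to index 3 (exclusive) -> [1 2 3]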
Program:
import numpy as np
a = np.array([(8,9,10),(11,12,13)])
print(a)
a = a.reshape(3,2)
print(a)
Output:
[[ 8  9 10]
 [11 12 13]]
[[ 8  9]
 [10 11]
 [12 13]]
Aggregations
The Python numpy aggregate functions are sum, min, max, mean, average, product,
median, standard deviation, variance, argmin, argmax, percentile, cumprod, cumsum, and
corrcoef.
Functions
The min() function returns the item with the lowest value in an iterable.
min(iterable)
The max() function returns the item with the highest value in an iterable.
max(iterable)
In Machine Learning (and in mathematics) there are often three values that interest us: the mean, the median and the mode.
Mean
import numpy
speed=[99,86,87,88,111,86,103,87,94,78,77,85,86]
x=numpy.mean(speed)
print(x)
Output
89.76923076923077
Median
import numpy
speed=[99,86,87,88,111,86,103,87,94,78,77,85,86]
x=numpy.median(speed)
print(x)
Output
87.0
Mode
NumPy itself has no mode() function; the mode can be computed with the statistics module.
import statistics
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
x = statistics.multimode(speed)
print(x)
Output
[86, 87]
Standard deviation is a number that describes how spread out the values are.
import numpy
speed=[86,87,88,86,87,85,86]
x=numpy.std(speed)
print(x)
Output
0.90
Joining NumPy Arrays
Example
import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.concatenate((arr1, arr2))
print(arr)
OUTPUT
[1 2 3 4 5 6]
Splitting NumPy Arrays
Joining merges multiple arrays into one and Splitting breaks one array into multiple.
We use array_split() for splitting arrays, we pass it the array we want to split and the number of
splits.
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6])
newarr = np.array_split(arr, 3)
print(newarr)
OUTPUT
[array([1, 2]), array([3, 4]), array([5, 6])]
To create your own ufunc, you have to define a function, like you do with normal functions in Python, then you add it to your NumPy ufunc library with the frompyfunc() method.
Example
import numpy as np
def myadd(x, y):
    return x + y
myadd = np.frompyfunc(myadd, 2, 1)
print(myadd([1, 2, 3, 4], [5, 6, 7, 8]))
OUTPUT
[6 8 10 12]
Sorting Arrays
Ordered sequence is any sequence that has an order corresponding to elements, like numeric or
alphabetical, ascending or descending.
The NumPy ndarray object has a function called sort(), that will sort a specified array.
Example
import numpy as np
arr = np.array([3, 2, 0, 1])
print(np.sort(arr))
OUTPUT
[0 1 2 3]
Result :
Thus python programs for creating and accessing arrays have been executed.
PANDAS
EX.No:3
Working with Pandas data frames
Date:
Aim:
To write Python programs for using pandas data frames and accessing it.
Algorithm:
1. Start the Program.
2. Import Numpy & Pandas Packages.
3. Create a Dataframe for the list of elements.
4. Load a Dataset from an external source into a pandas dataframe
5. Display the Output.
6. Stop the Program
Import Pandas
Once Pandas is installed, import it in your applications by adding the import keyword:
import pandas
SERIES
A Pandas Series is like a column in a table. It is a one-dimensional array holding data of any type.
Example
Create a simple Pandas Series from a list:
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)
Output:
0    1
1    7
2    2
dtype: int64
DataFrame
A Pandas DataFrame is a 2 dimensional data structure, like a 2 dimensional array, or a
table with rows and columns.
Example
Load a comma separated file (CSV file) into a DataFrame:
import pandas as pd
df = pd.read_csv('data.csv')
print(df)
Output:
     Duration  Pulse  Maxpulse  Calories
0 60 110 130 409.1
1 60 117 145 479.0
2 60 103 135 340.0
3 45 109 175 282.4
4 45 117 148 406.0
.. ... ... ... ...
164 60 105 140 290.8
165 60 110 145 300.4
166 60 115 145 310.2
167 75 120 150 320.4
168 75 125 150 330.4
In a DataFrame, many datasets simply arrive with missing data, either because it exists and was not collected or it never existed.
NaN: NaN (an acronym for Not a Number) is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation.
Loading Data
First of all, let’s import the libraries.
Missing data in Pandas is represented by the value NaN (Not a Number).
You can use the isnull method to see missing data in the data.
The notnull method does the opposite of the isnull method. Let me show that.
If you want to remove the missing data, you can use the dropna method.
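A minimal sketch of these three methods on a small frame with one missing value per column (the data is illustrative, not the lab data set):
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [4, np.nan, 6]})
print(df.isnull())    # True where a value is missing
print(df.notnull())   # True where a value is present
print(df.dropna())    # keeps only the rows without missing values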
Hierarchical Indexes
Data frames can have hierarchical indexes. To show this, let me create a dataset.
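A minimal sketch of a hierarchically indexed DataFrame, with assumed values:
import pandas as pd
index = pd.MultiIndex.from_tuples(
    [('A', 1), ('A', 2), ('B', 1), ('B', 2)], names=['group', 'item'])
df = pd.DataFrame({'value': [10, 20, 30, 40]}, index=index)
print(df)            # two index levels: group and item
print(df.loc['A'])   # all rows under group 'A'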
import pandas as pd
df = pd.DataFrame([[9, 4, 8, 9],
[8, 10, 7, 6],
[7, 6, 8, 5]],
columns=['Maths', 'English',
'Science', 'History'])
print(df)
OUTPUT
   Maths  English  Science  History
0      9        4        8        9
1      8       10        7        6
2      7        6        8        5
1. df.sum()
2. df.describe()
We used agg() function to calculate the sum, min, and max of each column in our dataset.
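A sketch of the agg() call described above, applied to the marks DataFrame defined earlier:
print(df.agg(['sum', 'min', 'max']))   # one row of results per aggregate function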
Result :
Thus python programs for creating and accessing data frames using Pandas have been
executed and verified.
EX.No:4
Reading data from text files, Excel and the web and exploring various commands for doing descriptive analytics on the Iris data set
Date:
Related CO & PO: CO3 & PO1,2,3,4,5,9,10,11,12
Aim:
To read data from text files, Excel and the web, and to explore various commands for doing descriptive analytics on the Iris data set:
https://www.kaggle.com/datasets/arshid/iris-flower-dataset
Descriptive Analysis:
Descriptive analysis, also known as descriptive analytics or descriptive statistics, is the process
of using statistical techniques to describe or summarize a set of data.
The Iris dataset is considered the "Hello World" of data science. It contains five columns, namely Petal Length, Petal Width, Sepal Length, Sepal Width, and Species Type. Iris is a flowering plant; researchers have measured various features of the different iris flowers and recorded them digitally.
Syntax:
data = pandas.read_csv('filename.txt', sep=' ', header=None, names=["Column1", "Column2"])
Parameters:
filename.txt: As the name suggests it is the name of the text file from which we want to
read data.
sep: It is a separator field. In the text file, we use the space character(‘ ‘) as the separator.
header: This is an optional field. By default, the first line of the text file is taken as the header. If we use header=None, no line is treated as the header and default column names are assigned.
names: We can assign column names while importing the text file by using the names
argument.
Program:
# importing pandas
import pandas as pd
# read text file into pandas DataFrame
df = pd.read_csv("gfg.txt", sep=" ")
# display DataFrame
print(df)
Output:
import pandas as pd
url="https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
c=pd.read_csv(url)
Program
import pandas as pd
df = pd.read_excel(r'C:\Users\Ron\Desktop\products.xlsx')
print(df)
Output
product_name price
0 computer 700
1 tablet 250
2 printer 120
3 laptop 1200
4 Keyboard 100
To load this CSV file, we convert it into a DataFrame. The read_csv() method is used to read CSV files.
import pandas as pd
df = pd.read_csv('Iris.csv')   # Iris data set, e.g. from the Kaggle link above
OUTPUT
df.shape
Output:
(150, 6)
df.describe()   # summary statistics: count, mean, std, min, quartiles and max for each numeric column
df.isnull().sum()
An Outlier is a data-item/object that deviates significantly from the rest of the (so-called
normal)objects.
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
sns.boxplot(x='SepalWidthCm', data=df)
OUTPUT
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
sns.countplot(x='Species', data=df, )
plt.show()
OUTPUT
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
sns.scatterplot(x='SepalLengthCm', y='SepalWidthCm',hue='Species', data=df, )
OUTPUT
Result:
Thus reading data from text files, Excel and the web, and exploring various commands for doing descriptive analytics on the Iris data set, were completed and executed successfully.
EX.No:5
UCI and Pima Indians Diabetes data set
Date:
Aim:
To use the diabetes data set from UCI and the Pima Indians Diabetes data set for performing the following:
a. Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard
Deviation, Skewness and Kurtosis.
b. Bivariate analysis: Linear and logistic regression modeling
c. Multiple Regression analysis
d. Also compare the results of the above analysis for the two data sets.
Algorithm:
Use the diabetes data set from UCI and Pima Indians Diabetes data set for performing the
following:
a. Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard
Deviation, Skewness and Kurtosis.
b. Bivariate analysis: Linear and logistic regression modeling
c. Multiple Regression analysis
d. Also compare the results of the above analysis for the two data sets.
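The statistics below were produced from a housing data file; a sketch of the loading step, assuming the hou_all.csv file also used in Ex. No. 6:
>>> import pandas as pd
>>> df = pd.read_csv('C:\\Users\\Admin\\Downloads\\hou_all.csv')
>>> df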
.. ... ... ... .. ... ... ... ... .. ... ... ... ... ... ...
500 0.06263 0.0 11.93 0 0.573 6.593 69.1 2.4786 1 273 21.0 391.99 9.67 22.4 1
501 0.04527 0.0 11.93 0 0.573 6.120 76.7 2.2875 1 273 21.0 396.90 9.08 20.6 1
502 0.06076 0.0 11.93 0 0.573 6.976 91.0 2.1675 1 273 21.0 396.90 5.64 23.9 1
503 0.10959 0.0 11.93 0 0.573 6.794 89.3 2.3889 1 273 21.0 393.45 6.48 22.0 1
504 0.04741 0.0 11.93 0 0.573 6.030 80.8 2.5050 1 273 21.0 396.90 7.88 11.9 1
[505 rows x 15 columns]
>>>df.describe()
0.00632 18 2.31 0 ... 396.9 4.98 24 1.1
15.3 18.461782
396.9 356.594376
4.98 12.668257
24 22.529901
1.1 1.000000
dtype: float64
>>>df.median()
0.00632 0.25915
18 0.00000
2.31 9.69000
0 0.00000
0.538 0.53800
6.575 6.20800
65.2 77.70000
4.09 3.19920
1 5.00000
296 330.00000
15.3 19.10000
396.9 391.43000
4.98 11.38000
24 21.20000
1.1 1.00000
dtype: float64
df.mode()
0.00632 18 2.31 0 0.538 6.575 65.2 4.09 1 296 15.3 396.9 4.98 24 1.1
0 0.01501 0.0 18.1 0.0 0.538 5.713 100.0 3.4952 24.0 666.0 20.2 396.9 6.36 50.0 1.0
1 14.33370  NaN  NaN  NaN  NaN  6.127  NaN  NaN  NaN  NaN  NaN  NaN  7.79  NaN  NaN
>>>df.std()
0.00632 8.608572
18 23.343704
2.31 6.855868
0 0.254227
0.538 0.115990
6.575 0.703195
65.2 28.176371
4.09 2.107757
1 8.707553
296 168.629992
15.3 2.162520
396.9 91.367787
4.98 7.139950
24 9.205991
1.1 0.000000
dtype: float64
>>>df.var()
0.00632 74.107509
18 544.928497
2.31 47.002931
0 0.064631
0.538 0.013454
6.575 0.494483
65.2 793.907894
4.09 4.442640
1 75.821484
296 28436.074242
15.3 4.676493
396.9 8348.072540
4.98 50.978891
24 84.750275
1.1 0.000000
dtype: float64
>>>df.value_counts()
0.00632 18 2.31 0 0.538 6.575 65.2 4.09 1 296 15.3 396.9 4.98 24 1.1
0.00906 90.0 2.97 0 0.400 7.088 20.8 7.3073 1 285 15.3 394.72 7.85 32.2 1 1
1.05393 0.0 8.14 0 0.538 5.935 29.3 4.4986 4 307 21.0 386.85 6.58 23.1 1 1
1.41385 0.0 19.58 1 0.871 6.129 96.0 1.7494 5 403 14.7 321.02 15.12 17.0 1 1
1.38799 0.0 8.14 0 0.538 5.950 82.0 3.9900 4 307 21.0 232.60 27.71 13.2 1 1
1.35472 0.0 8.14 0 0.538 6.072 100.0 4.1750 4 307 21.0 376.73 13.04 14.5 1 1
..
0.11069 0.0 13.89 1 0.550 5.951 93.8 2.8893 5 276 16.4 396.90 17.92 21.5 1 1
0.11027 25.0 5.13 0 0.453 6.456 67.8 7.2255 8 284 19.7 396.90 6.73 22.2 1 1
0.10959 0.0 11.93 0 0.573 6.794 89.3 2.3889 1 273 21.0 393.45 6.48 22.0 1 1
0.10793 0.0 8.56 0 0.520 6.195 54.4 2.7778 5 384 20.9 393.49 13.00 21.7 1 1
88.97620 0.0 18.10 0 0.671 6.968 91.9 1.4165 24 666 20.2 396.90 17.21 10.4 1 1
Length: 505, dtype: int64
LINEAR ANALYSIS
>>> import pandas as pd
>>>df=pd.read_csv('C:\\Users\\Admin\\Downloads\\diabetes.csv')
>>>df
Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin  BMI  DiabetesPedigreeFunction  Age  Outcome
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
.. ... ... ... ... ... ... ... ... ...
import statsmodels.api as sm
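A minimal sketch of fitting a simple linear regression with statsmodels; the choice of Glucose as predictor and Outcome as response is an assumption for illustration:
import pandas as pd
import statsmodels.api as sm
df = pd.read_csv('diabetes.csv')        # path assumed
X = sm.add_constant(df['Glucose'])      # predictor with an intercept term
model = sm.OLS(df['Outcome'], X).fit()
print(model.summary())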
OUTPUT
OLS Regression Results
==============================================================================
Dep. Variable: score R-squared: 0.794
Model: OLS Adj. R-squared: 0.783
Method: Least Squares F-statistic: 69.56
Date: Sat, 29 Oct 2022 Prob (F-statistic): 1.35e-07
Time: 08:36:32 Log-Likelihood: -55.886
No. Observations: 20 AIC: 115.8
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Univariate analysis
>>>df.mean()
Pregnancies 3.845052
Glucose 120.894531
BloodPressure 69.105469
SkinThickness 20.536458
Insulin 79.799479
BMI 31.992578
DiabetesPedigreeFunction 0.471876
Age 33.240885
Outcome 0.348958
dtype: float64
>>>df.median()
Pregnancies 3.0000
Glucose 117.0000
BloodPressure 72.0000
SkinThickness 23.0000
Insulin 30.5000
BMI 32.0000
DiabetesPedigreeFunction 0.3725
Age 29.0000
Outcome 0.0000
dtype: float64
>>>df['Age'].mean()
33.240885416666664
>>>df.mode()
Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin  BMI  DiabetesPedigreeFunction  Age  Outcome
0   1.0   99   70.0   0.0   0.0   32.0   0.254   22.0   0.0
1   NaN  100    NaN   NaN   NaN    NaN   0.258    NaN   NaN
>>>df.var()
Pregnancies 11.354056
Glucose 1022.248314
BloodPressure 374.647271
SkinThickness 254.473245
Insulin 13281.180078
BMI 62.159984
DiabetesPedigreeFunction 0.109779
Age 138.303046
Outcome 0.227483
dtype: float64
>>>df.std()
Pregnancies 3.369578
Glucose 31.972618
BloodPressure 19.355807
SkinThickness 15.952218
Insulin 115.244002
BMI 7.884160
DiabetesPedigreeFunction 0.331329
Age 11.760232
Outcome 0.476951
dtype: float64
Frequency
>>>df['Age'].value_counts()
22 72
21 63
25 48
24 46
23 38
28 35
26 33
27 32
29 29
31 24
41 22
30 21
37 19
42 18
33 17
38 16
36 16
32 16
45 15
34 14
46 13
43 13
40 13
39 12
35 10
50 8
51 8
52 8
44 8
58 7
47 6
54 6
49 5
48 5
57 5
53 5
60 5
66 4
63 4
62 4
55 4
67 3
56 3
59 3
65 3
69 2
61 2
72 1
81 1
64 1
70 1
68 1
Name: Age, dtype: int64
>>>df.skew()
Pregnancies 0.901674
Glucose 0.173754
BloodPressure -1.843608
SkinThickness 0.109372
Insulin 2.272251
BMI -0.428982
DiabetesPedigreeFunction 1.919911
Age 1.129597
Outcome 0.635017
dtype: float64
>>>df.kurt()
Pregnancies 0.159220
Glucose 0.640780
BloodPressure 5.180157
SkinThickness -0.520072
Insulin 7.214260
BMI 3.290443
DiabetesPedigreeFunction 5.594954
Age 0.643159
Outcome -1.600930
dtype: float64
Logistic Regression
import statsmodels.formula.api as smf
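A minimal sketch of a logistic regression with the statsmodels formula API; Glucose and BMI as predictors of Outcome is an assumed choice:
import pandas as pd
import statsmodels.formula.api as smf
df = pd.read_csv('diabetes.csv')        # path assumed
model = smf.logit('Outcome ~ Glucose + BMI', data=df).fit()
print(model.summary())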
Output
OLS Regression Results
==============================================================================
Dep. Variable:                  score   R-squared:                       0.794
Model:                            OLS   Adj. R-squared:                  0.783
Method:                 Least Squares   F-statistic:                     69.56
Date:                Sat, 29 Oct 2022   Prob (F-statistic):           1.35e-07
Time:                        09:17:14   Log-Likelihood:                -55.886
No. Observations:                  20   AIC:                             115.8
Df Residuals:                      18   BIC:                             117.8
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
==============================================================================
Omnibus:                        0.171   Durbin-Watson:                   1.404
Prob(Omnibus):                  0.918   Jarque-Bera (JB):                0.177
Skew:                           0.165   Prob(JB):                        0.915
Kurtosis:                       2.679   Cond. No.                         9.37
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
MULTIPLE REGRESSION
import pandas as pd
import statsmodels.api as sm
from sklearn import linear_model
df = pd.read_csv('C:\\Users\\Admin\\Downloads\\diabetes.csv')
x = df[['Glucose', 'BMI', 'Age']]   # predictor columns (an assumed, illustrative choice)
y = df['Outcome']                   # response column
# with sklearn
regr = linear_model.LinearRegression()
regr.fit(x, y)
print('Intercept: \n', regr.intercept_)
print('Coefficients: \n', regr.coef_)
# with statsmodels
x = sm.add_constant(x)  # adding a constant
model = sm.OLS(y, x).fit()
print_model = model.summary()
print(print_model)
Result :
Thus Python programs were written for performing univariate, bivariate and multiple regression analysis, and executed successfully.
EX.No:6
Apply and explore various plotting functions on UCI data sets.
Date:
Aim:
To apply and explore various plotting functions on UCI data sets for performing the following:
a. Normal curves
b. Density and contour plots
c. Correlation and scatter plots
d. Histograms
e. Three-dimensional plotting
Normal Curve :
Program:
import pandas as pd
>>>df=pd.read_csv('C:\\Users\\Admin\\Downloads\\hou_all.csv')
>>>df
Output:
Program:
import matplotlib.pyplot as plt
plt.plot(df.set)
[<matplotlib.lines.Line2D object at 0x000001AB280313F0>]
>>>plt.show()
Output:
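The plot above simply draws the raw values; an actual normal (bell) curve can be drawn from a sample's mean and standard deviation. A minimal sketch with assumed data:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
data = np.random.normal(loc=100, scale=15, size=1000)    # assumed sample values
x = np.linspace(data.min(), data.max(), 200)
plt.plot(x, norm.pdf(x, data.mean(), data.std()))         # fitted normal curve
plt.show()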
Scatter Plot:
A scatter plot is also called a scatter chart, scattergram, or XY graph. The scatter diagram graphs numerical data pairs, with one variable on each axis, to show their relationship.
Program:
>>>plt.scatter(df.set,df.value)
<matplotlib.collections.PathCollection object at 0x000001AB27E9ED70>
>>>plt.show()
Output:
Histograms :
A histogram is a graphical representation that organizes a group of data points into user-
specified ranges and an approximate representation of the distribution of numerical data.
Syntax: plt.hist(x, bins=None, range=None, color=None, ...)
Program:
plt.hist(df)
(array([[504., 2., 0., 0., 0., 0., 0., 0., 0., 0.],
[474., 32., 0., 0., 0., 0., 0., 0., 0., 0.],
[506., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[506., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[506., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[506., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[226., 280., 0., 0., 0., 0., 0., 0., 0., 0.],
[506., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[506., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 17., 123., 130., 71., 28., 0., 0., 137.],
[506., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 24., 12., 4., 9., 43., 414., 0., 0., 0., 0.],
[506., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[506., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[506., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]), array([ 0. , 71.1, 142.2,
213.3, 284.4, 355.5, 426.6, 497.7, 568.8,
639.9, 711. ]), <a list of 15 BarContainer objects>)
>>>plt.show()
Output:
Correlation :
Program:
df.corr()
Output:
Contour Plot :
Program:
>>>plt.contour(df)
<matplotlib.contour.QuadContourSet object at 0x000001AB27F0E740>
>>>plt.show()
Output:
Density Plot :
Syntax: df.plot.density() (or seaborn's kdeplot(x))
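A sketch of a density (kernel density estimate) plot, assuming seaborn is installed and df is the frame loaded above; the column choice is illustrative:
import seaborn as sns
import matplotlib.pyplot as plt
sns.kdeplot(df.iloc[:, 0])   # smooth density of the first column
plt.show()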
Three-dimensional Plotting :
The most basic three-dimensional plot is a line or collection of scatter points created from sets of (x, y, z) triples.
Program:
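A minimal sketch of a three-dimensional line and scatter plot with Matplotlib's mplot3d toolkit, using assumed sample data:
import numpy as np
import matplotlib.pyplot as plt
ax = plt.axes(projection='3d')
zline = np.linspace(0, 15, 100)
xline = np.sin(zline)
yline = np.cos(zline)
ax.plot3D(xline, yline, zline)       # 3-D line
ax.scatter3D(xline, yline, zline)    # 3-D scatter points
plt.show()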
Output:
Result :
Thus python programs for exploring various plots using matplotlib were executed
successfully.
EX.No:7
Visualizing Geographic Data with Basemap
Date:
Aim:
To visualize geographic data using Matplotlib's Basemap toolkit.
Algorithm:
1. Start the program.
2. Import the Basemap toolkit and Matplotlib.
3. Create a Basemap object and draw map features such as coastlines.
4. Display the map.
5. Stop the program.
Base Map
A common type of visualization in data science is that of geographic data. Matplotlib's main tool for this type of visualization is the Basemap toolkit, which is one of several Matplotlib toolkits that live under the mpl_toolkits namespace.
Basemap is a Matplotlib extension used to visualize and create geographical maps in Python.
Among other things, Basemap can draw:
- Political boundaries
- Map features
- Whole-globe images
Program:
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
m = Basemap()
m.drawcoastlines()
plt.title("Coastlines", fontsize=20)
plt.show()
Output:
College Coordinates
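The actual coordinates are not given here; a sketch of marking a location on the map, with placeholder latitude and longitude values:
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
m = Basemap()
m.drawcoastlines()
lat, lon = 13.08, 80.27            # placeholder values, not the real college location
x, y = m(lon, lat)                 # convert to map coordinates
m.plot(x, y, 'ro', markersize=8)
plt.title("College Coordinates")
plt.show()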
Result:
Thus Base map toolkit was used to visualize the geographic data.
EX.No:8
Predict Income with Census Data
Date:
Aim
To develop Census Income Project Using Python.
Problem Statement:
The prediction task is to determine whether a person makes over $50K a year or not.
Program
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from plotly.offline import iplot
import plotly as py
py.offline.init_notebook_mode(connected=True)
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import warnings
warnings.filterwarnings('ignore')
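A typical loading step, with an assumed file name for the census CSV:
df_census = pd.read_csv('adult.csv')   # file name assumed
df_census.head()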
df_census.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32560 entries, 0 to 32559
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 32560 non-null int64
1 Workclass 32560 non-null object
2 Fnlwgt 32560 non-null int64
3 Education 32560 non-null object
4 Education_num 32560 non-null int64
5 Marital_status 32560 non-null object
6 Occupation 32560 non-null object
7 Relationship 32560 non-null object
8 Race 32560 non-null object
9 Sex 32560 non-null object
10 Capital_gain 32560 non-null int64
11 Capital_loss 32560 non-null int64
12 Hours_per_week 32560 non-null int64
13 Native_country 32560 non-null object
14 Income 32560 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
The data-set contains 32560 rows and 14 features + the target variable (Income). 6 are integers and 9
are objects. Below I have listed the features with a short description:
0 Age : The age of people
1 Workclass : Type of job
2 Fnlwgt : Final weight
3 Education : Education status
4 Education_num : Number of years of education in total
5 Marital_status : Marital_status
6 Occupation : Occupation
7 Relationship : Relationship
8 Race : Race of the person
9 Sex : Gender
10 Capital_gain : Capital gain (profit earned)
11 Capital_loss : Capital loss (amount lost)
12 Hours_per_week : Hours worked per week
13 Native_country : Native country
14 Income : Income (target variable)
df_census.describe()
df_census['Income'].value_counts().plot(kind='bar')
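A minimal sketch of training one of the imported classifiers, assuming the object (text) columns have already been label-encoded so every feature is numeric; it reuses the imports from the program above:
X = df_census.drop('Income', axis=1)
y = df_census['Income']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
model = LogisticRegression()
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))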
Result
Thus the Census Income prediction project was developed using Python and executed successfully.
SAMPLE EXERCISE
Write a NumPy program to create a null vector of size 10 and update sixth
value to 11
Python Code :
import numpy as np
x = np.zeros(10)
print(x)
x[6] = 11
print(x)
Sample Output:
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 11. 0. 0. 0.]
Write a NumPy program to convert an array to a float type
Python Code :
import numpy as np
a = [1, 2, 3, 4]
print("Original array")
print(a)
x = np.asfarray(a)
print(x)
Sample Output:
Original array
[1, 2, 3, 4]
[ 1. 2. 3. 4.]
Write a NumPy program to create a 3x3 matrix with values ranging from 2 to 10
Python Code :
import numpy as np
x = np.arange(2, 11).reshape(3,3)
print(x)
Sample Output:
[[ 2 3 4]
[ 5 6 7]
[ 8 9 10]]
Write a NumPy program to convert a list of numeric values into a one-dimensional NumPy array
Python Code :
import numpy as np
l = [12.23, 13.32, 100, 36.32]
print("Original List:", l)
a = np.array(l)
print("One-dimensional NumPy array:", a)
Sample Output:
Original List: [12.23, 13.32, 100, 36.32]
One-dimensional NumPy array: [ 12.23  13.32 100.   36.32]
Write a NumPy program to create an empty and a full array
Python Code :
import numpy as np
x = np.empty((3,4))
print(x)
y = np.full((3,3),6)
print(y)
Sample Output:
Write a NumPy program to convert a list and tuple into arrays
Python Code :
import numpy as np
my_list = [1, 2, 3, 4, 5, 6, 7, 8]
print("List to array:")
print(np.asarray(my_list))
my_tuple = ([8, 4, 6], [1, 2, 3])
print("Tuple to array:")
print(np.asarray(my_tuple))
Sample Output:
List to array:
[1 2 3 4 5 6 7 8]
Tuple to array:
[[8 4 6]
[1 2 3]]
Write a NumPy program to find the real and imaginary parts of an array of complex numbers
Python Code :
import numpy as np
x = np.sqrt([1+0j])
y = np.sqrt([0+1j])
print("Real part of the array:")
print(x.real)
print(y.real)
print("Imaginary part of the array:")
print(x.imag)
print(y.imag)
Sample Output
Real part of the array:
[ 1.]
[ 0.70710678]
Imaginary part of the array:
[ 0.]
[ 0.70710678]
Write a Pandas program to get the powers of an array values element-wise.
Note: First array elements raised to powers from second array
Expected Output:
    X   Y   Z
0  78  84  86
1  85  94  97
2  96  89  96
3  80  83  72
4  86  86  83
Python Code :
import pandas as pd
df = pd.DataFrame({'X':[78,85,96,80,86], 'Y':[84,94,89,83,86],'Z':[86,97,96,72,83]});
print(df)
Sample Output:
    X   Y   Z
0  78  84  86
1  85  94  97
2  96  89  96
3  80  83  72
4  86  86  83
Write a Pandas program to select the specified columns and rows from a given data frame.
Sample Python dictionary data and list labels:
Select the 'score' and 'qualify' columns in rows 1, 3, 5, 6 from the following data frame.
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
            'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
            'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
            'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
Expected Output:
Select specific columns and rows:
   score qualify
b    9.0      no
d    NaN      no
f   20.0     yes
g   14.5     yes
Python Code :
import pandas as pd
import numpy as np
# exam_data and labels as given above
df = pd.DataFrame(exam_data, index=labels)
print("Select specific columns and rows:")
print(df.iloc[[1, 3, 5, 6], [1, 3]])
Write a Pandas program to count the number of rows and columns of a DataFrame.
Python Code :
import pandas as pd
import numpy as np
# exam_data and labels as defined in the previous exercise
df = pd.DataFrame(exam_data, index=labels)
total_rows = len(df.axes[0])
total_cols = len(df.axes[1])
print("Number of Rows: " + str(total_rows))
print("Number of Columns: " + str(total_cols))
Sample Output:
Number of Rows: 10
Number of Columns: 4
Reading data from text files, Excel and the web and exploring various commands for doing descriptive analytics on the Iris data set
Use the diabetes data set from Pima Indians Diabetes data set for performing the following:
Frequency
Mean,
Median,
Mode,
Variance
Standard Deviation
Skewness and Kurtosis
Use the diabetes data set from Pima Indians Diabetes data set for performing the following:
Apply Bivariate analysis:
Use the diabetes data set from Pima Indians Diabetes data set for performing the following:
# data and target as loaded from the chosen data set
print(data)
print(target)
# defining feature matrix (X) and response vector (Y)
X = data
Y = target
# splitting X and Y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=101)
Apply and explore various plotting functions on UCI data set for performing the following:
a) Normal curves
b) Density and contour plots
c) Three-dimensional plotting
Apply and explore various plotting functions on UCI data set for performing the following:
Apply and explore various plotting functions on Pima Indians Diabetes data set for performing the following:
a) Normal curves
b) Density and contour plots
c) Three-dimensional plotting
Apply and explore various plotting functions on Pima Indians Diabetes data set for performing the following:
Write a Pandas program to count the number of columns of a DataFrame.
Sample Output:
Original DataFrame
   col1  col2  col3
0     1     4     7
1     2     5     8
2     3     6    12
3     4     9     1
4     7     5    11
Number of columns:
3
Python Code :
import pandas as pd
d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11]}
df = pd.DataFrame(data=d)
print("Original DataFrame")
print(df)
print("\nNumber of columns:")
print(len(df.columns))
Write a Pandas program to group by the first column and get second column as lists in rows
Sample data:
Original DataFrame
col1 col2
0 C1 1
1 C1 2
2 C2 3
3 C2 3
4 C2 4
5 C3 6
6 C2 5
Group on the col1:
col1
C1 [1, 2]
C2 [3, 3, 4, 5]
C3 [6]
Name: col2, dtype: object
Python Code :
import pandas as pd
df = pd.DataFrame({'col1': ['C1', 'C1', 'C2', 'C2', 'C2', 'C3', 'C2'],
                   'col2': [1, 2, 3, 3, 4, 6, 5]})
print("Original DataFrame")
print(df)
print("\nGroup on the col1:")
df = df.groupby('col1')['col2'].apply(list)
print(df)
Sample Output:
Original DataFrame
col1 col2
0 C1 1
1 C1 2
2 C2 3
3 C2 3
4 C2 4
5 C3 6
6 C2 5
Write a Pandas program to check whether a given column is present in a DataFrame or not.
Sample data:
Original DataFrame
   col1  col2  col3
0     1     4     7
1     2     5     8
2     3     6    12
3     4     9     1
4     7     5    11
Col4 is not present in DataFrame.
Col1 is present in DataFrame.
Python Code :
import pandas as pd
d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11]}
df = pd.DataFrame(data=d)
print("Original DataFrame")
print(df)
if 'col4' in df.columns:
    print("Col4 is present in DataFrame.")
else:
    print("Col4 is not present in DataFrame.")
if 'col1' in df.columns:
    print("Col1 is present in DataFrame.")
else:
    print("Col1 is not present in DataFrame.")
Sample Output:
Original DataFrame
col1 col2 col3
0 1 4 7
1 2 5 8
2 3 6 12
3 4 9 1
4 7 5 11
Col4 is not present in DataFrame.
Col1 is present in DataFrame.
Write a NumPy program to count the number of instances of a value occurring in one array on the condition of another array.
Sample Output:
Original arrays:
[ 10 -10 10 -10 -10 10]
[0.85 0.45 0.9 0.8 0.12 0.6 ]
Number of instances of a value occurring in one array on the
condition of another array:3
Python Code :
import numpy as np
x = np.array([10,-10,10,-10,-10,10])
y = np.array([.85,.45,.9,.8,.12,.6])
print("Original arrays:")
print(x)
print(y)
result = np.sum((x == 10) & (y > 0.5))   # count positions where x is 10 and y is greater than 0.5
print(result)
Sample Output:
Original arrays:
[ 10 -10 10 -10 -10 10]
[0.85 0.45 0.9 0.8 0.12 0.6 ]
Number of instances of a value occurring in one array on the condition of another array:
3
Python Code:
import numpy as np
np_array = np.array([[1, 2, 3], [2, 1, 2]])   # values taken from the sample output below
print(np_array)
print("Type: ", type(np_array))
print("Sequence: 1,2",)
Sample Output:
[[1 2 3]
[2 1 2]]
Sequence: 1,2
Write a NumPy program to merge three given NumPy arrays of same shape
Python Code :
import numpy as np
arr1 = np.random.random(size=(25, 25, 1))
arr2 = np.random.random(size=(25, 25, 1))
arr3 = np.random.random(size=(25, 25, 1))
print("Original arrays:")
print(arr1)
print(arr2)
print(arr3)
result = np.concatenate((arr1, arr2, arr3), axis=-1)
print("\nAfter concatenate:")
print(result)
Sample Output:
Original arrays:
[[[2.42789481e-01]
[4.92252795e-01]
[9.33448807e-01]
[7.25450297e-01]
[9.74093474e-02]
[5.68505405e-01]
[9.65681560e-01]
[7.94931731e-01]
[7.52893987e-01]
[5.43942380e-01]
[9.38096939e-01]
[8.38066653e-01]
[7.83185689e-01]
[4.22962615e-02]
[2.96843761e-01]
[9.50102088e-01]
[6.36912393e-01]
[3.75066692e-03]
[6.03600756e-01]
[4.22466907e-01]
[3.23442622e-01]
[7.23251484e-02]
[1.49598420e-01]
[5.45714254e-01]
[9.59122727e-01]]
.........
After concatenate:
[[[0.24278948 0.41363799 0.00761597]
[0.4922528 0.8033311 0.73833312]
[0.93344881 0.55224706 0.72935665]
...
[0.14959842 0.26294052 0.63326384]
[0.54571425 0.82177763 0.7713901 ]
[0.95912273 0.39791879 0.7461949 ]]
Write a NumPy program to combine last element with first element of two
given ndarraywith different shapes.
Sample Output:
Original arrays:
['PHP', 'JS', 'C++']
['Python', 'C#', 'NumPy']
After Combining:
['PHP' 'JS' 'C++Python' 'C#' 'NumPy']
Python Code :
import numpy as np
array1 = ['PHP', 'JS', 'C++']
array2 = ['Python', 'C#', 'NumPy']
print("Original arrays:")
print(array1)
print(array2)
result = np.r_[array1[:-1], [array1[-1] + array2[0]], array2[1:]]
print("\nAfter Combining:")
print(result)
Sample Output:
Original arrays:
['PHP', 'JS', 'C++']
['Python', 'C#', 'NumPy']
After Combining:
['PHP' 'JS' 'C++Python' 'C#' 'NumPy']