New FDS Lab
NO   LIST OF EXPERIMENTS
1    DOWNLOAD, INSTALL AND EXPLORE THE FEATURES OF NUMPY, SCIPY, JUPYTER, STATSMODELS AND PANDAS PACKAGES.
4    READING DATA FROM TEXT FILES, EXCEL AND THE WEB AND EXPLORING VARIOUS COMMANDS FOR DOING DESCRIPTIVE ANALYTICS ON THE IRIS DATA SET
4-A  READING DATA FROM EXCEL AND CSV FILE
5    USE THE DIABETES DATA SET FROM UCI AND PIMA INDIANS DIABETES DATA SET FOR PERFORMING THE FOLLOWING
5-A  UNIVARIATE ANALYSIS: FREQUENCY, MEAN, MEDIAN, MODE, VARIANCE, STANDARD DEVIATION, SKEWNESS AND KURTOSIS
5-B  BIVARIATE ANALYSIS: LINEAR AND LOGISTIC REGRESSION MODELING
6-D  HISTOGRAMS
AIM :
To download, install and explore the features of the NumPy, SciPy, Jupyter, Statsmodels and Pandas packages.
Step 2 : To download packages, use the pip download command (pip stands for "Pip Installs Packages")
Step 4 : To view the list of installed packages use pip list command
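The same information that pip list prints can also be checked from inside Python. The following is a minimal sketch using the standard importlib.metadata module (Python 3.8 or later). The helper name installed_versions and the PACKAGES list are illustrative assumptions, not part of the lab procedure.

```python
from importlib import metadata

# Distribution names of the packages explored in this experiment.
PACKAGES = ["numpy", "scipy", "jupyter", "statsmodels", "pandas"]

def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(PACKAGES).items():
    print(f"{name:12s} {version or 'not installed'}")
```

A package reported as "not installed" can then be added with pip install followed by the package name.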
PACKAGE DESCRIPTION
Python Libraries
There are many reasons why Python is popular among developers, and one of them is its amazingly large collection of libraries that users can work with. In this section we discuss the Python Standard Library and other libraries available for the Python programming language, such as SciPy and NumPy.
A module is a file with some Python code, and a package is a directory for sub packages and modules. A
Python library is a reusable chunk of code that you may want to include in your programs/ projects.
Python Standard Library
The Python Standard Library is a collection of modules accessible to every Python program. It simplifies the programming process and removes the need to rewrite commonly used code. Modules are used by importing them at the beginning of a script. Some of the most important Standard Library modules are:
time
sys
csv
math
random
pip
os
statistics
tkinter
socket
To display a list of all available modules, use the following command in the Python console: >>> help('modules')
NumPy (Numerical Python) is a linear algebra library for Python. It is a very important library: almost every data science or machine learning Python package, such as SciPy (Scientific Python), Matplotlib (plotting library) and scikit-learn, depends on it to a reasonable extent. NumPy is very useful for performing mathematical and logical operations on arrays. It provides an abundance of useful features for operations on n-dimensional arrays and matrices in Python.
NumPy deals with multi-dimensional arrays and matrices. On top of the arrays and matrices, NumPy supports
a large number of mathematical operations.
NumPy is memory efficient, meaning it can handle large amounts of data more easily than plain Python lists. Besides, NumPy is very convenient to work with, especially for matrix multiplication and reshaping. On top of that, NumPy is fast. In fact, scikit-learn uses NumPy arrays for its matrix computations in the back end, and TensorFlow accepts NumPy arrays as input.
Arrays in NumPy:
A NumPy array is a table of elements (usually numbers), all of the same type, indexed by a tuple of non-negative integers.
NumPy’s array class is called ndarray. It is also known by the alias array.
We use python numpy array instead of a list because of the below three reasons:
Less Memory
Fast
Convenient
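The "less memory" and "convenient" points can be checked with a small experiment. This is only a sketch: the element count of 1000 is an arbitrary choice, and exact byte counts for the list depend on the Python build.

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# A list stores pointers to separate int objects, so count both parts.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(v) for v in py_list)
# An ndarray stores the raw values in one contiguous buffer.
array_bytes = np_array.nbytes

print("list bytes :", list_bytes)
print("array bytes:", array_bytes)

# Convenient: one vectorised expression replaces an explicit loop.
doubled_list = [v * 2 for v in py_list]
doubled_array = np_array * 2
print(doubled_array[:5])
```

The array needs 8 bytes per int64 element, while the list pays for a pointer plus a full Python int object for every element.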
Here is a list of some useful NumPy functions and methods names ordered in categories.
Array Creation
arange, array, copy, empty, empty_like, eye, fromfile, fromfunction, identity, linspace, logspace, mgrid, ogrid,
ones, ones_like, r_, zeros, zeros_like
Conversions
ndarray.astype, atleast_1d, atleast_2d, atleast_3d, mat
Manipulations
array_split, column_stack, concatenate, diagonal, dsplit, dstack, hsplit, hstack, ndarray.item, newaxis, ravel,
repeat, reshape, resize, squeeze, swapaxes, take, transpose, vsplit, vstack
Questions
Ordering
Operations
choose, compress, cumprod, cumsum, inner, ndarray.fill, imag, prod, put, putmask, real, sum
Basic Statistics
SciPy is an open-source Python library used in mathematics, scientific computing, engineering, and technical computing. SciPy is pronounced "Sigh Pie."
SciPy contains a variety of sub-packages which help to solve the most common problems in scientific computation.
SciPy is among the most widely used scientific libraries, second only to the GNU Scientific Library for C/C++ or MATLAB's.
Numpy VS SciPy
Numpy:
1. Numpy is written in C and used for mathematical or numeric calculation.
3. Numpy is the most useful library for Data Science to perform basic calculations.
4. Numpy contains nothing but array data type which performs the most basic operation like sorting, shaping,
indexing, etc.
SciPy:
2. SciPy provides a fully featured linear algebra module, while NumPy contains only a few linear algebra features.
3. Most new Data Science features are available in Scipy rather than Numpy.
scipy.interpolate - Interpolation
scipy.stats - Statistics
The primary two components of pandas are the Series and DataFrame.
A Series is essentially a column, and a DataFrame is a multi-dimensional table made up of a collection
of Series.
DataFrames and Series are quite similar in that many operations that you can do with one you can do
with the other, such as filling in null values and calculating the mean.
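The relationship between the two can be sketched as follows. The fruit data is made up for illustration. The point is that fillna() and mean() behave the same way on a single Series and on a whole DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two Series that become the columns of a DataFrame.
apples = pd.Series([3.0, np.nan, 1.0], name="apples")
oranges = pd.Series([0.0, 4.0, np.nan], name="oranges")
purchases = pd.DataFrame({"apples": apples, "oranges": oranges})

# The same operations work on a single Series ...
apples_filled = apples.fillna(0)
print(apples_filled.mean())

# ... and on the whole DataFrame, column by column.
purchases_filled = purchases.fillna(0)
print(purchases_filled.mean())
```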
Handling duplicates
This dataset does not have duplicate rows, but it is always important to verify you aren't aggregating duplicate
rows.
To demonstrate, let's simply double up our movies DataFrame by concatenating it with itself (DataFrame.append() was removed in pandas 2.0, so pd.concat() is used here):
temp_df = pd.concat([movies_df, movies_df])
temp_df.shape
Out:
(2000, 11)
pd.concat() returns a new DataFrame without affecting the original. We capture this copy in temp_df so we aren't working with the real data.
Notice that calling .shape quickly proves our DataFrame rows have doubled. Now we can try dropping duplicates:
temp_df = temp_df.drop_duplicates()
temp_df.shape
Out:
(1000, 11)
Just like pd.concat(), the drop_duplicates() method returns a copy of your DataFrame, but this time with duplicates removed. Calling .shape confirms we're back to the 1000 rows of our original dataset.
It's a little verbose to keep assigning DataFrames to the same variable like in this example. For this reason,
pandas has the inplace keyword argument on many of its methods. Using inplace=True will modify the
DataFrame object in place:
temp_df.drop_duplicates(inplace=True)
Now our temp_df will have the transformed data automatically.
Another important argument for drop_duplicates() is keep, which has three possible options:
first: (default) Drop duplicates except for the first occurrence.
last: Drop duplicates except for the last occurrence.
False: Drop all duplicates.
Since we didn't define the keep argument in the previous example, it defaulted to first. This means that if two rows are the same, pandas will drop the second row and keep the first row. Using last has the opposite effect: the first row is dropped.
keep=False, on the other hand, will drop all duplicates. If two rows are the same then both will be dropped. Watch what happens to temp_df:
temp_df = pd.concat([movies_df, movies_df]) # make a new copy
temp_df.drop_duplicates(inplace=True, keep=False)
temp_df.shape
Out:
(0, 11)
Since all rows were duplicates, keep=False dropped them all resulting in zero rows being left over. If you're
wondering why you would want to do this, one reason is that it allows you to locate all duplicates in your
dataset. When conditional selections are shown below you'll see how to do that.
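The three keep options can be compared on a tiny frame. The films DataFrame below is a hypothetical stand-in for movies_df, and duplicated(keep=False) is shown as one possible way to locate every duplicated row.

```python
import pandas as pd

# Hypothetical stand-in for movies_df: the "Alien" row appears twice.
films = pd.DataFrame({
    "title": ["Alien", "Heat", "Alien"],
    "year": [1979, 1995, 1979],
})

print(len(films.drop_duplicates(keep="first")))  # keeps the first "Alien" row
print(len(films.drop_duplicates(keep="last")))   # keeps the last "Alien" row
print(len(films.drop_duplicates(keep=False)))    # drops both "Alien" rows

# duplicated(keep=False) marks every row that has a twin, which locates
# all duplicates instead of removing them.
all_dupes = films[films.duplicated(keep=False)]
print(all_dupes)
```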
Column cleanup
Many times datasets will have verbose column names with symbols, upper and lowercase words, spaces, and
typos. To make selecting data by column name easier we can spend a little time cleaning up their names.
Here's how to print the column names of our dataset:
movies_df.columns
Out:
Index(['Rank', 'Genre', 'Description', 'Director', 'Actors', 'Year',
'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)',
'Metascore'],
dtype='object')
Not only does .columns come in handy if you want to rename columns by allowing for simple copy and paste,
it's also useful if you need to understand why you are receiving a KeyError when selecting data by column.
We can use the .rename() method to rename certain or all columns via a dict. We don't want parentheses, so
let's rename those:
movies_df.rename(columns={
'Runtime (Minutes)': 'Runtime',
'Revenue (Millions)': 'Revenue_millions'
}, inplace=True)
movies_df.columns
Out:
Index(['Rank', 'Genre', 'Description', 'Director', 'Actors', 'Year', 'Runtime',
'Rating', 'Votes', 'Revenue_millions', 'Metascore'],
dtype='object')
Excellent. But what if we want to lowercase all names? Instead of using .rename() we could also set a list of
names to the columns like so:
movies_df.columns = ['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime',
'rating', 'votes', 'revenue_millions', 'metascore']
movies_df.columns
Out:
Index(['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime',
'rating', 'votes', 'revenue_millions', 'metascore'],
dtype='object')
But that's too much work. Instead of just renaming each column manually we can do a list comprehension:
movies_df.columns = [col.lower() for col in movies_df]
movies_df.columns
Out:
Index(['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime',
'rating', 'votes', 'revenue_millions', 'metascore'],
dtype='object')
List (and dict) comprehensions come in handy a lot when working with pandas and data in general.
It's a good idea to lowercase, remove special characters, and replace spaces with underscores if you'll be
working with a dataset for some time.
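The three clean-up steps can be combined in one helper. This is a sketch: the function name clean_column_name is hypothetical, and the rule for what counts as a special character (anything other than lowercase letters, digits and underscores) is an assumption.

```python
import re
import pandas as pd

def clean_column_name(name):
    """Lowercase, turn spaces into underscores, drop other special characters."""
    name = name.strip().lower().replace(" ", "_")
    return re.sub(r"[^0-9a-z_]", "", name)

# Hypothetical empty frame reusing column names from the dataset above.
df = pd.DataFrame(columns=["Runtime (Minutes)", "Revenue (Millions)", "Rank"])
df.columns = [clean_column_name(col) for col in df.columns]
print(list(df.columns))
```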
How to work with missing values
When exploring data, you'll most likely encounter missing or null values, which are essentially placeholders for non-existent values. Most commonly you'll see Python's None or NumPy's np.nan, each of which is handled differently in some situations.
There are two options in dealing with nulls:
1. Get rid of rows or columns with nulls
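Option 1 can be sketched with dropna(). The scores DataFrame is made-up data. isnull().sum() is used first to see how many nulls each column holds.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column.
scores = pd.DataFrame({
    "rating": [8.1, np.nan, 7.0],
    "votes": [1200.0, 300.0, np.nan],
})

print(scores.isnull().sum())          # count nulls per column

rows_dropped = scores.dropna()        # drop rows containing any null
cols_dropped = scores.dropna(axis=1)  # drop columns containing any null
print(rows_dropped.shape, cols_dropped.shape)
```

Dropping is only sensible when little data is lost. Here, dropping columns would remove everything, because each column contains a null.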
To make statistical inferences, it is often necessary to visualize your data, and Matplotlib is one such solution for Python users. It is a very powerful plotting library, useful for those working with Python and NumPy. The most used module of Matplotlib is Pyplot, which provides an interface like MATLAB's but uses Python and is open source.
General Concepts
2. Axes: It is what we generally think of as a plot. A Figure can contain many Axes. An Axes contains two or three (in the case of 3D) Axis objects. Each Axes has a title, an x-label and a y-label.
3. Axis: These are number-line-like objects that take care of generating the graph limits.
4. Artist: Everything one can see on the figure is an Artist, such as Text objects, Line2D objects and collection objects. Most Artists are tied to an Axes.
Matplotlib Library
Pyplot is a module of Matplotlib which provides simple functions to add plot elements like lines, images, text,
etc. to the current axes in the current figure.
plot(x-axis values, y-axis values): plots a simple line graph with x-axis values against y-axis values
show(): displays the graph
title("string"): sets the title of the plot as specified by the string
xlabel("string"): sets the label for the x-axis as specified by the string
ylabel("string"): sets the label for the y-axis as specified by the string
figure(): used to control figure-level attributes
subplot(nrows, ncols, index): adds a subplot to the current figure
suptitle("string"): adds a common title to the figure, specified by the string
subplots(nrows, ncols, figsize): a convenient way to create subplots in a single call. It returns a tuple of a figure and an array of axes.
set_title("string"): an axes-level method used to set the title of subplots in a figure
bar(categorical variables, values, color): used to create vertical bar graphs
barh(categorical variables, values, color): used to create horizontal bar graphs
legend(loc): used to add a legend to the graph
xticks(index, categorical variables): gets or sets the current tick locations and labels of the x-axis
pie(value, categorical variables): used to create a pie chart
hist(values, number of bins): used to create a histogram
xlim(start value, end value): used to set the limits of the x-axis
ylim(start value, end value): used to set the limits of the y-axis
scatter(x-axis values, y-axis values): plots a scatter plot with x-axis values against y-axis values
axes(): adds an axes to the current figure
set_xlabel("string"): an axes-level method used to set the x-label of the plot, specified as a string
set_ylabel("string"): an axes-level method used to set the y-label of the plot, specified as a string
scatter3D(x-axis values, y-axis values, z-axis values): plots a three-dimensional scatter plot of the x-, y- and z-axis values
plot3D(x-axis values, y-axis values, z-axis values): plots a three-dimensional line graph of the x-, y- and z-axis values
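Several of the functions listed above can be combined in one short script. The data values and titles below are made up for illustration. The Agg backend is selected only so that the sketch also runs on a machine without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in Jupyter
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

# subplots() returns the Figure and an array of Axes.
fig, (ax_line, ax_bar) = plt.subplots(1, 2, figsize=(8, 3))
fig.suptitle("Two views of the same data")

# Axes-level methods set the title and labels of one subplot.
ax_line.plot(x, y)
ax_line.set_title("Line")
ax_line.set_xlabel("x")
ax_line.set_ylabel("x squared")

ax_bar.bar(["a", "b", "c", "d"], y, color="green")
ax_bar.set_title("Bar")

fig.canvas.draw()  # render in memory; use plt.show() in an interactive session
```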
2(A) CREATE 1D ,2D,ARANGE,RANDOM
# 1D array using numpy
PROGRAM:
import numpy as np
arr = np.array([10,30,50,70,90])
print("arr=",arr)
OUTPUT:
arr= [10 30 50 70 90]
# 2D array using numpy
import numpy as np
arr = np.array([[20,40,60,80,100], [10,30,50,70,90]])
print("arr=",arr)
OUTPUT:
arr= [[ 20 40 60 80 100]
[ 10 30 50 70 90]]
#arange using numpy
PROGRAM:
import numpy as geek
print("A\n", geek.arange(4).reshape(2, 2), "\n")
print("A\n", geek.arange(3,9), "\n")
print("A\n", geek.arange(3, 19, 2), "\n")
OUTPUT:
A
[[0 1]
[2 3]]
A
[3 4 5 6 7 8]
A
[ 3 5 7 9 11 13 15 17]
#random using numpy
PROGRAM:
from numpy import random
a = random.randint(10)
print("a=",a)
OUTPUT:
a= 7
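The output of the program above changes on every run. When a repeatable result is needed, a seeded generator can be used, as in this sketch. The seed value 42 is an arbitrary assumption.

```python
import numpy as np

# Seed value 42 is an arbitrary choice for illustration.
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

# integers(low, high) draws from low (inclusive) to high (exclusive).
draws_a = rng_a.integers(0, 10, size=5)
draws_b = rng_b.integers(0, 10, size=5)

print(draws_a)
print(draws_b)  # same seed, same sequence
```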
2(b) CONCATENATE,APPEND, TO LIST ,ASARRAY USING NUMPY
# concatenate two arrays using numpy
PROGRAM:
import numpy as np
arr1 = np.array([10,20,30])
arr2 = np.array([40,50,60])
arr3 = np.hstack((arr1, arr2))
print("arr3=",arr3)
OUTPUT:
arr3= [10 20 30 40 50 60]
#append two arrays using numpy
PROGRAM:
import numpy as np
arr1=np.array([10,20,30])
arr2=np.array([40,50,60])
arr3=np.append(arr1,arr2)
print("Append arr3:",arr3)
OUTPUT:
Append arr3: [10 20 30 40 50 60]
#array to list using numpy
PROGRAM:
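import numpy as np
arr_1 = np.array([1, 2, 3])
print(f'NumPy Array:\n{arr_1}')
list_1 = arr_1.tolist()  # convert the array to a plain Python list
print(f'List: {list_1}')
OUTPUT:
NumPy Array:
[1 2 3]
List: [1, 2, 3]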
#asarray using numpy
PROGRAM:
import numpy as geek
my_list = [1, 3, 5, 7, 9]
print ("Input list : ", my_list)
out_arr = geek.asarray(my_list)
print ("output array from input list : ", out_arr)
OUTPUT:
Input list : [1, 3, 5, 7, 9]
output array from input list : [1 3 5 7 9]
2(C) MAXIMUM MINIMUM AND SORTING OF ARRAYS
#index of minimum values of rows and column
PROGRAM:
import numpy as np
s= np.array([[7,1,2,3],[89,2,4,5],[2,3,7,1],[3,4,1,17]])
print(s)
# argmin returns the position (index) of the minimum, not the value itself
z=np.argmin(s,axis=0)
x=np.argmin(s,axis=1)
print("index of min value in each column:",z)
print("index of min value in each row:",x)
OUTPUT:
[[ 7  1  2  3]
 [89  2  4  5]
 [ 2  3  7  1]
 [ 3  4  1 17]]
index of min value in each column: [2 0 3 2]
index of min value in each row: [1 1 3 2]
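#index of maximum values of rows and column
PROGRAM:
import numpy as np
s= np.array([[7,1,2,3],[89,2,4,5],[2,3,7,1],[3,4,1,17]])
print(s)
# argmax returns the position (index) of the maximum, not the value itself
z=np.argmax(s,axis=0)
x=np.argmax(s,axis=1)
print("index of max value in each column:",z)
print("index of max value in each row:",x)
OUTPUT:
[[ 7  1  2  3]
 [89  2  4  5]
 [ 2  3  7  1]
 [ 3  4  1 17]]
index of max value in each column: [1 3 2 3]
index of max value in each row: [0 0 2 3]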
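The arrays printed above are index positions. When the minimum and maximum values themselves are needed, min() and max() can be used on the same array, as in this sketch. The variable names are illustrative.

```python
import numpy as np

s = np.array([[7, 1, 2, 3],
              [89, 2, 4, 5],
              [2, 3, 7, 1],
              [3, 4, 1, 17]])

col_min = s.min(axis=0)   # smallest value in each column
row_min = s.min(axis=1)   # smallest value in each row
col_max = s.max(axis=0)   # largest value in each column
row_max = s.max(axis=1)   # largest value in each row

# Indexing with the argmin result recovers the same column minima.
col_idx = s.argmin(axis=0)
same = s[col_idx, np.arange(s.shape[1])]

print(col_min, row_min, col_max, row_max, same)
```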
# sort array
PROGRAM:
import numpy as n
s= n.array([[10,9,8],[7,6,5],[4,3,2],[1,0,5]])
print(s)
y=n.sort((s),axis=0)
z=n.sort((s),axis=1)
print("column \n",y)
print("row \n",z)
OUTPUT:
[[10 9 8]
[ 7 6 5]
[ 4 3 2]
[ 1 0 5]]
column
[[ 1 0 2]
[ 4 3 5]
[ 7 6 5]
[10 9 8]]
row
[[ 8 9 10]
[ 5 6 7]
[ 2 3 4]
[ 0 1 5]]
2(D) ARRAY OPERATORS
# Numpy Array Operators
PROGRAM:
import numpy as np
arr1=np.array([4,6,20])
arr2=np.array([2,3,4])
add1=np.add(arr1,arr2)
print("Add two arrays :",add1)
sub1=np.subtract(arr1,arr2)
print("Subtract two arrays:",sub1)
mul1=np.multiply(arr1,arr2)
print("Multiply two arrays:",mul1)
div1=np.divide(arr1,arr2)
print("Divide two arrays:",div1)
mod1=np.mod(arr1,arr2)
print("Mod of two arrays:",mod1)
pow1=np.power(arr1,arr2)
print("Power of two arrays:",pow1)
# on an integer array the reciprocal is truncated to an integer
reci1=np.reciprocal(arr1)
print("Reciprocal of array is:",reci1)
OUTPUT:
Add two arrays : [ 6  9 24]
Subtract two arrays: [ 2  3 16]
Multiply two arrays: [ 8 18 80]
Divide two arrays: [2. 2. 5.]
Mod of two arrays: [0 0 0]
Power of two arrays: [    16    216 160000]
Reciprocal of array is: [0 0 0]
#zeros and ones of arrays using numpy()
PROGRAM:
import numpy as np
arr1=np.zeros(5)
arr2=np.ones(5)
print("Zeros array is:",arr1)
print("Ones array is:",arr2)
OUTPUT:
Zeros array is: [0. 0. 0. 0. 0.]
Ones array is: [1. 1. 1. 1. 1.]
Ex.No:3
Working with Pandas DataFrames
PROGRAM:
import pandas as pd
import numpy as np
sub = [["AI",120000,"30days"],["Physics",220000,"40days"],["french",160000,"6months"]]
df=pd.DataFrame(sub)
print(df)
column_name=["courses","Fee","Duration"]
row_lable=["a","b","c"]
df=pd.DataFrame(sub,columns=column_name,index=row_lable)
print(df)
print(df.dtypes)
# Create DataFrame from Dictionary
sub ={'Courses':["ML","DATA ANALYSIS","3D PRINTING"],
'Fee':[120000,125000,126000],'Duration':['37day','60days','72days'],'Discount':[10000,20300,10500]
}
df = pd.DataFrame(sub)
print(df)
# Create DataFrame
sub = ({'Courses':["HTML","SQL","Python","C","Java","C++","C#","Java Script"],
'Fee':[122000,53000,36000,47000,80000,np.nan,45000,82000],
'Duration':['70day','60days','45days', '70days','40days',' ','71days','30days'],
'Discount':[11000,12300,11000,11200,21500,13100,11400,16100] })
df = pd.DataFrame(sub, index=['r0','r1','r2','r3','r4','r5','r6','r7'])
print(df)
# Create Series from array
import pandas as pd
import numpy as np
data = np.array(["SUNDAR",21,2004])
series = pd.Series(data)
print ("********Create Series from array ********\n", series)
OUTPUT:
0 1 2
0 AI 120000 30days
1 Physics 220000 40days
2 french 160000 6months
courses Fee Duration
a AI 120000 30days
b Physics 220000 40days
c french 160000 6months
courses object
Fee int64
Duration object
dtype: object
Courses Fee Duration Discount
0 ML 120000 37day 10000
1 DATA ANALYSIS 125000 60days 20300
2 3D PRINTING 126000 72days 10500
Program:
import pandas as pd
# read a CSV file and name its columns
df = pd.read_csv('example1.csv', names=["date","time","code","values"])
print(pd.unique(df["date"]))
# read an Excel file and name its columns
df = pd.read_excel('example.xlsx', names=["s.no","spr.no","name"])
print(df.to_string())
print("print 1st five line")
print(df.head(5))
print(df.tail(5))
# to print columns
print(df.columns)
# to print specific column
print(df[['name','spr.no']].to_string())
# to describe the column 'spr.no'
print(df['spr.no'].describe())
# to slice the rows
print(df.iloc[5:9])
# to read specific cell
print(df.iloc[9,2])
# to print row wise
for index, row in df.iterrows():
    print(row)
print("*******************")
# spr.no is a numeric column, so compare it with a number
print(df.loc[df['spr.no']==8507])
print(df.describe())
print("*************** sorting************")
print(df.sort_values(['name'], ascending=False).to_string())
OUTPUT
['date' '1.2.2022' '4.6.2022' '23.6.2022' '5.7.2022' '6.8.2022'
 '30.4.2022' '5.10.2022' '3.12.2022' '5.12.2022']
s.no spr.no name
0 1 8501 arun
1 2 8502 dhanush
2 3 8503 surya
3 4 8504 karthi
4 5 8505 raja
5 6 8506 mukesh
6 7 8507 mani
7 8 8508 ram
8 9 8509 vijay
9 10 8510 tamil
count 10.00000
mean 8505.50000
std 3.02765
min 8501.00000
25% 8503.25000
50% 8505.50000
75% 8507.75000
max 8510.00000
Name: spr.no, dtype: float64
s.no spr.no name
5 6 8506 mukesh
6 7 8507 mani
7 8 8508 ram
8 9 8509 vijay
tamil
s.no 1
spr.no 8501
name arun
Name: 0, dtype: object
s.no 2
spr.no 8502
name dhanush
Name: 1, dtype: object
s.no 3
spr.no 8503
name surya
Name: 2, dtype: object
s.no 4
spr.no 8504
name karthi
Name: 3, dtype: object
s.no 5
spr.no 8505
name raja
Name: 4, dtype: object
s.no 6
spr.no 8506
name mukesh
Name: 5, dtype: object
s.no 7
spr.no 8507
name mani
Name: 6, dtype: object
s.no 8
spr.no 8508
name ram
Name: 7, dtype: object
s.no 9
spr.no 8509
name vijay
Name: 8, dtype: object
s.no 10
spr.no 8510
name tamil
Name: 9, dtype: object
**********************************
s.no spr.no
count 10.00000 10.00000
mean 5.50000 8505.50000
std 3.02765 3.02765
min 1.00000 8501.00000
25% 3.25000 8503.25000
50% 5.50000 8505.50000
75% 7.75000 8507.75000
max 10.00000 8510.00000
*************** sorting************
s.no spr.no name
8 9 8509 vijay
9 10 8510 tamil
2 3 8503 surya
7 8 8508 ram
4 5 8505 raja
5 6 8506 mukesh
6 7 8507 mani
3 4 8504 karthi
1 2 8502 dhanush
0 1 8501 arun
EX.NO:4(B)
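PROGRAM:
import requests
from bs4 import BeautifulSoup
books = []
# The original loop header is missing; scraping pages 1 and 2 is an assumption.
for i in range(1, 3):
    url = f"https://books.toscrape.com/catalogue/page-{i}.html"
    response = requests.get(url)
    response = response.content
    soup = BeautifulSoup(response, 'html.parser')
    ol = soup.find('ol')
    articles = ol.find_all('article', class_='product_pod')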
OUTPUT:
Function
7 Age 768 non-null int64
8 Outcome 768 non-null int64
None
MODE:
[2 rows x 9 columns]
MEDIAN:
Pregnancies 3.0000
Glucose 117.0000
Insulin 30.5000
BMI 32.0000
DiabetesPedigreeFunction 0.3725
STANDARD DEVIATION:
Pregnancies 3.369578
Glucose 31.972618
BloodPressure 19.355807
SkinThickness 15.952218
Insulin 115.244002
BMI 7.884160
DiabetesPedigreeFunction 0.331329
Age 11.760232
Outcome
VARIANCE:
SKEWNESS:
Pregnancies 0.901674
Glucose 0.173754
BloodPressure -1.843608
SkinThickness 0.109372
Insulin 2.272251
BMI -0.428982
DiabetesPedigreeFunction 1.919911
KURTOSIS:
Pregnancies 0.159220
Glucose 0.640780
BloodPressure 5.180157
SkinThickness -0.520072
Insulin 7.214260
BMI 3.290443
DiabetesPedigreeFunction 5.594954
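Statistics of the kind listed above can be produced with pandas methods. The following is a sketch on a hypothetical five-row sample, not on the Pima Indians file. The column names mirror two of those in the output, and the values are chosen so that the results are easy to verify by hand.

```python
import pandas as pd

# Hypothetical miniature sample; not the real diabetes data.
df = pd.DataFrame({
    "Pregnancies": [1, 2, 2, 3, 6],
    "Glucose": [85, 90, 95, 100, 105],
})

print("FREQUENCY:\n", df["Pregnancies"].value_counts())
print("MEAN:\n", df.mean())
print("MEDIAN:\n", df.median())
print("MODE:\n", df.mode())
print("VARIANCE:\n", df.var())            # sample variance (ddof=1)
print("STANDARD DEVIATION:\n", df.std())  # square root of the variance
print("SKEWNESS:\n", df.skew())           # 0 for symmetric data
print("KURTOSIS:\n", df.kurt())           # excess kurtosis
```

The Glucose column is symmetric around 95, so its skewness is zero. Skewness can be negative, while variance never can.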
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
diabetes =datasets.load_diabetes()
print(diabetes.DESCR)
print(diabetes.feature_names)
X = diabetes.data
Y = diabetes.target
print(X.shape, Y.shape)
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(X,Y,test_size=0.3,random_state=99)
print(train_x.shape, train_y.shape)
from sklearn.linear_model import LinearRegression
le = LinearRegression()
le.fit(train_x,train_y)
y_pred = le.predict(test_x)
print(y_pred)
result = pd.DataFrame({'Actual': test_y, 'Predict' : y_pred})
print(result.to_string())
print('coefficient', le.coef_)
print('intercept', le.intercept_)
from sklearn.metrics import mean_squared_error, r2_score
print(mean_squared_error(test_y,y_pred))
print(r2_score(test_y,y_pred))
plt.scatter(test_y,y_pred, color = 'green')
plt.xticks(())
plt.yticks(())
plt.show()
OUTPUT:
Diabetes dataset
--
Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.
**Data Set Characteristics:**
:Number of Instances: 442
:Number of Attributes: First 10 columns are numeric predictive values
:Target: Column 11 is a quantitative measure of disease progression one year after baseline
:Attribute Information:
- age age in years
- sex
- bmi body mass index
- bp average blood pressure
- s1 tc, total serum cholesterol
- s2 ldl, low-density lipoproteins
- s3 hdl, high-density lipoproteins
- s4 tch, total cholesterol / HDL
- s5 ltg, possibly log of serum triglycerides level
- s6 glu, blood sugar level
Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation
times the square root of `n_samples` (i.e. the sum of squares of each column totals 1).
Source URL:
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
For more information see:
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression,"
Annals of Statistics (with discussion), 407-499.
(https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)
['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
(442, 10) (442,)
(309, 10) (309,)
[ 77.9991034 170.44712136 109.03660582 223.84307065 87.38430375
211.46934588 223.65994161 52.81888351 149.39008619 294.9893952
127.72956377 182.90399415 102.6375881 144.69460471 171.5213581
266.17922455 201.8877034 166.18209687 103.67082991 169.01866852
187.13785171 130.10161695 151.54411583 156.45841795 121.85578431
304.14149903 126.55107098 158.76430506 249.42202516 154.22310276
180.85528758 180.06909853 182.96467155 200.00132316 73.87031841
146.19357531 165.52522033 160.93215348 247.99028596 210.37177183
85.69905264 211.07539533 188.10590426 119.60434815 151.80766971
188.31163328 185.69251949 168.92581539 291.55993431 248.60092291
170.17035216 208.5515447 59.08071813 195.30432554 190.19923551
149.97489689 114.4835119 244.83078249 254.54782428 138.88949628
301.05425333 57.71483254 162.93256009 187.59937115 274.76599026
158.21099102 181.84517605 151.29485495 199.55472957 197.6359619
83.37352464 127.99031348 179.13616685 161.69195267 106.02168539
213.44405941 99.63818719 116.80068856 176.2415266 190.88134043
167.39910679 185.39753187 120.19155677 94.95663036 158.22520798
189.80858086 166.40135891 235.30476439 230.3206865 236.6916711
176.02450278 135.15234363 152.82007506 124.46215244 213.574207
42.64060377 200.72397971 64.05606538 160.31380961 61.10040585
161.21565469 169.64372088 198.39611428 63.92041427 129.65961388
171.88537096 175.28400448 126.28643791 163.35396162 92.43346316
156.32693509 174.65381761 229.30521464 128.22372219 124.33529553
64.52215982 217.54827171 230.22439161 190.07750782 161.96910026
123.80090397 161.67791209 132.53128138 116.96403575 266.68910399
275.9219297 86.84652207 60.94725563 202.34939075 144.66352278
82.4008924 185.44354389 101.56321284]
coefficient [ 40.66179183 -313.2934701 517.18345014 386.06561935 -604.64289258
275.31448303 3.92015562 172.39101348 661.95332633 62.26068823]
intercept 155.59120125597175
3157.972848565651
0.4545709909725648
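The last two numbers in the output are the mean squared error and the R2 score. What they measure can be sketched by computing both by hand with NumPy. The actual and predicted values below are hypothetical and are not taken from the diabetes data.

```python
import numpy as np

# Hypothetical actual and predicted values, chosen so the arithmetic is easy.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

residuals = y_true - y_pred
mse = np.mean(residuals ** 2)                    # mean squared error

ss_res = np.sum(residuals ** 2)                  # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
r2 = 1.0 - ss_res / ss_tot                       # coefficient of determination

print("MSE:", mse)
print("R2 :", r2)
```

An R2 of about 0.45, as in the output above, means the model explains a little under half of the variation in the target on the test data.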
Ex.no:5-C
MULTIPLE REGRESSION ANALYSIS
PROGRAM:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets, linear_model, metrics
df = pd.read_csv('C:\\Users\\Nellish\\Downloads\\diabetes.csv')
data = df[['Age','Glucose','BMI','BloodPressure','Pregnancies']]
target = df[['Outcome']]
print(data)
print(target)
print(df.isna().sum())
X = data
Y = target
#Split in to training and test data
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size = 0.3,random_state =101)
reg = linear_model.LinearRegression()
reg.fit(X_train, Y_train)
Y_predict = reg.predict(X_test)
print("Coefficient: ",reg.coef_)
print("Variance Score:{}".format(reg.score(X_test, Y_test)))
from sklearn.metrics import r2_score
print("r2:",r2_score(Y_test, Y_predict))
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(Y_test, Y_predict)
rmse = np.sqrt(mse)
print("RMSE:",rmse)
plt.figure(figsize =(12,10))
sns.pairplot(df[['Age','Glucose','BMI', 'BloodPressure','Pregnancies',
'Outcome']])
plt.show()
plt.figure(figsize =(8,6))
sns.lmplot(x="BMI", y="Glucose", data=df)
plt.show()
OUTPUT:
std = norm.fit(data)
plt.title(title)
plt.show()
OUTPUT:
EX.NO:6(B)
Apply and explore various plotting functions on UCI data sets.
DENSITY CONTOUR PLOTS
PROGRAM :
OUTPUT
EX.NO:6(C)
PROGRAM :
pd.read_csv(r"D:\DataScience\Iris.csv")
sns.set_style("whitegrid")
OUTPUT
OUTPUT:
EX.NO:6(D)
PROGRAM:
import pandas as pd
import numpy as np
import seaborn as sns
data = pd.read_csv(r"D:\DataScience\Iris.csv")
sns.set_style("whitegrid")
OUTPUT
#Histogram
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
data = pd.read_csv(r"D:\DataScience\Iris.csv")
sns.FacetGrid(data, hue="species", height=5) \
   .map(sns.histplot, "petal_length") \
   .add_legend()
plt.show()
sns.FacetGrid(data, hue="species", height=5) \
   .map(sns.histplot, "petal_width") \
   .add_legend()
plt.show()
# column name written to match the naming used in the two plots above
sns.FacetGrid(data, hue="species", height=5) \
   .map(sns.histplot, "sepal_length") \
   .add_legend()
plt.show()
OUTPUT:
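A histogram is a count of how many values fall into each bin. The counting step behind the plots can be sketched with np.histogram. The petal lengths below are hypothetical values, not read from the Iris file.

```python
import numpy as np

# Hypothetical petal lengths in cm, not taken from the Iris file.
petal_length = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 4.0, 5.0])

# Four equal-width bins over [1, 5]; the last bin includes its right edge.
counts, edges = np.histogram(petal_length, bins=4, range=(1.0, 5.0))

# Print a simple text histogram, one row per bin.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.1f} to {right:.1f}: {'#' * int(count)}")
```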
EX.NO:6(E)
PROGRAM :
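#3D-Line
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
ax = plt.axes(projection='3d')
z = np.linspace(0, 1, 100)
x = z * np.sin(20 * z)
y = z * np.cos(20 * z)
# draw the curve; the plotting call is missing from the original listing
ax.plot3D(x, y, z)
plt.show()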
OUTPUT:
Ex. No 7 :
Visualizing Geographic Data with Basemap
# Geographic data with Basemap
import numpy as np
lon_0=-100,)
m.etopo(scale=0.5, alpha=0.5)
# Map
plt.show()
OUTPUT:
# Coastal Lines
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
fig = plt.figure(figsize=(12,12))
m = Basemap()
m.drawcoastlines()
plt.show()
OUTPUT
#Orthographic Projection
import numpy as np
m.fillcontinents(color='tan',lake_color='lightblue')
OUTPUT :