
New FDS Lab


EX. NO    LIST OF EXPERIMENTS

1 DOWNLOAD, INSTALL AND EXPLORE THE FEATURES OF NUMPY, SCIPY, JUPYTER, STATSMODELS AND PANDAS PACKAGES

2 WORKING WITH NUMPY ARRAYS

2-A CREATE 1D, 2D, ARANGE, RANDOM

2-B CONCATENATE, APPEND, TOLIST, ASARRAY

2-C MAXIMUM, MINIMUM AND SORTING OF ARRAYS

2-D ARRAY OPERATORS

3 WORKING WITH PANDAS DATA FRAMES

4 READING DATA FROM TEXT FILES, EXCEL AND THE WEB AND EXPLORING VARIOUS COMMANDS FOR DOING DESCRIPTIVE ANALYTICS ON THE IRIS DATA SET

4-A READING DATA FROM EXCEL AND CSV FILE

4-B READ FROM TEXT FILE

4-C WEB SCRAPING (BEAUTIFUL SOUP)

5 USE THE DIABETES DATA SET FROM UCI AND PIMA INDIANS DIABETES DATA SET FOR PERFORMING THE FOLLOWING

5-A UNIVARIATE ANALYSIS: FREQUENCY, MEAN, MEDIAN, MODE, VARIANCE, STANDARD DEVIATION, SKEWNESS AND KURTOSIS
5-B BIVARIATE ANALYSIS: LINEAR AND LOGISTIC REGRESSION MODELING

5-C MULTIPLE REGRESSION ANALYSIS

6 APPLY AND EXPLORE VARIOUS PLOTTING FUNCTIONS ON UCI DATA SETS

6-A NORMAL CURVES

6-B DENSITY AND CONTOUR PLOTS

6-C CORRELATION AND SCATTER PLOTS

6-D HISTOGRAMS

6-E THREE DIMENSIONAL PLOTTING

7 VISUALIZING GEOGRAPHIC DATA WITH BASEMAP


Ex.No.1 EXPLORE THE FEATURES OF PYTHON PACKAGES

AIM :

To download, install and explore the features of the NumPy, SciPy, Jupyter, Statsmodels and Pandas packages.

PROCEDURE TO DOWNLOAD AND INSTALL PACKAGES:

Step 1 : Install Python IDLE or any Python IDE

Step 2 : To download packages, use the pip download command (pip: Pip Installs Packages)

Syntax : pip download package_name

Example : C:\Users\welcom> pip download numpy

Step 3: To install the downloaded packages use pip install command

Syntax : pip install package_name

Example : C:\Users\welcom> pip install numpy

Step 4 : To view the list of installed packages use pip list command

Syntax : pip list

Example : C:\Users\welcom>pip list

Step 5: To uninstall the packages use pip uninstall command

Syntax : pip uninstall package_name

Example : C:\Users\welcom>pip uninstall numpy
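After installation, a quick way to confirm that the packages work is to import them and print their versions; a minimal sketch (the package list here is illustrative):

import numpy
import scipy
import pandas

# Each package exposes a __version__ string once installed.
for pkg in (numpy, scipy, pandas):
    print(pkg.__name__, pkg.__version__)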

PACKAGE DESCRIPTION

Python Libraries
There are many reasons why Python is popular among developers, and one of them is its amazingly large collection of libraries. In this section we will discuss the Python Standard Library and various libraries offered for Python programming: SciPy, NumPy, etc.
A module is a file with some Python code, and a package is a directory of sub-packages and modules. A Python library is a reusable chunk of code that you may want to include in your programs/projects.
Python Standard Library

The Python Standard Library is a collection of script modules accessible to a Python program; it simplifies the programming process and removes the need to rewrite commonly used commands. The modules are used by 'calling/importing' them at the beginning of a script. Some of the most important Standard Library modules are:

• time
• sys
• csv
• math
• random
• pip
• os
• statistics
• tkinter
• socket

To display a list of all available modules, use the following command in the Python console: >>> help('modules')
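As a quick illustration, a short snippet calling three of the modules listed above:

import math
import random
import statistics

print(math.sqrt(16))                      # 4.0
print(random.randint(1, 6))               # a pseudo-random die roll
print(statistics.mean([10, 20, 30, 40]))  # 25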

LIST OF IMPORTANT PYTHON LIBRARIES

Python Libraries for Data Collection

• Beautiful Soup
• Scrapy
• Selenium

Python Libraries for Data Cleaning and Manipulation

• Pandas
• PyOD
• NumPy
• SciPy
• SpaCy

Python Libraries for Data Visualization

• Matplotlib
• Seaborn
• Bokeh

Python Libraries for Modeling

• Scikit-learn
• TensorFlow
• PyTorch

Python Libraries for Model Interpretability

• Lime
• H2O

Python Libraries for Audio Processing

• Librosa
• Madmom
• pyAudioAnalysis

Python Libraries for Image Processing

• OpenCV-Python
• Scikit-image
• Pillow

Python Libraries for Database

• Psycopg
• SQLAlchemy

Python Libraries for Deployment

• Flask

1.1 NUMPY LIBRARIES

NumPy (Numerical Python) is a linear algebra library in Python. It is a very important library on which almost every data science or machine learning Python package, such as SciPy (Scientific Python), Matplotlib (plotting library) and Scikit-learn, depends to a reasonable extent. NumPy is very useful for performing mathematical and logical operations on arrays. It provides an abundance of useful features for operations on n-dimensional arrays and matrices in Python.

PYTHON NUMPY LIBRARY


NumPy is an open-source library available in Python that aids mathematical, scientific, engineering, and data science programming. NumPy is an incredible library for performing mathematical and statistical operations, and it works perfectly well for multi-dimensional arrays and matrix multiplication.

NumPy deals with multi-dimensional arrays and matrices, and on top of them it supports a large number of mathematical operations.

NumPy is memory efficient, meaning it can handle vast amounts of data more easily than other libraries. Besides, NumPy is very convenient to work with, especially for matrix multiplication and reshaping. On top of that, NumPy is fast: in fact, TensorFlow and scikit-learn use NumPy arrays to compute matrix multiplications in the back end.

Arrays in NumPy:

• NumPy's main object is the homogeneous multidimensional array.
• It is a table of elements (usually numbers), all of the same type, indexed by a tuple of positive integers.
• In NumPy, dimensions are called axes. The number of axes is the rank.
• NumPy's array class is called ndarray. It is also known by the alias array.

We use a NumPy array instead of a list for three reasons, illustrated by the sketch that follows this list:

• Less memory
• Fast
• Convenient
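A small sketch illustrating the memory and convenience claims (exact byte counts vary by platform and Python version):

import sys
import numpy as np

py_list = list(range(1000))
np_arr = np.arange(1000)

# A list stores object references; an ndarray stores raw values contiguously.
print("list bytes :", sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list))
print("array bytes:", np_arr.nbytes)

# Convenience: elementwise arithmetic needs no explicit loop.
print((np_arr * 2)[:5])   # [0 2 4 6 8]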

FUNCTIONS AND METHODS OVERVIEW

Here is a list of some useful NumPy functions and methods names ordered in categories.

Array Creation

arange, array, copy, empty, empty_like, eye, fromfile, fromfunction, identity, linspace, logspace, mgrid, ogrid, ones, ones_like, r_, zeros, zeros_like

Conversions
ndarray.astype, atleast_1d, atleast_2d, atleast_3d, mat

Manipulations

array_split, column_stack, concatenate, diagonal, dsplit, dstack, hsplit, hstack, ndarray.item, newaxis, ravel,
repeat, reshape, resize, squeeze, swapaxes, take, transpose, vsplit, vstack

Questions

all, any, nonzero, where

Ordering

argmax, argmin, argsort, max, min, ptp, searchsorted, sort

Operations

choose, compress, cumprod, cumsum, inner, ndarray.fill, imag, prod, put, putmask, real, sum

Basic Statistics

cov, mean, std, var

Basic Linear Algebra

cross, dot, outer, linalg.svd, vdot
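A short sketch exercising a few of the names above, one from each category:

import numpy as np

a = np.linspace(0, 1, 6)          # array creation
b = np.arange(6).reshape(2, 3)    # manipulation
print(np.sort(a)[::-1])           # ordering: descending sort
print(b.cumsum(axis=1))           # operations: running sums along each row
print(b.mean(), b.std())          # basic statistics
print(np.dot(b, b.T))             # basic linear algebra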


1.2 SCIPY LIBRARY

• SciPy is an open-source Python-based library used in mathematics, scientific computing, engineering, and technical computing. SciPy is pronounced "Sigh Pie."

• SciPy contains a variety of sub-packages which help solve the most common issues related to scientific computation.

• SciPy is the most used scientific library, second only to the GNU Scientific Library for C/C++ and MATLAB.

• It is easy to use and understand, with fast computational power.

• It can operate on arrays from the NumPy library.

Numpy VS SciPy

Numpy:

1. NumPy is written in C and used for mathematical or numeric calculation.

2. It is faster than other Python libraries.

3. NumPy is the most useful library for Data Science to perform basic calculations.

4. NumPy contains nothing but the array data type, which performs the most basic operations like sorting, shaping, indexing, etc.

SciPy:

1. SciPy is built on top of NumPy.

2. SciPy is a fully-featured version of linear algebra, while NumPy contains only a few features.

3. Most new Data Science features are available in SciPy rather than NumPy.

A concise list of SciPy sub-modules is shown below, followed by a short example:

scipy.fftpack      Fast Fourier Transform
scipy.interpolate  Interpolation
scipy.integrate    Numerical Integration
scipy.linalg       Linear Algebra
scipy.io           File Input/Output
scipy.optimize     Optimization and Fits
scipy.stats        Statistics
scipy.signal       Signal Processing
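A minimal sketch touching two of these sub-modules, scipy.linalg and scipy.stats:

import numpy as np
from scipy import linalg, stats

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
print(linalg.solve(A, b))                        # solve Ax = b -> [2. 3.]
print(stats.describe([2, 4, 4, 4, 5, 5, 7, 9]))  # summary statistics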

1.3 PANDAS LIBRARY

The primary two components of pandas are the Series and the DataFrame.

• A Series is essentially a column, and a DataFrame is a multi-dimensional table made up of a collection of Series.
• DataFrames and Series are quite similar in that many operations you can do with one you can do with the other, such as filling in null values and calculating the mean, as the short sketch below shows.
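A minimal sketch of that relationship (the column names are illustrative):

import pandas as pd

ages = pd.Series([25, 32, 19], name="age")   # a Series is a single column
df = pd.DataFrame({"age": ages, "city": ["Chennai", "Pune", "Delhi"]})
print(ages.mean())                  # operations work on a Series...
print(df.mean(numeric_only=True))   # ...and column-wise on a DataFrame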

Reading data from CSVs


With CSV files all you need is a single line to load in the data:
df = pd.read_csv('purchases.csv')
print(df)

Let's load in the IMDB movies dataset to begin:


movies_df = pd.read_csv("IMDB-Movie-Data.csv", index_col="Title")
We're loading this dataset from a CSV and designating the movie titles to be our index.
Viewing your data
The first thing to do when opening a new dataset is print out a few rows to keep as a visual reference. We accomplish this with .head():
movies_df.head()
Another fast and useful attribute is .shape, which outputs just a tuple of (rows, columns):
movies_df.shape
Note that .shape has no parentheses and is a simple tuple of format (rows, columns). So we have 1000 rows and 11 columns in our movies DataFrame.
You'll use .shape a lot when cleaning and transforming data. For example, you might filter some rows based on some criteria and then want to know quickly how many rows were removed.

Handling duplicates
This dataset does not have duplicate rows, but it is always important to verify you aren't aggregating duplicate
rows.
To demonstrate, let's simply double up our movies DataFrame by appending it to itself:
temp_df = movies_df.append(movies_df)
temp_df.shape
Out:
(2000, 11)
Using append() will return a copy without affecting the original DataFrame. We are capturing this copy in temp_df so we aren't working with the real data.
Notice that calling .shape quickly proves our DataFrame rows have doubled. Now we can try dropping duplicates:
temp_df = temp_df.drop_duplicates()
temp_df.shape
Out:
(1000, 11)
Just like append(), the drop_duplicates() method will also return a copy of your DataFrame, but this time with
duplicates removed. Calling .shape confirms we're back to the 1000 rows of our original dataset.

It's a little verbose to keep assigning DataFrames to the same variable like in this example. For this reason,
pandas has the inplace keyword argument on many of its methods. Using inplace=True will modify the
DataFrame object in place:
temp_df.drop_duplicates(inplace=True)
Now our temp_df will have the transformed data automatically.
Another important argument for drop_duplicates() is keep, which has three possible options:
• first: (default) Drop duplicates except for the first occurrence.
• last: Drop duplicates except for the last occurrence.
• False: Drop all duplicates.

Since we didn't define the keep argument in the previous example, it defaulted to first. This means that if two rows are the same, pandas will drop the second row and keep the first row. Using last has the opposite effect: the first row is dropped.
keep=False, on the other hand, will drop all duplicates. If two rows are the same then both will be dropped. Watch what happens to temp_df:
temp_df = movies_df.append(movies_df) # make a new copy
temp_df.drop_duplicates(inplace=True, keep=False)
temp_df.shape
Out:
(0, 11)
Since all rows were duplicates, keep=False dropped them all resulting in zero rows being left over. If you're
wondering why you would want to do this, one reason is that it allows you to locate all duplicates in your
dataset. When conditional selections are shown below you'll see how to do that.
Column cleanup
Many times datasets will have verbose column names with symbols, upper and lowercase words, spaces, and
typos. To make selecting data by column name easier we can spend a little time cleaning up their names.
Here's how to print the column names of our dataset:
movies_df.columns
Out:
Index(['Rank', 'Genre', 'Description', 'Director', 'Actors', 'Year',
'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)',
'Metascore'],
dtype='object')
Not only does .columns come in handy if you want to rename columns by allowing for simple copy and paste, it's also useful if you need to understand why you are receiving a KeyError when selecting data by column.
We can use the .rename() method to rename certain or all columns via a dict. We don't want parentheses, so
let's rename those:
movies_df.rename(columns={
'Runtime (Minutes)': 'Runtime',
'Revenue (Millions)': 'Revenue_millions'
}, inplace=True)
movies_df.columns
Out:
Index(['Rank', 'Genre', 'Description', 'Director', 'Actors', 'Year', 'Runtime',
'Rating', 'Votes', 'Revenue_millions', 'Metascore'],
dtype='object')
Excellent. But what if we want to lowercase all names? Instead of using .rename() we could also set a list of
names to the columns like so:
movies_df.columns = ['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime',
'rating', 'votes', 'revenue_millions', 'metascore']
movies_df.columns
Out:
Index(['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime',
'rating', 'votes', 'revenue_millions', 'metascore'],
dtype='object')
But that's too much work. Instead of just renaming each column manually we can do a list comprehension:
movies_df.columns = [col.lower() for col in movies_df]
movies_df.columns

Out:
Index(['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime',
'rating', 'votes', 'revenue_millions', 'metascore'],
dtype='object')
list (and dict) comprehensions come in handy a lot when working with pandas and data in general.
It's a good idea to lowercase, remove special characters, and replace spaces with underscores if you'll be
working with a dataset for some time.
How to work with missing values

When exploring data, you'll most likely encounter missing or null values, which are essentially placeholders for non-existent values. Most commonly you'll see Python's None or NumPy's np.nan, each of which is handled differently in some situations.
There are two options in dealing with nulls:
1. Get rid of rows or columns with nulls

2. Replace nulls with non-null values, a technique known as imputation


Let's calculate the total number of nulls in each column of our dataset. The first step is to check which cells in our DataFrame are null:
movies_df.isnull()
Notice isnull() returns a DataFrame where each cell is either True or False depending on that cell's null status.
To count the number of nulls in each column we use an aggregate function for summing:
movies_df.isnull().sum()
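Imputation, mentioned above as the second option, fills those nulls with a substitute value; a minimal sketch using the column mean (this assumes the lowercased revenue_millions column name produced earlier in this section):

# Impute nulls in one column with that column's mean, a common baseline.
revenue = movies_df['revenue_millions']
movies_df['revenue_millions'] = revenue.fillna(revenue.mean())
print(movies_df.isnull().sum())   # revenue_millions should now report 0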
DataFrame slicing, selecting, extracting
Up until now we've focused on some basic summaries of our data. We've learned about simple column extraction using single brackets, and we imputed null values in a column using fillna(). Below are the other methods of slicing, selecting, and extracting you'll need to use constantly.
Let's look at working with columns first.
By column
You already saw how to extract a column using square brackets like this:
genre_col = movies_df['genre']
type(genre_col)

1.4 MATPLOTLIB LIBRARY

To make the necessary statistical inferences it becomes necessary to visualize your data, and Matplotlib is one such solution for Python users. It is a very powerful plotting library useful for those working with Python and NumPy. The most used module of Matplotlib is Pyplot, which provides a MATLAB-like interface but uses Python and is open source.
General Concepts

A Matplotlib figure can be categorized into several parts as below:


1. Figure: The whole figure, which may contain one or more axes (plots). You can think of a Figure as a canvas which contains plots.

2. Axes: What we generally think of as a plot. A Figure can contain many Axes. Each contains two (or three, in the case of 3D) Axis objects. Each Axes has a title, an x-label and a y-label.

3. Axis: Number-line-like objects that take care of generating the graph limits.

4. Artist: Everything one can see on the figure is an artist, like Text objects, Line2D objects, and collection objects. Most Artists are tied to Axes.

Matplotlib Library

Pyplot is a module of Matplotlib which provides simple functions to add plot elements like lines, images, text,
etc. to the current axes in the current figure.

Make a simple plot

import matplotlib.pyplot as plt
import numpy as np
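Completing the snippet above, a minimal working plot (the data are illustrative):

x = np.linspace(0, 2 * np.pi, 100)
plt.plot(x, np.sin(x))     # simple line graph
plt.title("Simple Plot")
plt.xlabel("x")
plt.ylabel("sin(x)")
plt.show()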

List of all the methods as they appeared.

• plot(x-axis values, y-axis values): plots a simple line graph with x-axis values against y-axis values
• show(): displays the graph
• title("string"): sets the title of the plot as specified by the string
• xlabel("string"): sets the label for the x-axis as specified by the string
• ylabel("string"): sets the label for the y-axis as specified by the string
• figure(): used to control figure-level attributes
• subplot(nrows, ncols, index): adds a subplot to the current figure
• suptitle("string"): adds a common title to the figure as specified by the string
• subplots(nrows, ncols, figsize): a convenient way to create subplots in a single call; it returns a tuple of a figure and an array of axes
• set_title("string"): an axes-level method used to set the title of subplots in a figure
• bar(categorical variables, values, color): used to create vertical bar graphs
• barh(categorical variables, values, color): used to create horizontal bar graphs
• legend(loc): used to make a legend for the graph
• xticks(index, categorical variables): gets or sets the current tick locations and labels of the x-axis
• pie(values, categorical variables): used to create a pie chart
• hist(values, number of bins): used to create a histogram
• xlim(start value, end value): used to set the limits of the x-axis
• ylim(start value, end value): used to set the limits of the y-axis
• scatter(x-axis values, y-axis values): plots a scatter plot with x-axis values against y-axis values
• axes(): adds axes to the current figure
• set_xlabel("string"): axes-level method used to set the x-label of the plot, specified as a string
• set_ylabel("string"): axes-level method used to set the y-label of the plot, specified as a string
• scatter3D(x-axis values, y-axis values): plots a three-dimensional scatter plot with x-axis values against y-axis values
• plot3D(x-axis values, y-axis values): plots a three-dimensional line graph with x-axis values against y-axis values

A short sketch combining several of these methods follows.
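The data values here are illustrative:

import matplotlib.pyplot as plt

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8, 3))
axes[0].bar(["A", "B", "C"], [3, 7, 5], color="steelblue")   # vertical bars
axes[0].set_title("Bar")
axes[1].scatter([1, 2, 3, 4], [2, 4, 1, 3])                  # scatter plot
axes[1].set_title("Scatter")
fig.suptitle("Subplots demo")
plt.show()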
2(A) CREATE 1D, 2D, ARANGE, RANDOM
# 1D array using numpy
PROGRAM:
import numpy as np
arr = np.array([10,30,50,70,90])
print(arr)
OUTPUT:
[10 30 50 70 90]
# 2D array using numpy
import numpy as np
arr = np.array([[20,40,60,80,100], [10,30,50,70,90]])
print("arr=",arr)
OUTPUT:
arr= [[ 20 40 60 80 100]
[ 10 30 50 70 90]]
#arange using numpy
PROGRAM:
import numpy as geek
print("A\n", geek.arange(4).reshape(2, 2), "\n")
print("A\n", geek.arange(3,9), "\n")
print("A\n", geek.arange(3, 19, 2), "\n")
OUTPUT:
A
[[0 1]
[2 3]]
A
[3 4 5 6 7 8]
A
[ 3 5 7 9 11 13 15 17]
#random using numpy
PROGRAM:
from numpy import random
a = random.randint(10)
print("a=",a)
OUTPUT:
a= 7
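randint above returns a single value; the same module can also fill whole arrays, as a brief sketch (shapes are illustrative):

from numpy import random

arr = random.randint(100, size=(2, 3))  # 2x3 array of ints in [0, 100)
print(arr)
print(random.rand(4))                   # four floats in [0.0, 1.0)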
2(B) CONCATENATE, APPEND, TOLIST, ASARRAY USING NUMPY
# concatenate two arrays using numpy
PROGRAM:
import numpy as np
arr1 = np.array([10,20,30])
arr2 = np.array([40,50,60])
arr3 = np.hstack((arr1, arr2))
print("arr3=",arr3)
OUTPUT:
arr3= [10 20 30 40 50 60]
#append two arrays using numpy
PROGRAM:
import numpy as np
arr1 = np.array([10,20,30])
arr2 = np.array([40,50,60])
arr3 = np.append(arr1, arr2)
print("Append arr3:", arr3)
OUTPUT:
Append arr3: [10 20 30 40 50 60]
#array to list using numpy
PROGRAM:
import numpy as np
arr_1 = np.array([1, 2, 3])
print(f'NumPy Array:\n{arr_1}')
list_1 = arr_1.tolist()          # convert the ndarray to a plain Python list
print(f'Python List: {list_1}')
OUTPUT:
NumPy Array:
[1 2 3]
Python List: [1, 2, 3]
#asarray using numpy
PROGRAM:
import numpy as geek
my_list = [1, 3, 5, 7, 9]
print ("Input list : ", my_list)
out_arr = geek.asarray(my_list)
print ("output array from input list : ", out_arr)
OUTPUT:
Input list : [1, 3, 5, 7, 9]
output array from input list : [1 3 5 7 9]
2(C) MAXIMUM, MINIMUM AND SORTING OF ARRAYS
#minimum values of rows and column
PROGRAM:
import numpy as np
s= np.array([[7,1,2,3],[89,2,4,5],[2,3,7,1],[3,4,1,17]])
print(s)
z=np.argmin((s),axis=0)
x=np.argmin((s),axis=1)
print("min values of column:",z)
print("min value of row:",x)
OUTPUT:
[[ 7 1 2 3]
[89 2 4 5]
[ 2 3 7 1]
[ 3 4 1 17]]
min values of column: [2 0 3 2]
min value of row: [1 1 3 2]

#maximum values of rows and columns


PROGRAM:
import numpy as np
s= np.array([[7,1,2,3],[89,2,4,5],[2,3,7,1],[3,4,1,17]])
print(s)
z=np.argmax((s),axis=0)
x=np.argmax((s),axis=1)
print("max values of column:",z)
print("max value of row:",x)

OUTPUT:
[[ 7 1 2 3]
[89 2 4 5]
[ 2 3 7 1]
[ 3 4 1 17]]
max values of column: [1 3 2 3]
max value of row: [0 0 2 3]
# sort array
PROGRAM:
import numpy as n
s= n.array([[10,9,8],[7,6,5],[4,3,2],[1,0,5]])
print(s)
y=n.sort((s),axis=0)
z=n.sort((s),axis=1)
print("column \n",y)
print("row \n",z)
OUTPUT:
[[10 9 8]
[ 7 6 5]
[ 4 3 2]
[ 1 0 5]]
column
[[ 1 0 2]
[ 4 3 5]
[ 7 6 5]
[10 9 8]]
row
[[ 8 9 10]
[ 5 6 7]
[ 2 3 4]
[ 0 1 5]]
2(D) ARRAY OPERATORS
# NumPy array operators
PROGRAM:
import numpy as np
arr1 = np.array([4,6,20])
arr2 = np.array([2,3,4])
add1 = np.add(arr1, arr2)
print("Add two arrays:", add1)
sub1 = np.subtract(arr1, arr2)
print("Subtract two arrays:", sub1)
mul1 = np.multiply(arr1, arr2)
print("Multiply two arrays:", mul1)
div1 = np.divide(arr1, arr2)
print("Divide two arrays:", div1)
mod1 = np.mod(arr1, arr2)
print("Mod of two arrays:", mod1)
pow1 = np.power(arr1, arr2)
print("Power of two arrays:", pow1)
reci1 = np.reciprocal(arr1)
print("Reciprocal of array is:", reci1)

OUTPUT:
Add two arrays: [ 6  9 24]
Subtract two arrays: [ 2  3 16]
Multiply two arrays: [ 8 18 80]
Divide two arrays: [2. 2. 5.]
Mod of two arrays: [0 0 0]
Power of two arrays: [    16    216 160000]
Reciprocal of array is: [0 0 0]
#zeros and ones arrays using numpy
PROGRAM:
import numpy as np
arr1 = np.zeros(5)
arr2 = np.ones(5)
print("Zeros array is:", arr1)
print("Ones array is:", arr2)
OUTPUT:
Zeros array is: [0. 0. 0. 0. 0.]
Ones array is: [1. 1. 1. 1. 1.]
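The operators above work elementwise on equal-shaped arrays; NumPy broadcasting also lets them combine an array with a scalar or a compatible smaller shape, as a brief sketch:

import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr + 10)                    # scalar broadcast over every element
print(arr * np.array([1, 0, 2]))   # row vector broadcast across both rows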
Ex.No:3
Working with Pandas DataFrames
PROGRAM:
import pandas as pd
import numpy as np

sub = [["AI",120000,"30days"], ["Physics",220000,"40days"], ["french",160000,"6months"]]
df = pd.DataFrame(sub)
print(df)
column_name = ["courses","Fee","Duration"]
row_label = ["a","b","c"]
df = pd.DataFrame(sub, columns=column_name, index=row_label)
print(df)
print(df.dtypes)
# Create DataFrame from Dictionary
sub = {'Courses':["ML","DATA ANALYSIS","3D PRINTING"],
       'Fee':[120000,125000,126000],
       'Duration':['37day','60days','72days'],
       'Discount':[10000,20300,10500]}
df = pd.DataFrame(sub)
print(df)
# Create DataFrame
sub = {'Courses':["HTML","SQL","Python","C","Java","C++","C#","Java Script"],
       'Fee':[122000,53000,36000,47000,80000,np.nan,45000,82000],
       'Duration':['70day','60days','45days','70days','40days',' ','71days','30days'],
       'Discount':[11000,12300,11000,11200,21500,13100,11400,16100]}
df = pd.DataFrame(sub, index=['r0','r1','r2','r3','r4','r5','r6','r7'])
print(df)
# Create Series from array
data = np.array(["SUNDAR",21,2004])
series = pd.Series(data)
print("********Create Series from array ********\n", series)

# Create a Dict from a input

data = {'Courses' : "HTML", 'Fees' : 440000, 'Duration' : "40days"}
s2 = pd.Series(data)
print("********Create a Dict from a input********\n", s2)
# Creating DataFrame from List
data = ['Python','C','Java Script']
s2 = pd.Series(data, index=['C1','C2','C3'])
print("********Creating DataFrame from List********\n", s2)
df = pd.DataFrame(data)
print(df)

OUTPUT:
0 1 2
0 AI 120000 30days
1 Physics 220000 40days
2 french 160000 6months
courses Fee Duration
a AI 120000 30days
b Physics 220000 40days
c french 160000 6months
courses     object
Fee          int64
Duration    object
dtype: object
Courses Fee Duration Discount
0 ML 120000 37day 10000
1 DATA ANALYSIS 125000 60days 20300
2 3D PRINTING 126000 72days 10500

        Courses       Fee Duration  Discount
r0         HTML  122000.0    70day     11000
r1          SQL   53000.0   60days     12300
r2       Python   36000.0   45days     11000
r3            C   47000.0   70days     11200
r4         Java   80000.0   40days     21500
r5          C++       NaN              13100
r6           C#   45000.0   71days     11400
r7  Java Script   82000.0   30days     16100
********Create Series from array ********
0    SUNDAR
1        21
2      2004
dtype: object
********Create a Dict from a input********
Courses       HTML
Fees        440000
Duration    40days
dtype: object
********Creating DataFrame from List********
C1         Python
C2              C
C3    Java Script
dtype: object
             0
0       Python
1            C
2  Java Script
EX. NO:4A
Reading Data From Text Files, Excel And The Web And Exploring Various Commands For Doing Descriptive
Analytics On The Iris Data Set
READING DATA FROM EXCEL AND CSV FILE

Program:
import pandas as pd

df = pd.read_csv('example1.csv', names=["date","time","code","values"])
print(pd.unique(df["date"]))
df = pd.read_excel('example.xlsx', names=["s.no","spr.no","name"])
print(df.to_string())
print("print 1st five line")
print(df.head(5))
print(df.tail(5))
# to print columns
print(df.columns)
# to print specific columns
print(df[['name','spr.no']].to_string())
# to describe the column spr.no
print(df['spr.no'].describe())
# to slice the rows
print(df.iloc[5:9])
# to read a specific cell
print(df.iloc[9,2])
# to print row wise
for index, row in df.iterrows():
    print(row)
print("*******************")
print(df.loc[df['spr.no'] == 8507])
print(df.describe())
print("*************** sorting************")
print(df.sort_values(['name'], ascending=False).to_string())

OUTPUT
['date' '1.2.2022' '4.6.2022' '23.6.2022' '5.7.2022' '6.8.2022'
 '30.4.2022' '5.10.2022' '3.12.2022' '5.12.2022']
s.no spr.no name
0 1 8501 arun
1 2 8502 dhanush
2 3 8503 surya
3 4 8504 karthi
4 5 8505 raja
5 6 8506 mukesh
6 7 8507 mani
7 8 8508 ram
8 9 8509 vijay
9 10 8510 tamil

Print 1st five line


s.no spr.no name
0 1 8501 arun
1 2 8502 dhanush
2 3 8503 surya
3 4 8504 karthi
4 5 8505 raja
s.no spr.no name
5 6 8506 mukesh
6 7 8507 mani
7 8 8508 ram
8 9 8509 vijay
9 10 8510 tamil

Index(['s.no', 'spr.no', 'name'], dtype='object')


Name spr.no
0 arun 8501
1 dhanush 8502
2 surya 8503
3 karthi 8504
4 raja 8505
5 mukesh 8506
6 mani 8507
7 ram 8508
8 vijay 8509
9 tamil 8510

count      10.00000
mean     8505.50000
std         3.02765
min      8501.00000
25%      8503.25000
50%      8505.50000
75%      8507.75000
max      8510.00000
Name: spr.no, dtype: float64
s.no spr.no name
5 6 8506 mukesh
6 7 8507 mani
7 8 8508 ram
8 9 8509 vijay

tamil
s.no 1
spr.no 8501
name arun
Name: 0, dtype: object
s.no 2
spr.no 8502
name dhanush
Name: 1, dtype: object
s.no 3
spr.no 8503
name surya
Name: 2, dtype: object
s.no 4
spr.no 8504
name karthi
Name: 3, dtype: object
s.no 5
spr.no 8505
name raja
Name: 4, dtype: object
s.no 6
spr.no 8506
name mukesh
Name: 5, dtype: object
s.no 7
spr.no 8507
name mani
Name: 6, dtype: object
s.no 8
spr.no 8508
name ram
Name: 7, dtype: object
s.no 9
spr.no 8509
name vijay
Name: 8, dtype: object
s.no 10
spr.no 8510
name tamil
Name: 9, dtype: object
**********************************
           s.no      spr.no
count  10.00000    10.00000
mean    5.50000  8505.50000
std     3.02765     3.02765
min     1.00000  8501.00000
25%     3.25000  8503.25000
50%     5.50000  8505.50000
75%     7.75000  8507.75000
max    10.00000  8510.00000

*************** Sorting************
s.no spr.no name
8 9 8509 vijay
9 10 8510 tamil
2 3 8503 surya
7 8 8508 ram
4 5 8505 raja
5 6 8506 mukesh
6 7 8507 mani
3 4 8504 karthi
1 2 8502 dhanush
0 1 8501 arun
EX.NO:4(B)

READ FROM TEXT FILE


PROGRAM:
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

f = open("C:\\Users\\Nellish\\Downloads\\sample.txt", "rt", encoding="UTF-8")
line = f.read()
print("File Contents:\n", line.strip())
f.close()

word_list = line.split()
counts = Counter(word_list)
labels, values = zip(*counts.items())
indSort = np.argsort(values)[::-1]
labels = np.array(labels)[indSort]
values = np.array(values)[indSort]
indexes = np.arange(len(labels))
bar_width = 0.25
plt.figure(figsize=(15, 5))
plt.bar(indexes, values)
plt.xticks(indexes + bar_width, labels)
plt.show()
OUTPUT:
File Contents:
Modern football originated in Britain in the 19th century. Though “folk football” had been played since medieval times
with varying rules, the game began to be standardized when it was taken up as a winter game at public schools.
Ex.No:4(c)
Web Scraping (Beautiful Soup)
Program:
import requests
from bs4 import BeautifulSoup
import pandas as pd

books = []

for i in range(1, 51):
    url = f"https://books.toscrape.com/catalogue/page-{i}.html"
    response = requests.get(url)
    response = response.content
    soup = BeautifulSoup(response, 'html.parser')
    ol = soup.find('ol')
    articles = ol.find_all('article', class_='product_pod')

    for article in articles:
        image = article.find('img')
        title = image.attrs['alt']
        star = article.find('p')
        star = star['class'][1]
        price = article.find('p', class_='price_color').text
        price = float(price[1:])
        books.append([title, price, star])

df = pd.DataFrame(books, columns=['Title', 'Price', 'Star Rating'])

df.to_csv('books.csv')
print(df)
OUTPUT:

                                                 Title  Price Star Rating
0                                 A Light in the Attic  51.77       Three
1                                   Tipping the Velvet  53.74         One
2                                           Soumission  50.10         One
3                                        Sharp Objects  47.82        Four
4                Sapiens: A Brief History of Humankind  54.23        Five
..                                                 ...    ...         ...
995  Alice in Wonderland (Alice's Adventures in Won...  55.53         One
996   Ajin: Demi-Human, Volume 1 (Ajin: Demi-Human #1)  57.06        Four
997  A Spy's Devotion (The Regency Spies of London #1)  16.97        Five
998                1st to Die (Women's Murder Club #1)  53.98         One
999                 1,000 Places to See Before You Die  26.08        Five

[1000 rows x 3 columns]

Process finished with exit code 0


EX.NO:5(A)

Univariate Analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation, Skewness and Kurtosis

PROGRAM:

import pandas as pd
import numpy as np
import statistics as st
import matplotlib.pyplot as plt

df = pd.read_csv("C:\\Users\\21cse8528\\Downloads\\diabetes.csv")
print(df.shape)
print(df.info())
print('MODE:\n', df.mode())
print('MEDIAN:\n', df.median())
print('STANDARD DEVIATION:\n', df.std())
print('SKEWNESS:\n', df.skew())
print('KURTOSIS:\n', df.kurtosis())
df.describe()

data_x = df.copy(deep=True)
data_x = data_x.drop(['Outcome'], axis=1)
plt.rcParams['figure.figsize'] = [40, 40]
data_x.hist(bins=40)
plt.show()

OUTPUT:

Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
None

MODE:

Pregnancies Glucose BloodPressure ... DiabetesPedigreeFunction Age Outcome

0 1.0 99 70.0 ... 0.254 22.0 0.0

1 NaN 100 NaN ... 0.258 NaN NaN

[2 rows x 9 columns]

MEDIAN:

Pregnancies                   3.0000
Glucose                     117.0000
BloodPressure                72.0000
SkinThickness                23.0000
Insulin                      30.5000
BMI                          32.0000
DiabetesPedigreeFunction      0.3725
Age                          29.0000
Outcome                       0.0000
dtype: float64

STANDARD DEVIATION:

Pregnancies                   3.369578
Glucose                      31.972618
BloodPressure                19.355807
SkinThickness                15.952218
Insulin                     115.244002
BMI                           7.884160
DiabetesPedigreeFunction      0.331329
Age                          11.760232
Outcome                       0.476951
dtype: float64

SKEWNESS:

Pregnancies                 0.901674
Glucose                     0.173754
BloodPressure              -1.843608
SkinThickness               0.109372
Insulin                     2.272251
BMI                        -0.428982
DiabetesPedigreeFunction    1.919911
Age                         1.129597
Outcome                     0.635017
dtype: float64

KURTOSIS:

Pregnancies                 0.159220
Glucose                     0.640780
BloodPressure               5.180157
SkinThickness              -0.520072
Insulin                     7.214260
BMI                         3.290443
DiabetesPedigreeFunction    5.594954
Age                         0.643159
Outcome                    -1.600930
dtype: float64


Ex.No:5(B)

BIVARIATE ANALYSIS: LINEAR AND LOGISTIC REGRESSION MODELING


PROGRAM:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
diabetes =datasets.load_diabetes()
print(diabetes.DESCR)
print(diabetes.feature_names)
X = diabetes.data
Y = diabetes.target
print(X.shape, Y.shape)
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(X,Y,test_size=0.3,random_state=99)
print(train_x.shape, train_y.shape)
from sklearn.linear_model import LinearRegression
le = LinearRegression()
le.fit(train_x,train_y)
y_pred = le.predict(test_x)
print(y_pred)
result = pd.DataFrame({'Actual': test_y, 'Predict' : y_pred})
print(result.to_string())
print('coefficient', le.coef_)
print('intercept', le.intercept_)
from sklearn.metrics import mean_squared_error, r2_score
print(mean_squared_error(test_y,y_pred))
print(r2_score(test_y,y_pred))
plt.scatter(test_y,y_pred, color = 'green')
plt.xticks(())
plt.yticks(())
plt.show()
OUTPUT:
Diabetes dataset
--
Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.
**Data Set Characteristics:**
:Number of Instances: 442
:Number of Attributes: First 10 columns are numeric predictive values
:Target: Column 11 is a quantitative measure of disease progression one year after baseline
:Attribute Information:
- age age in years
- sex
- bmi body mass index
- bp average blood pressure
- s1 tc, total serum cholesterol
- s2 ldl, low-density lipoproteins
- s3 hdl, high-density lipoproteins
- s4 tch, total cholesterol / HDL
- s5 ltg, possibly log of serum triglycerides level
- s6 glu, blood sugar level
Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation
times the square root of `n_samples` (i.e. the sum of squares of each column totals 1).
Source URL:
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
For more information see:
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression,"
Annals of Statistics (with discussion), 407-499.
(https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)
['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
(442, 10) (442,)
(309, 10) (309,)
[ 77.9991034 170.44712136 109.03660582 223.84307065 87.38430375
211.46934588 223.65994161 52.81888351 149.39008619 294.9893952
127.72956377 182.90399415 102.6375881 144.69460471 171.5213581
266.17922455 201.8877034 166.18209687 103.67082991 169.01866852
187.13785171 130.10161695 151.54411583 156.45841795 121.85578431
304.14149903 126.55107098 158.76430506 249.42202516 154.22310276
180.85528758 180.06909853 182.96467155 200.00132316 73.87031841
146.19357531 165.52522033 160.93215348 247.99028596 210.37177183
85.69905264 211.07539533 188.10590426 119.60434815 151.80766971
188.31163328 185.69251949 168.92581539 291.55993431 248.60092291
170.17035216 208.5515447 59.08071813 195.30432554 190.19923551
149.97489689 114.4835119 244.83078249 254.54782428 138.88949628
301.05425333 57.71483254 162.93256009 187.59937115 274.76599026
158.21099102 181.84517605 151.29485495 199.55472957 197.6359619
83.37352464 127.99031348 179.13616685 161.69195267 106.02168539
213.44405941 99.63818719 116.80068856 176.2415266 190.88134043
167.39910679 185.39753187 120.19155677 94.95663036 158.22520798
189.80858086 166.40135891 235.30476439 230.3206865 236.6916711
176.02450278 135.15234363 152.82007506 124.46215244 213.574207
42.64060377 200.72397971 64.05606538 160.31380961 61.10040585
161.21565469 169.64372088 198.39611428 63.92041427 129.65961388
171.88537096 175.28400448 126.28643791 163.35396162 92.43346316
156.32693509 174.65381761 229.30521464 128.22372219 124.33529553
64.52215982 217.54827171 230.22439161 190.07750782 161.96910026
123.80090397 161.67791209 132.53128138 116.96403575 266.68910399
275.9219297 86.84652207 60.94725563 202.34939075 144.66352278
82.4008924 185.44354389 101.56321284]
coefficient [ 40.66179183 -313.2934701 517.18345014 386.06561935 -604.64289258
275.31448303 3.92015562 172.39101348 661.95332633 62.26068823]
intercept 155.59120125597175
3157.972848565651
0.4545709909725648
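The listing above fits the linear model only; since the exercise title also names logistic regression, here is a minimal hedged sketch on the same data. It binarizes the continuous load_diabetes target at its median, an assumption made purely for illustration:

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np

diabetes = datasets.load_diabetes()
X = diabetes.data
# load_diabetes has a continuous target, so make a 0/1 label for classification.
y = (diabetes.target > np.median(diabetes.target)).astype(int)

train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=99)
clf = LogisticRegression(max_iter=1000)
clf.fit(train_x, train_y)
print("accuracy:", accuracy_score(test_y, clf.predict(test_x)))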
Ex.no:5-C
MULTIPLE REGRESSION ANALYSIS

PROGRAM:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets, linear_model, metrics

df = pd.read_csv('C:\\Users\\Nellish\\Downloads\\diabetes.csv')
data = df[['Age','Glucose','BMI','BloodPressure','Pregnancies']]
target = df[['Outcome']]
print(data)
print(target)
print(df.isna().sum())
X= data
Y = target
#Split in to training and test data
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size = 0.3,random_state =101)
reg = linear_model.LinearRegression()
reg.fit(X_train, Y_train)
Y_predict = reg.predict(X_test)
print("Coeeficient: ",reg.coef_)
print("Variance Score:{}".format(reg.score(X_test, Y_test)))
from sklearn.metrics import r2_score
print("r2:",r2_score(Y_test, Y_predict))
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(Y_test, Y_predict)
rmse = np.sqrt(mse)
print("RMSE:",rmse)
plt.figure(figsize =(12,10))
sns.pairplot(df[['Age','Glucose','BMI', 'BloodPressure','Pregnancies',
'Outcome']])
plt.show()
plt.figure(figsize =(8,6))
sns.lmplot(x="BMI", y="Glucose", data=df)
plt.show()

OUTPUT:

     Age  Glucose   BMI  BloodPressure  Pregnancies
0     50      148  33.6             72            6
1     31       85  26.6             66            1
2     32      183  23.3             64            8
3     21       89  28.1             66            1
4     33      137  43.1             40            0
..   ...      ...   ...            ...          ...
763   63      101  32.9             76           10
764   27      122  36.8             70            2
765   30      121  26.2             72            5
766   47      126  30.1             60            1
767   23       93  30.4             70            1

[768 rows x 5 columns]


     Outcome
0          1
1          0
2          1
3          0
4          1
..       ...
763        0
764        0
765        0
766        1
767        0

[768 rows x 1 columns]


Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
Coefficient: [[ 0.00362921  0.0057603   0.01359201 -0.0022797   0.01903324]]
Variance Score: 0.3119613858813982
r2: 0.3119613858813982
RMSE: 0.39580617490439185
Ex.No 6(A)
Apply and explore various plotting functions on UCI data sets.
NORMAL CURVES
Program:
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

# Generate some data for this demonstration

data = np.random.normal(170, 10, 250)

# Fit a normal distribution to the data:


# mean and standard deviation
mu, std = norm.fit(data)

# Plot the histogram.

plt.hist(data, bins=25, density=True, alpha=0.6, color='b')

# Plot the PDF.


xmin, xmax = plt.xlim()

x = np.linspace(xmin, xmax, 100)

p = norm.pdf(x, mu, std)

plt.plot(x, p, 'k', linewidth=2)

title = "Fit Values: {:.2f} and {:.2f}".format(mu, std)

plt.title(title)

plt.show()

OUTPUT:
EX.NO:6(B)
Apply and explore various plotting functions on UCI data sets.
DENSITY AND CONTOUR PLOTS

PROGRAM :

import numpy as np
import matplotlib.pyplot as plt

plt.style.use('seaborn-white')

def f(x, y):
    return np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)

x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)

plt.contour(X, Y, Z, colors='black')
plt.show()
plt.contour(X, Y, Z, 20, cmap='RdGy')
plt.show()
plt.contourf(X, Y, Z, 20, cmap='RdGy')
plt.colorbar()
plt.show()
OUTPUT:
OUTPUT:

Contour (Black & White)

Contour (Red-Grey)

Contourf (Red-Grey)

#IRIS DATASET DENSITY


import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
data = pd.read_csv(r"D:\DataScience\Iris.csv")
sns.jointplot(x="petal_length", y="petal_width", hue = 'species', data=data, kind="kde")
plt.show()

OUTPUT
EX.NO:6(C)

Apply and explore various plotting functions on UCI data sets.


SCATTER PLOT AND CORRELATION

PROGRAM :

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

data = pd.read_csv(r"D:\DataScience\Iris.csv")
sns.set_style("whitegrid")

sns.FacetGrid(data, hue="species", height=4) \
    .map(plt.scatter, "SepalLengthCm", "SepalWidthCm") \
    .add_legend()
plt.show()
sns.FacetGrid(data, hue="species", height=4) \
    .map(plt.scatter, "petal_length", "petal_width") \
    .add_legend()
plt.show()

OUTPUT

# Correlation Matrix Plot
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

names = ['preg','plas','pres','skin','test','mass','pedi','age','class']
data = pd.read_csv(r"D:\DataScience\pima-indians-diabetes.csv", names=names)
correlations = data.corr()
# plot correlation matrix
fig=plt.figure()
ax=fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks= np.arange(0,9,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names)
ax.set_yticklabels(names)
plt.show()

OUTPUT:
EX.NO:6(D)

Apply and explore various plotting functions on UCI data sets.


HISTOGRAMS

PROGRAM:

import pandas as pd

import seaborn as sns

import matplotlib.pyplot as plt

import numpy as np

data = pd.read_csv(r"D:\DataScience\Iris.csv")

sns.set_style("whitegrid")

sns.pairplot(data, hue="species", height=3)
plt.show()

OUTPUT

#Histogram

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

data = pd.read_csv(r"D:\DataScience\Iris.csv")
sns.FacetGrid(data, hue="species", height=5) \
    .map(sns.histplot, "petal_length") \
    .add_legend()
plt.show()
sns.FacetGrid(data, hue="species", height=5) \
    .map(sns.histplot, "petal_width") \
    .add_legend()
plt.show()
sns.FacetGrid(data, hue="species", height=5) \
    .map(sns.histplot, "SepalLengthCm") \
    .add_legend()
plt.show()
OUTPUT:
EX.NO:6(E)

Apply and explore various plotting functions on UCI data sets.


THREE DIMENSIONAL PLOTTING

PROGRAM :

#3D-Line

from mpl_toolkits import mplot3d
import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure()
ax = plt.axes(projection='3d')
z = np.linspace(0, 1, 100)
x = z * np.sin(20 * z)
y = z * np.cos(20 * z)
ax.plot3D(x, y, z, 'gray')
ax.set_title('3D line plot')
plt.show()

OUTPUT:
Ex. No 7 :
Visualizing Geographic Data with Basemap
# Geographic data with Basemap

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution=None,
            width=8E6, height=8E6, lat_0=45, lon_0=-100)
m.etopo(scale=0.5, alpha=0.5)

# Map (long, lat) to (x, y) for plotting
x, y = m(-122.3, 47.6)
plt.plot(x, y, 'ok', markersize=5)
plt.text(x, y, ' Seattle', fontsize=12)

plt.show()

OUTPUT:

# Coastal Lines

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

fig = plt.figure(figsize=(12, 12))
m = Basemap()
m.drawcoastlines()
plt.title("Coastal Lines", fontsize=20)
plt.show()

OUTPUT

#Orthographic Projection

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

fig = plt.figure(figsize=(10, 8))
m = Basemap(projection='ortho', lon_0=25, lat_0=10)
m.drawcoastlines()
m.fillcontinents(color='tan', lake_color='lightblue')
m.drawcountries(linewidth=1, linestyle='solid', color='k')
m.drawmapboundary(fill_color='lightblue')
plt.title("Orthographic Projection", fontsize=18)
plt.show()

OUTPUT :
