Numpy and Pandas

Pandas

Pandas is a very popular library for working with data (its goal is to be the
most powerful and flexible open-source data analysis tool, and in our opinion,
it has reached that goal). DataFrames are at the center of pandas. A DataFrame
is structured like a table or spreadsheet. The rows and the columns both have
indexes, and you can perform operations on rows or columns separately.

A pandas DataFrame can be easily changed and manipulated. Pandas has
helpful functions for handling missing data, performing operations on
columns and rows, and transforming data. If that wasn't enough, a lot of SQL
functions have counterparts in pandas, such as join, merge, filter by, and
group by. With all of these powerful tools, it should come as no surprise that
pandas is very popular among data scientists.

NumPy
NumPy is an open-source Python library that facilitates efficient numerical
operations on large quantities of data. A few NumPy functions are used directly
on pandas DataFrames. For us, the most important point about NumPy is that
pandas is built on top of it; in other words, NumPy is a dependency of pandas.

pip install numpy


pip install pandas

import numpy as np
import pandas as pd

list1 = [1,2,3,4]

array1 = np.array(list1)
print(array1)

list2 = [[1,2,3],[4,5,6]]
array2 = np.array(list2)
print(array2)
toyPrices = [5,8,3,6]
# print(toyPrices - 2) -- Not possible. Causes an error
for i in range(len(toyPrices)):
    toyPrices[i] -= 2
print(toyPrices)
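
For comparison, the same adjustment is a single vectorized operation once the prices are stored in a NumPy array; a minimal sketch using the same values:

# With a NumPy array, the subtraction is applied to every element at once
toyPricesArray = np.array([5, 8, 3, 6])
print(toyPricesArray - 2)   # [3 6 1 4]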

# Create a Series using a NumPy array of ages with the default numerical indices
ages = np.array([13,25,19])
series1 = pd.Series(ages)
print(series1)

# Create a Series using a NumPy array of ages but customize the indices to be
# the names that correspond to each age
ages = np.array([13,25,19])
series1 = pd.Series(ages,index=['Emma', 'Swetha', 'Serajh'])
print(series1)

dataf = pd.DataFrame([
    ['John Smith', '123 Main St', 34],
    ['Jane Doe', '456 Maple Ave', 28],
    ['Joe Schmo', '789 Broadway', 51]
    ],
    columns=['name', 'address', 'age'])
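
As a quick illustration of the row and column operations mentioned in the introduction, here is a minimal sketch using the dataf DataFrame defined above:

# Select a single column (a Series) and a single row by position
print(dataf['age'])
print(dataf.iloc[0])

# Filter rows by a condition and compute a column statistic
print(dataf[dataf['age'] > 30])
print(dataf['age'].mean())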

Standard Deviation

import numpy

speed = [86,87,88,86,87,85,86]

x = numpy.std(speed)

print(x)
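
By default, numpy.std computes the population standard deviation, i.e. the square root of the average squared deviation from the mean. A minimal sketch of the same calculation done by hand with plain Python:

# Manual computation matching numpy.std (population standard deviation)
speed = [86, 87, 88, 86, 87, 85, 86]
mean = sum(speed) / len(speed)
variance = sum((v - mean) ** 2 for v in speed) / len(speed)
print(variance ** 0.5)   # same value as numpy.std(speed)
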
# creating an empty list
lst = []

# number of elements as input
n = int(input("Enter number of elements : "))

# iterating till the range
for i in range(0, n):
    ele = int(input())
    # adding the element
    lst.append(ele)

print(lst)

Program to calculate Percentile of Students

Given an array containing marks of students, the task is to calculate the percentile of the students. The
percentile is calculated according to the following rule:

The percentile of a student is the percentage of the other students having marks less than his/her marks.
Examples:

Input: arr[] = { 12, 60, 80, 71, 30 }
Output: { 0, 50, 100, 75, 25 }

Explanation:
Percentile of Student 1 = 0/4*100 = 0 (out of the other 4 students, no one has marks less than this student)
Percentile of Student 2 = 2/4*100 = 50 (out of the other 4 students, 2 have marks less than this student)
Percentile of Student 3 = 4/4*100 = 100 (out of the other 4 students, all 4 have marks less than this student)
Percentile of Student 4 = 3/4*100 = 75 (out of the other 4 students, 3 have marks less than this student)
Percentile of Student 5 = 1/4*100 = 25 (out of the other 4 students, only 1 has marks less than this student)

import numpy as np

# Function to calculate the percentile
def percentile(arr, n):
    i, j = 0, 0
    count, percent = 0, 0

    # Start of the loop that calculates the percentile
    while i < n:
        count = 0
        j = 0
        while j < n:
            # Comparing the marks of student i
            # with all other students
            if (arr[i] > arr[j]):
                count += 1
            j += 1
        percent = (count * 100) // (n - 1)
        print("Percentile of Student", i + 1, "=", percent)
        i += 1

# Driver Code
# StudentMarks = [12, 60, 80, 71, 30]
StudentMarks = list(map(int, input("Enter Students Marks:-").strip().split()))
n = len(StudentMarks)
percentile(StudentMarks, n)

p = int(input("Enter Percentile of Marks:"))
# np.percentile returns the mark value at the p-th percentile
# (linear interpolation by default), not a student's rank
x = np.percentile(StudentMarks, p)
print(p, "th percentile of marks =", x)

Histogram

A histogram is a graphical representation of a set of data points arranged in a user-defined range.
Similar to a bar chart, a histogram compresses a series of data into easy-to-interpret visual objects
by grouping multiple data points into logical areas or containers (bins).
To draw this we will use:
• The random.normal() method for generating normally distributed data. It has three parameters:
  • loc – (mean) where the top of the bell is located.
  • scale – (standard deviation) how spread out you want the distribution to be.
  • size – shape of the returned array.
• The function hist() in the Pyplot module of the Matplotlib library is used to draw histograms. It has parameters like:
  • data: This parameter is a data sequence.
  • bins: This parameter is optional and contains integers, sequences or strings.
  • density: This parameter is optional and contains a Boolean value.
  • alpha: A value between 0 and 1 which represents the transparency of each histogram. The smaller the value of alpha, the more transparent the histogram.

import numpy as np
import matplotlib.pyplot as plt

# Generating some random data for an example
data = np.random.normal(170, 10, 250)

# Plotting the histogram
plt.hist(data, bins=25, density=True, alpha=0.6, color='b')
plt.show()

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Draw N random samples from the standard normal distribution
N = 10000
x = stats.norm.rvs(size=N)

num_bins = 20
ax = plt.axes()
ax.set_title('Histogram of Normal Distribution')
plt.hist(x, bins=num_bins, facecolor='blue', alpha=0.5)

# Overlay the normal probability density function, scaled to the histogram counts
y = np.linspace(-4, 4, 1000)
bin_width = (x.max() - x.min()) / num_bins
plt.plot(y, stats.norm.pdf(y) * N * bin_width)
plt.show()

The pdf() method, which is available in scipy.stats.norm, is used here to display the
probability density function.

Importance of Standard Normal Distribution


With a standard normal distribution, we may use a standard score or z-score to compute the
probability that a given value comes from a specific distribution, or to compare values from
multiple distributions. A normal distribution can be easily converted to a standard normal
distribution with the help of the following formula:

z = (x − μ) / σ

where μ is the mean and σ is the standard deviation of the distribution.
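
A minimal sketch of this conversion, using made-up example numbers (a mean of 170, a standard deviation of 10, and a value of 185), with scipy.stats used to read the probability from the standard normal CDF:

import numpy as np
from scipy import stats

# Made-up example: a distribution with mean 170 and standard deviation 10
mu, sigma = 170, 10
x = 185

# Convert to a standard score (z-score): z = (x - mu) / sigma
z = (x - mu) / sigma
print("z-score:", z)

# Probability that a value from this distribution is below x,
# read off the standard normal CDF
print("P(X < x):", stats.norm.cdf(z))
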
How to Draw a Scatter Plot
When scatter plots were first used, drawing them was a complex task that often required statisticians and scientists. More
recently, most of the drawing has been automated. However, a fair amount of human involvement and human judgement is still
required. The steps are as follows:

Step 1: Decide the Two Variables

The most important step of the analysis is performed even before the analysis begins. In textbook problems we assume that we
know the variables between which we have to find the correlation. In real life, however, there are many variables and therefore
many possible correlations. It is important to select variables between which a material relationship exists, one that, if understood,
will benefit the process.

Step 2: Collect Data

Once the variables have been selected, relevant data needs to be collected to draw meaningful conclusions about them. This
can be done by applying the relevant design of experiments and coming up with measurements that will be used as inputs into the
system. This process, like every other, follows the principle of GIGO (Garbage In, Garbage Out), and hence due care must be taken
regarding the input data.

Step 3: Map the Data

Once the data has been collected, it must be mapped on the X and Y axes of the Cartesian coordinate system. This gives the
viewer an idea of where the majority of the points are centred, where the outliers are, and why this is the case. Nowadays, this
does not have to be done manually. There is software available that will automatically fetch incoming data in real time and map it
onto a scatter plot.

Step 4: The Line of Best Fit

The next step is to statistically compute the line of best fit for the scattered data points. This means that a line is worked out
mathematically that passes through most of the points and is closest to the rest of them. This line has an equation that can be used
to predict the nature of the relationship between the variables. This step, too, formerly required complex calculations prone to human
error. Now software can do it seamlessly and in no time.

Step 5: Come Up With an Exact Number

The next step is to come up with a correlation coefficient. This number, as stated earlier, is the best metric for understanding
correlation and lies between -1 and +1. The software will work out and give you a correlation coefficient. Expensive software is not
required; something as simple as an Excel sheet can be used.
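
As a rough illustration of Steps 4 and 5, here is a minimal sketch with made-up data points that computes the line of best fit with np.polyfit and the correlation coefficient with np.corrcoef:

import numpy as np

# Made-up paired observations for the two chosen variables
x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

# Step 4: line of best fit (degree-1 polynomial), y = slope*x + intercept
slope, intercept = np.polyfit(x, y, 1)
print("Line of best fit: y =", slope, "* x +", intercept)

# Step 5: correlation coefficient, a value between -1 and +1
r = np.corrcoef(x, y)[0, 1]
print("Correlation coefficient:", r)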

Step 6: Interpret the Number

The last step is to interpret the number. Anything above +0.5 or below -0.5 suggests a strong correlation, 0 represents no correlation,
and -1 or +1 represents perfect correlation. Perfect correlation may be an indicator of causation; however, it does not, by itself,
imply causation.
Scatter Plot Uses and Examples
Scatter plots instantly report a large volume of data. They are beneficial in the following situations:

• A large set of data points is given
• Each data point comprises a pair of values
• The given data is in numeric form

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Path of the file to read
insurance_filepath = "D:/smoke.csv"

# Read the file into a variable insurance_data
insurance_data = pd.read_csv(insurance_filepath)
insurance_data.head()
print(insurance_data)

sns.scatterplot(x=insurance_data['bmi'], y=insurance_data['charges'], hue=insurance_data['smoker'])
# Display the plot (needed when running as a script)
plt.show()

Polynomial Regression

import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# Importing the dataset
data_set = pd.read_csv('D:/reg.csv')

# Extracting the independent and dependent variables
x = data_set.iloc[:, 1:2].values
y = data_set.iloc[:, 2].values

# Fitting Linear Regression to the polynomial features of the dataset
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

poly_regs = PolynomialFeatures(degree=2)
x_poly = poly_regs.fit_transform(x)
lin_regs = LinearRegression()
lin_regs.fit(x_poly, y)

# Visualising the result for Polynomial Regression
mtp.scatter(x, y, color="blue")
mtp.plot(x, lin_regs.predict(poly_regs.fit_transform(x)), color="red")
mtp.title("Polynomial Regression")
mtp.xlabel("Position Levels")
mtp.ylabel("Salary")
mtp.show()

Draw a Decision Tree

from matplotlib import pyplot as plt
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# Prepare the data
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Fit a decision tree classifier
clf = DecisionTreeClassifier(random_state=1234)
model = clf.fit(X, y)

# Print a text representation of the tree
text_representation = tree.export_text(clf)
print(text_representation)

# Plot the tree
fig = plt.figure(figsize=(25, 20))
tee = tree.plot_tree(clf,
                     feature_names=iris.feature_names,
                     class_names=iris.target_names,
                     filled=True)
plt.show()

Create and insert values using MySQL

import mysql.connector

mydb = mysql.connector.connect(
    host="localhost",
    user="root",
    password="",
    database="stud"
)

mycursor = mydb.cursor()

# mycursor.execute("CREATE TABLE customers (name VARCHAR(255), address VARCHAR(255))")

# mycursor.execute("SHOW TABLES")
# mycursor = mydb.cursor()
# for x in mycursor:
#     print(x)

sql = "INSERT INTO customers (name, address) VALUES (%s, %s)"
val = ("John", "Highway 21")
mycursor.execute(sql, val)

mydb.commit()

print(mycursor.rowcount, "record inserted.")

import mysql.connector

mydb = mysql.connector.connect(
    host="localhost",
    user="root",
    password="",
    database="stud"
)
mycursor = mydb.cursor()

s_name = input('Student Name:')
s_add = input('Address:')
mycursor.execute("INSERT INTO customers (name, address) VALUES (%s, %s)",
                 (s_name, s_add))
mydb.commit()
print('Data entered successfully.')
mydb.close()
