NumPy and Pandas
Pandas is a very popular library for working with data (its stated goal is to be the
most powerful and flexible open-source data-analysis tool, and in our opinion it
has reached that goal). DataFrames are at the center of pandas. A DataFrame is
structured like a table or spreadsheet: both the rows and the columns have
indexes, and you can perform operations on rows or columns separately.
NumPy
NumPy is an open-source Python library that enables efficient numerical
operations on large quantities of data. Several NumPy functions are commonly
used on pandas DataFrames. Most importantly for us, pandas is built on top of
NumPy, so NumPy is a dependency of pandas.
import numpy as np
import pandas as pd

# a 1-D NumPy array from a Python list
list1 = [1, 2, 3, 4]
array1 = np.array(list1)
print(array1)

# a 2-D NumPy array from a nested list
list2 = [[1, 2, 3], [4, 5, 6]]
array2 = np.array(list2)
print(array2)
toyPrices = [5, 8, 3, 6]
# print(toyPrices - 2) -- Not possible: subtracting a number from a plain list raises a TypeError
for i in range(len(toyPrices)):
    toyPrices[i] -= 2
print(toyPrices)  # [3, 6, 1, 4]
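With a NumPy array, the same elementwise subtraction works directly, no loop required:

```python
import numpy as np

# the subtraction is broadcast to every element of the array
toyPrices = np.array([5, 8, 3, 6])
print(toyPrices - 2)  # [3 6 1 4]
```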
# Create a Series using a NumPy array of ages with the default numerical indices
ages = np.array([13,25,19])
series1 = pd.Series(ages)
print(series1)
# Create a Series using a NumPy array of ages, but customize the indices to be
# the names that correspond to each age
ages = np.array([13,25,19])
series1 = pd.Series(ages,index=['Emma', 'Swetha', 'Serajh'])
print(series1)
dataf = pd.DataFrame([
    ['John Smith', '123 Main St', 34],
    ['Jane Doe', '456 Maple Ave', 28],
    ['Joe Schmo', '789 Broadway', 51]
], columns=['name', 'address', 'age'])
print(dataf)
Standard Deviation
import numpy

speed = [86, 87, 88, 86, 87, 85, 86]
x = numpy.std(speed)  # population standard deviation (ddof=0 by default)
print(x)
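As a sketch of what numpy.std computes: the standard deviation is the square root of the mean of the squared deviations from the mean.

```python
import numpy as np

speed = [86, 87, 88, 86, 87, 85, 86]
mean = sum(speed) / len(speed)
# population variance: mean squared deviation from the mean
variance = sum((v - mean) ** 2 for v in speed) / len(speed)
std = variance ** 0.5
print(std)  # matches np.std(speed)
```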
# create an empty list, read one integer from the user, and append it
lst = []
ele = int(input())
lst.append(ele)
print(lst)
Given an array containing the marks of students, the task is to calculate the percentile of each student. The
percentile is calculated according to the following rule:
The percentile of a student is the percentage of students having marks less than his/her marks.
Example:
import numpy as np

def percentile(StudentMarks, n):
    for i in range(n):
        # count how many students scored less than student i
        count = 0
        for j in range(n):
            if StudentMarks[j] < StudentMarks[i]:
                count += 1
        # percentile of student i, as a percentage of the other students
        percent = (count * 100) / (n - 1)
        print(StudentMarks[i], "->", percent)

# Driver Code
StudentMarks = [12, 60, 80, 71, 30]
n = len(StudentMarks)
percentile(StudentMarks, n)

# np.percentile answers the inverse question: the mark below which
# p percent of the scores fall
p = 50
x = np.percentile(StudentMarks, p)
print(x)
Histogram
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

N = 10000
x = stats.norm.rvs(size=N)  # N random samples from a standard normal

num_bins = 20
ax = plt.axes()
ax.hist(x, num_bins, density=True)  # normalized histogram of the samples

# overlay the theoretical probability density function
y = np.linspace(-4, 4, 1000)
ax.plot(y, stats.norm.pdf(y))
plt.show()
The pdf() method, found inside scipy.stats.norm, returns the probability density
function; it is used here to draw the smooth curve over the histogram.
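As a small sketch, stats.norm.pdf can also be evaluated directly at a point; at x = 0 the standard normal density equals 1/sqrt(2*pi):

```python
import numpy as np
from scipy import stats

# density of the standard normal at 0: 1 / sqrt(2 * pi) ≈ 0.3989
print(stats.norm.pdf(0))
print(1 / np.sqrt(2 * np.pi))  # same value, from the Gaussian formula
```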
The most important step of the analysis is performed even before the analysis begins. In textbook problems we assume that we
already know the variables between which we have to find correlation. In real life, however, there are many variables and therefore
many possible correlations. It is important to select variables between which a material relationship exists, one that, if understood,
will benefit the process.
Once the variables have been selected, relevant data needs to be collected to draw meaningful conclusions. This can be done by
applying the relevant design of experiments and coming up with measurements that will be used as inputs into the system. Like
every other process, this one follows the principle of GIGO (Garbage In, Garbage Out), and hence due care must be taken regarding
the input data.
Once the data has been collected, it must be mapped onto the X and Y axes of the Cartesian coordinate system. This gives the
viewer an idea of where the majority of the points are centred, where the outliers are, and why. Nowadays this does not have to be
done manually: software is available that will automatically fetch incoming data in real time and map it onto a scatter plot.
The next step is to statistically compute the line of best fit for the scattered data points. This means that a line is worked out
mathematically that passes through most of the points and is closest to the rest of them. This line has an equation that can be used
to predict the nature of the relationship between the variables. This step, too, formerly required complex calculations prone to
human error; now software can do it seamlessly and in no time.
The next step is to compute a correlation coefficient. This number, as stated earlier, is the best metric for understanding correlation
and lies between -1 and +1. The software will work it out and give you a correlation coefficient. Expensive software is not required;
something as simple as an Excel sheet can be used.
The last step is to interpret the number. Anything above 0.5 in absolute value suggests a strong correlation, 0 represents no
correlation, and -1 or +1 represents perfect correlation. Perfect correlation may be an indicator of causation, but it does not imply
causation all by itself.
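The steps above can be sketched with NumPy, using hypothetical x and y data (the values below are assumptions for illustration): np.corrcoef returns the correlation matrix, and np.polyfit with degree 1 gives the line of best fit.

```python
import numpy as np

# hypothetical paired measurements
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# correlation coefficient: off-diagonal entry of the 2x2 correlation matrix
r = np.corrcoef(x, y)[0, 1]
print(r)  # close to +1: a strong positive correlation

# line of best fit (slope and intercept)
slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)
```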
Scatter Plot Uses and Examples
Scatter plots convey a large volume of data at a glance. They are especially useful for spotting relationships, clusters, and outliers
in paired data.
import pandas as pd

# path to a local CSV file of insurance data (adjust as needed)
insurance_filepath = "D:/smoke.csv"
insurance_data = pd.read_csv(insurance_filepath)
print(insurance_data.head())  # first five rows
print(insurance_data)
Polynomial Regression
import numpy as nm
import pandas as pd
import matplotlib.pyplot as mtp
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# importing the dataset
data_set = pd.read_csv('D:/reg.csv')
x = data_set.iloc[:, 1:2].values  # position levels (assumed to be column 1)
y = data_set.iloc[:, 2].values    # salaries

# transform x into polynomial features, then fit a linear model on them
poly_regs = PolynomialFeatures(degree=2)
x_poly = poly_regs.fit_transform(x)
lin_regs = LinearRegression()
lin_regs.fit(x_poly, y)

mtp.scatter(x, y, color="blue")
mtp.plot(x, lin_regs.predict(x_poly), color="red")  # fitted curve
mtp.title("Polynomial Regression")
mtp.xlabel("Position Levels")
mtp.ylabel("Salary")
mtp.show()
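The same idea can be sketched without scikit-learn: NumPy's polyfit fits polynomial coefficients directly. The data below is synthetic (an assumption for illustration), generated from a known quadratic so the recovered coefficients can be checked:

```python
import numpy as np

# synthetic data following y = 2x^2 + 3x + 1
x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y = 2 * x**2 + 3 * x + 1

coeffs = np.polyfit(x, y, 2)  # highest-degree coefficient first
print(coeffs)                 # ≈ [2. 3. 1.]

# evaluate the fitted polynomial at a new point
print(np.polyval(coeffs, 6))  # ≈ 2*36 + 3*6 + 1 = 91
```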
from sklearn import datasets, tree
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt

iris = datasets.load_iris()
X = iris.data
y = iris.target

clf = DecisionTreeClassifier(random_state=1234)
model = clf.fit(X, y)

# text summary of the fitted tree
text_representation = tree.export_text(clf)
print(text_representation)

# graphical plot of the fitted tree
fig = plt.figure(figsize=(25, 20))
tee = tree.plot_tree(clf,
                     feature_names=iris.feature_names,
                     class_names=list(iris.target_names),
                     filled=True)
plt.show()
import mysql.connector

mydb = mysql.connector.connect(
    host="localhost",
    user="root",
    password="",
    database="stud"
)
mycursor = mydb.cursor()

# mycursor.execute("SHOW TABLES")
# for x in mycursor:
#     print(x)

# parameterized INSERT: %s placeholders keep the query safe from SQL injection
sql = "INSERT INTO customers (name, address) VALUES (%s, %s)"
val = ("John", "Highway 21")
mycursor.execute(sql, val)
mydb.commit()
import mysql.connector

mydb = mysql.connector.connect(
    host="localhost",
    user="root",
    password="",
    database="stud"
)
mycursor = mydb.cursor()

# read values from the user and insert them with a parameterized query
s_name = input('Student Name:')
s_add = input('Address:')
mycursor.execute("INSERT INTO customers (name, address) VALUES (%s, %s)",
                 (s_name, s_add))
mydb.commit()
print('Data entered successfully.')
mydb.close()