Data Science Python Cheat Sheet

This document provides a cheat sheet on key Python libraries and packages for data science, including:
- NumPy for numerical computing and arrays, with functions for data manipulation, aggregation, random number generation, etc.
- Matplotlib for data visualization and plotting graphs. It allows customizing plots with labels, titles, legends.
- Pandas for data structures and data analysis, with capabilities like loading data, selecting columns, handling missing data, merging/concatenating tables.
- Scikit-learn for machine learning tasks like classification, regression, clustering and model tuning. It supports algorithms like linear regression and preprocessing tools.
- Seaborn for statistical data visualization built on top of Matplotlib, with visualizations like joint

Data Science

Python cheat sheet


numpy | pandas | matplotlib | scipy | seaborn | tensorflow
Numpy:

Numpy data types:

- i : integer
- b : boolean
- u : unsigned integer
- f : float
- c : complex float
- m : timedelta
- M : datetime
- O : object
- S : string
- U : unicode string
- V : fixed chunk of memory for other type (void)

np.array() : create an array

np.array([[1, 2, 3], [4, 5, 6]]) : 2D array
np.array([1, 2, 3, 4], ndmin=<n>) : create an array with at least n dimensions
np.full((3, 3), 4, dtype=int) : array filled with a single value
np.eye(5, 5) : identity matrix
np.zeros((n, m)) and np.ones((n, m)) : arrays of zeros and ones (the shape is passed as a tuple)
np.arange(start, stop, step) : array of values in arithmetic progression (fixed step, stop excluded)
np.linspace(start, stop, num) : array of num evenly spaced values from start to stop
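
The creation functions above can be sketched as follows. This is a minimal illustration; the variable names and the specific shapes are chosen here for the example and are not part of the cheat sheet.

```python
import numpy as np

# Shapes are passed as one tuple, not as separate arguments.
zeros = np.zeros((2, 3))
filled = np.full((3, 3), 4, dtype=int)
identity = np.eye(3)

# arange takes a step; the stop value is excluded.
stepped = np.arange(0, 10, 2)

# linspace takes a count; both endpoints are included by default.
spaced = np.linspace(0.0, 1.0, 5)

print(zeros.shape, filled.sum(), stepped, spaced)
```

A rule of thumb: reach for arange when the step matters and for linspace when the number of points matters.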

arr.ndim : check no. of dimensions
arr.dtype : check data type
arr.astype('i') : change data type

np.mean(arr) : get mean
np.median(arr) : get median
scipy.stats.mode(arr) : get mode (NumPy has no np.mode; needs from scipy import stats)
np.std(arr) : get standard deviation
np.var(arr) : get variance
np.percentile(arr, q) : get the value below which q percent of the data falls
np.sum(arr) : get the sum of all elements
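
A short sketch of these statistics on a small array. NumPy itself does not ship a mode function, so the example derives the mode from np.unique; that workaround is an assumption of this sketch, not something the cheat sheet prescribes.

```python
import numpy as np

arr = np.array([1, 2, 2, 3, 4, 7, 9])

mean = np.mean(arr)
median = np.median(arr)
p50 = np.percentile(arr, 50)  # the 50th percentile equals the median

# Mode without SciPy: count each distinct value, keep the most frequent one.
values, counts = np.unique(arr, return_counts=True)
mode = values[np.argmax(counts)]

print(mean, median, p50, mode)
```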

arr.copy() : make an independent copy (later changes to either array do not affect the other)
arr.view() : make a view that shares data with the original (changes show up in both)
arr.shape : get the dimensions
arr.reshape() : reshape the array
np.transpose(arr) : transpose of matrix
arr.reshape(n, m, -1) : -1 lets NumPy work out that dimension
arr.reshape(-1) : convert any array to 1D
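
The difference between a copy and a view is easiest to see by changing the original afterwards. A minimal sketch, with illustrative variable names:

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6])
copied = arr.copy()   # owns its own data
viewed = arr.view()   # shares data with arr

arr[0] = 99           # only the view sees this change

grid = arr.reshape(2, -1)   # -1 is inferred as 3
flat = grid.reshape(-1)     # back to one dimension

print(copied[0], viewed[0], grid.shape, flat.shape)
```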

for x in np.nditer(arr) : iterate over each element of the array
for idx, x in np.ndenumerate(arr) : iterate over each element together with its index

np.concatenate((arr1, arr2), axis=1) : join two arrays along an existing axis
np.stack((arr1, arr2), axis=1) : stack two arrays along a new axis (also check hstack, vstack and dstack)

np.array_split(arr, n, axis=1) : divide an array into n arrays (also check hsplit, vsplit and dsplit)
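
Joining and stacking differ in the shape of the result, which the sketch below makes visible. The arrays are made up for the example.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

# concatenate keeps the number of dimensions and grows axis 1.
joined = np.concatenate((a, b), axis=1)

# stack inserts a new axis at position 1.
stacked = np.stack((a, b), axis=1)

# array_split tolerates sizes that do not divide evenly.
parts = np.array_split(np.arange(7), 3)

print(joined.shape, stacked.shape, [len(p) for p in parts])
```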

np.where(arr == n) : find all indices where the value equals n
np.searchsorted(arr, 7, side='right') : return the index where a value should be inserted to keep the array sorted
np.sort(arr) : return a sorted copy of the array
np.around(arr, dec) : round the array elements to the nearest value with dec decimal places (also check np.floor(arr) and np.ceil(arr))
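
A small sketch of searching and rounding. Note that np.around rounds to the nearest value, while np.floor and np.ceil always round down and up. The sample values are chosen for the example.

```python
import numpy as np

arr = np.array([1, 3, 5, 7, 7, 9])  # already sorted, as searchsorted expects

hits = np.where(arr == 7)[0]                   # positions holding the value 7
left = np.searchsorted(arr, 7)                 # insert before existing 7s
right = np.searchsorted(arr, 7, side="right")  # insert after existing 7s

vals = np.array([1.24, 1.26])
nearest = np.around(vals, 1)
down = np.floor(vals)
up = np.ceil(vals)

print(hits, left, right, nearest, down, up)
```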

np.random.uniform(a, b, n) : draw n values uniformly from the interval [a, b)
np.random.normal(a, b, n) : draw n values from a normal distribution with mean a and standard deviation b
np.random.randint(n) : random integer from 0 to n-1
np.random.rand(n) : n random floats in [0, 1)
np.random.randint(n, size=m) : m random integers from 0 to n-1
np.random.choice([3, 5, 7, 9], size=(3, 5)) : array of the given size, each entry picked from the list
np.random.choice([3, 5, 7, 9], p=[0.1, 0.3, 0.6, 0.0], size=100) : same, with a probability for each value
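
The sketch below draws from each generator with a fixed seed so the run is repeatable. The seed value and sample sizes are arbitrary choices for the example. For normal(), the two leading arguments are the mean and the standard deviation, so the samples are not confined to an interval.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only to make the run repeatable

uniform = np.random.uniform(2, 5, 1000)   # values in [2, 5)
normal = np.random.normal(10, 2, 1000)    # mean 10, standard deviation 2
ints = np.random.randint(5, size=100)     # integers 0..4
picks = np.random.choice([3, 5, 7, 9], p=[0.1, 0.3, 0.6, 0.0], size=100)

print(uniform.min(), uniform.max(), normal.mean(), set(picks.tolist()))
```

A value with probability 0.0 (here 9) is never drawn.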


Matplotlib

import matplotlib.pyplot as plt
plt.plot(x, y, label="linename", marker='', color='', linestyle='', lw=) : plot a line graph from arrays x and y
plt.plot(x1, y1, x2, y2) : plot a graph of 2 lines
plt.xlabel("labname") and plt.ylabel("labname") : put labels on the axes
plt.title("title") : give title
plt.xlim(n, m) and plt.ylim(n, m) : set lower and upper limit of an axis
plt.axis([x0, xn, y0, yn]) : set lower and upper limits of both axes at once
plt.show() : show graph
plt.grid() : show grid lines in graph (use grid(axis='x') or grid(axis='y') for one direction)
plt.subplot(x, y, z) : plot many graphs (x: no. of rows, y: no. of columns, z: graph no.)
plt.legend(title="title", loc="") : show legend in the graph
plt.figure(figsize=(n, m)) : set graph size

plt.hist(x, n, color='') : plot histogram with n bins
plt.scatter(x, y, color="", s=, alpha=, cmap=) : plot points at coordinates x, y (s is the marker size; cmap applies when c= is given an array of values)
plt.bar(x, y, color=, width=, label='') : plot bar graph with categories x and bar heights y (use barh() for horizontal)
plt.xticks(x, labels, rotation=) : show the entries of array labels at positions x on the x axis
plt.pie(x, labels=arr, startangle=, explode=, shadow=, colors=, autopct='%.2f%%') : plot pie chart
plt.boxplot(arr) : get a boxplot of the data
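
A minimal sketch that combines several of the calls above into one figure with two subplots. The "Agg" backend is an assumption made here so the script also runs without a display; drop that line and call plt.show() for interactive use. The data is made up.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: off-screen backend, no window needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 50)

fig = plt.figure(figsize=(8, 3))

plt.subplot(1, 2, 1)                    # 1 row, 2 columns, first graph
plt.plot(x, np.sin(x), label="sin", color="tab:blue", linestyle="--")
plt.xlabel("x")
plt.ylabel("sin(x)")
plt.title("Line plot")
plt.legend(loc="upper right")
plt.grid(axis="y")

plt.subplot(1, 2, 2)                    # second graph
plt.bar(["a", "b", "c"], [3, 5, 2], color="tab:orange")
plt.xticks(rotation=45)
plt.title("Bar chart")

fig.canvas.draw()                       # render without opening a window
```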
Seaborn

import seaborn as sb
sb.displot(tab['col'], kde=True, bins=n) : get a distribution plot (histogram with a KDE curve) with n bins
sb.kdeplot(arr, fill=True) : display a kernel density estimation plot (older seaborn versions use shade=True)
sb.boxplot(y='colname', data=tab) : get a boxplot
sb.violinplot(y='colname', data=tab) : get a violin plot (plot to check symmetry)
sb.jointplot(x='colname1', y='colname2', data=tab, kind='', height=n, color='') : get a joint plot
sb.lmplot(x='colname1', y='colname2', data=tab, hue='colname3', col='colname4', row='colname5') : get a scatter plot with a fitted regression line between 2 columns
sb.countplot(x='colname1', data=tab, hue='colname2') : get a count bar graph of one column split by another
sb.boxplot(x='colname1', y='colname2', data=tab) : get a boxplot between 2 columns
sb.heatmap(tab.corr()) : get a heat map of all correlations in the table
sb.pairplot(tab) : get a scatter plot of all pairs of columns in the table
Pandas

pd.Series([values], index=[values]) : create a series, like np.array() but with labelled entries like a dictionary
arr.index : display index
arr.values : display values
arr.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True) : count occurrences of each value
arr.apply(np.function) : apply a NumPy function to each value of a pandas Series

to use with datetime values (the .dt accessor is built into pandas and needs a Series of datetime dtype)

arr.dt.year : returns the year of the date time.
arr.dt.month : returns the month of the date time.
arr.dt.day : returns the day of the date time.
arr.dt.quarter : returns the quarter of the date time.
arr.dt.dayofweek : returns the day of the week as a number (Monday = 0).
arr.dt.day_name() : returns the name of the day of the week (replaces the removed weekday_name).
arr.dt.hour
arr.dt.second
arr.dt.minute
arr.dt.dayofyear
arr.dt.date
arr.dt.time
arr.dt.freq
arr.dt.isocalendar().week : returns the ISO week number (replaces the removed weekofyear).
pd.to_datetime('date and time') : parse a date and time into a proper datetime
pd.date_range(start='date1', end='date2', freq='D') : generate the dates between two dates
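
A short sketch of the datetime accessor on a two-element Series. The dates are arbitrary. It uses day_name() and isocalendar().week, which are the calls available in current pandas releases for the weekday name and week number.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2024-01-15", "2024-07-04"]))

years = dates.dt.year.tolist()
quarters = dates.dt.quarter.tolist()
names = dates.dt.day_name().tolist()
weeks = dates.dt.isocalendar().week.tolist()

# One entry per day, both endpoints included.
days = pd.date_range(start="2024-01-01", end="2024-01-05", freq="D")

print(years, quarters, names, weeks, len(days))
```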

arr.dropna() : drop empty values
arr.fillna(value) : fill empty values with the given value
arr.map({'value': 'newvalue'}) : replace/map values (values missing from the dictionary become NaN)
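
The three calls above on a small Series with one missing entry. The data and the replacement strings are made up for the example.

```python
import pandas as pd

s = pd.Series(["a", None, "b", "a"])

dropped = s.dropna()             # the missing entry is removed
filled = s.fillna("missing")     # the missing entry is replaced

# map() looks every value up in the dictionary; values without a key become NaN.
mapped = filled.map({"a": "alpha", "b": "beta"})

print(len(dropped), filled.tolist(), mapped.tolist())
```

When only some values should change and the rest should stay as they are, replace() is usually the safer choice than map().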

pd.read_table(r"addr", sep="separator", names=<array of col names>, usecols=[name or index], dtype={"colname": type}) : read a table from the given address
tab.head(n) and tab.tail(n) : give the first or last n rows

tab['colname'] : access the given column (can be used to create a new column, like tab['newcol'] = tab.col1 + ',' + tab.col2)
tab.colname : access the given column
tab.shape : no. of rows and columns

tab.colname.unique() : unique values in the column
tab.colname.nunique() : number of unique values in the column
tab.columns : all columns

tab.describe() : summary statistics of the numeric columns (use describe(include='all') to cover every column)
tab.info() : describe table (column names, non-null counts, dtypes)
tab.dtypes : data type of each column
tab.col.value_counts(normalize=True, dropna=True) : share of rows for each value of the column (normalize=False gives counts)
tab.plot(kind='', x="namexplot", y="nameyplot") : plot on graph (use with matplotlib)

tab['colname'].mean() : get mean (also look for mode(), median(), max())
tab.corr() : pairwise correlation between columns
tab.colname.agg(['functions']) : run several aggregate functions on a column (like mean, median, count)
tab.col.quantile(0.25) : get a quantile (0.25 and 0.75 give the quartiles)
tab.groupby('colname') : group entries by column name
pd.crosstab(tab.col1, tab.col2, margins=True) : get a cross table between 2 columns, with row and column totals
tab1.merge(tab2, on='colname', how='') : merge two tables on a common column (how = left, right, inner, outer)
pd.concat([tab1, tab2], axis=0) : concatenate 2 tables
tab.set_index('colname') : set a column as index
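
A small sketch of grouping, merging and cross-tabulating. The two tables (sales and regions) and their column names are hypothetical and exist only for this example.

```python
import pandas as pd

sales = pd.DataFrame({
    "city": ["A", "A", "B", "B"],
    "item": ["x", "y", "x", "x"],
    "amount": [10, 20, 30, 40],
})
regions = pd.DataFrame({
    "city": ["A", "B", "C"],
    "region": ["north", "south", "east"],
})

# Several aggregates per group in one call.
totals = sales.groupby("city")["amount"].agg(["sum", "mean"])

# inner keeps only matching cities; outer also keeps city C, which has no sales.
inner = sales.merge(regions, on="city", how="inner")
outer = sales.merge(regions, on="city", how="outer")

# margins=True adds an "All" row and column with totals.
table = pd.crosstab(sales.city, sales.item, margins=True)

print(totals, len(inner), len(outer), table, sep="\n")
```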

pd.concat([tab, pd.DataFrame([dict])], ignore_index=True) : append a row given as a dictionary (tab.append() was removed in pandas 2.0)
tab.loc[len(tab.index)] = list(dict[0].values()) : add a row at the end of tab
tab.iloc[n] = list(dict[0].values()) : replace row n using a dictionary
tab.assign(**{'newcol': arr}) : add a new column using an array
tab.rename(columns={"oldname": "newname"}, inplace=True) : change column names
tab.colname.astype(type) : change type of values

tab.drop('colname', axis=1, inplace=True) : drop a column
tab.drop_duplicates(keep='first', inplace=True) : drop duplicate rows
tab.drop(index, axis=0, inplace=True) : drop the row at the given index
tab.dropna(subset=["col"], how="value") : drop rows with empty data in the given columns (how: 'any' or 'all')
tab['col'].interpolate(method='linear') : fill missing values with estimated values
tab.duplicated() : check for duplicate rows
tab.isna() : check for null values (also check tab.notna(), tab.isnull() and tab.notnull())
tab['colname'].fillna(value, inplace=True) : replace null values with the given value
tab.replace(to_replace="val1", value="val2") : replace val1 with val2 anywhere in the table
tab['col'].replace({'val': 'valnew'}) : replace a value in a column
tab.colname.sort_values(ascending=True).head() : sort values
tab.sort_values('colname', ascending=True).head() : sort by a column
tab[tab.colname >= 200].colname : select rows that meet a condition

tab.loc[index or row condition, 'colname'] : show the selected column for the chosen rows
tab.iloc[[index of rows], [index of columns]] : show selected rows and columns by position
tab.col.isin([array of values]) : select rows with specific values
tab.col.str.method() : use a string method on each value of a column
pd.get_dummies(tab.col, prefix="value", prefix_sep='_', drop_first=False) : one indicator column per category (drop_first=True encodes k categories with k-1 columns)
tab = pd.get_dummies(tab, columns=['col']) : same as one hot encoder
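
The effect of drop_first is easiest to see by comparing the resulting column names. A minimal sketch with a made-up colour column:

```python
import pandas as pd

tab = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One indicator column per category: k = 3 columns.
full = pd.get_dummies(tab, columns=["color"])

# drop_first=True removes the first category (alphabetically), leaving k - 1 columns.
reduced = pd.get_dummies(tab.color, prefix="color", drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

A row with zeros in every remaining column then stands for the dropped category.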
Scipy

from scipy.stats import bernoulli


bernoulli.rvs(p=0.5, size=n) : draw n Bernoulli samples (0 or 1) with success probability p
Sklearn

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(tab, arr, test_size=0.3, random_state=0) : split features tab and target arr into training data (70%) and testing data (30%)

classifier.score(x_train, y_train) : get the accuracy score of a fitted classifier

from sklearn.feature_selection import VarianceThreshold


sel = VarianceThreshold(threshold=0.01)
sel.fit_transform(tab) : keep only the columns whose variance is above the given threshold

from sklearn.linear_model import LinearRegression


lr = LinearRegression()
lr.fit(x, tab.col) : fit a linear regression (x can be multiple columns of the table)
lr.score(x, tab.col) : get R2 score (x can be multiple columns of the table)
y_predict = lr.predict(x) : predict values
m, c = lr.coef_, lr.intercept_ : slope and intercept of y = mx + c (m is an array, one entry per column of x)
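
A minimal end-to-end sketch of splitting, fitting and scoring a linear regression. The data is a synthetic, noise-free line y = 3x + 2, chosen here so the recovered slope and intercept are easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: one feature column, target on an exact line.
x = np.arange(20, dtype=float).reshape(-1, 1)
y = 3.0 * x.ravel() + 2.0

X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0
)

lr = LinearRegression()
lr.fit(X_train, y_train)

m, c = lr.coef_, lr.intercept_    # read the fitted slope(s) and intercept
r2 = lr.score(X_test, y_test)     # R2 on data the model has not seen

print(m, c, r2)
```

Scoring on the held-out test split, not on the training data, gives a fairer picture of how the model generalizes.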
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(x_train, y_train) : to get logistic regression
y_pred = classifier.predict(x_test)

from sklearn.svm import SVC
classifier = SVC(kernel='rbf', C=1, gamma=0.1) : kernel is 'rbf', 'linear' or 'poly'; values to try: C in [0.01, 0.1, 1, 10], gamma in [0.01, 0.1, 1]
classifier.fit(x_train, y_train) : to get svm
y_pred = classifier.predict(x_test)

from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=n) : to get with n neighbors
classifier.fit(x_train, y_train) : to get K neighbor Classification
y_pred = classifier.predict(x_test)

from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='gini', max_depth=3, min_samples_split=2) : criterion is 'gini' or 'entropy'; values to try for max_depth: [2, 3, 4]; min_samples_split is an int
classifier.fit(x_train, y_train) : to get decision tree classification
y_pred = classifier.predict(x_test)
from sklearn.ensemble import RandomForestClassifier
clf_rf = RandomForestClassifier(criterion='gini', n_estimators=100, max_depth=3, min_samples_split=5, random_state=43) : criterion is 'gini' or 'entropy'; n_estimators is an int; values to try: max_depth in [3, 4], min_samples_split in [5, 7]
clr_rf = clf_rf.fit(x_train, y_train) : to get random forest
y_pred = clr_rf.predict(x_test)

from xgboost import XGBClassifier
classifier = XGBClassifier()
classifier.fit(x_train, y_train) : to get XGB classification
y_pred = classifier.predict(x_test)

from sklearn.svm import SVR
clf = SVR()
clf.fit(X_train, y_train) : to run svr
predicted = clf.predict(X_test)

import xgboost as xgb
clf = xgb.XGBRegressor()
clf.fit(X_train, y_train) : to get XGB regression
predicted = clf.predict(X_test)
from sklearn.tree import DecisionTreeRegressor
clf = DecisionTreeRegressor()
clf.fit(X_train, y_train) : to get decision tree regression
predicted = clf.predict(X_test)

from sklearn.ensemble import RandomForestRegressor
clf = RandomForestRegressor()
clf.fit(X_train, y_train) : to get random forest regression
predicted = clf.predict(X_test)

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


mean_absolute_error(y, y_predict) : give mean absolute error between actual and predicted values
mean_squared_error(y, y_predict) : give mean squared error between actual and predicted values
np.sqrt(mean_squared_error(y, y_predict)) : give root mean squared error between actual and predicted values
r2_score(y, y_predict) : give R2 score between actual and predicted values (same as lr.score(x, tab.col))
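
The regression metrics can be checked by hand on four values. The numbers below are made up so that each result is easy to verify.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y = [3, 5, 7, 9]
y_predict = [2, 5, 8, 11]          # errors: 1, 0, -1, -2

mae = mean_absolute_error(y, y_predict)   # (1 + 0 + 1 + 2) / 4
mse = mean_squared_error(y, y_predict)    # (1 + 0 + 1 + 4) / 4
rmse = np.sqrt(mse)                       # square root of the MSE
r2 = r2_score(y, y_predict)               # 1 - 6 / 20

print(mae, mse, rmse, r2)
```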
from sklearn.metrics import confusion_matrix, classification_report
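cm = confusion_matrix(y_true, y_predict) : gives confusion matrix (rows are actual classes, columns are predicted classes; for labels 0 and 1 the layout is)
[TN FP]
[FN TP]
report = classification_report(y_true, y_predict) : gives report of classification (precision, recall and F1 per class)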
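
For binary labels 0 and 1, scikit-learn puts actual classes on the rows and predicted classes on the columns, so the top-left cell counts true negatives. The sketch below uses made-up labels with deliberately unequal error counts to make the layout visible.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)

# ravel() flattens the 2x2 matrix row by row: TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()

acc = accuracy_score(y_true, y_pred)   # (TP + TN) / total

print(cm)
print(tn, fp, fn, tp, acc)
```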

from sklearn.metrics import f1_score, accuracy_score
ac = accuracy_score(y_test,clf_rf.predict(x_test)) : to get accuracy score

from sklearn.feature_selection import SelectKBest, chi2
# find best scored 5 features
select_feature = SelectKBest(chi2, k=5).fit(x_train, y_train)
print('Score list:', select_feature.scores_) : to get best scores

from sklearn.feature_selection import RFE
# Create the RFE object and rank each feature
clf_rf_3 = RandomForestClassifier()
rfe = RFE(estimator=clf_rf_3, n_features_to_select=5, step=1)
rfe = rfe.fit(x_train, y_train) : to get RFE ranks

from sklearn.feature_selection import RFECV
# The "accuracy" scoring is proportional to the number of correct classifications
clf_rf_4 = RandomForestClassifier() 
rfecv = RFECV(estimator=clf_rf_4, step=1, cv=5,scoring='accuracy')   #5-fold cross-validation
rfecv = rfecv.fit(x_train, y_train)
print('Optimal number of features :', rfecv.n_features_)
print('Best features :', x_train.columns[rfecv.support_]) : to get rfecv

from sklearn.preprocessing import LabelEncoder
lb = LabelEncoder()
tab['col'] = lb.fit_transform(tab['col']) : convert the values of one column to integer codes
for i in tab.columns:
    tab[i] = lb.fit_transform(tab[i]) : convert the values of every column to integer codes
from sklearn.preprocessing import OneHotEncoder
ob = OneHotEncoder()
encoded = ob.fit_transform(tab[['col']]).toarray() : convert a column into one 0/1 column per category (the input must be 2D; the result is a new array, not a replacement column)

from sklearn.manifold import TSNE
tn = TSNE(n_components=2, random_state=0)
xn = tn.fit_transform(tab)

from sklearn.model_selection import GridSearchCV
gs = GridSearchCV(model, parameters, cv=n, scoring='f1_macro')
gs.fit(X_train, y_train)

from sklearn.cluster import KMeans


import numpy as np
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
kmeans = KMeans(n_clusters=2, random_state=0).fit(X) : to get k-means clusters
kmeans.labels_ : cluster label of each training sample
ypredict = kmeans.predict(X)
from sklearn.decomposition import PCA
pca = PCA(n_components=n)
tabpca = pca.fit_transform(tab) : project a table onto its first n principal components (dimensionality reduction)

from sklearn.preprocessing import StandardScaler


sc = StandardScaler()
tabsc = sc.fit_transform(tab) : to standardize a table into an array
(scaled value = (actual - mean) / standard deviation)
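
The scaling formula can be checked against a manual computation. A minimal sketch with a made-up 3x2 table; StandardScaler works column by column and uses the population standard deviation, which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

tab = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 30.0]])

sc = StandardScaler()
tabsc = sc.fit_transform(tab)

# Same result by hand: subtract each column's mean, divide by its std.
manual = (tab - tab.mean(axis=0)) / tab.std(axis=0)

print(tabsc)
```

After scaling, every column has mean 0 and standard deviation 1, whatever its original units were.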

from sklearn.preprocessing import MinMaxScaler


mns = MinMaxScaler(feature_range=(n, m)) : to scale values between n and m
tabmns = mns.fit_transform(tab)
from sklearn.preprocessing import normalize
tabnorm = normalize(tab, norm='l1') : to normalize each row of a table into an array

wss = []
k = range(1, 20)
for i in k:
    km = KMeans(n_clusters=i)
    km.fit(x)
    wss.append(km.inertia_)
plt.plot(k, wss)
plt.show() : Elbow method (pick the k where the within-cluster sum of squares stops dropping sharply)
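
A runnable sketch of the elbow method without the plot. The data is synthetic: three well-separated blobs generated here for the example, so the drop in inertia should flatten after k = 3. The blob centres, the noise scale and n_init=10 are assumptions of this sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
centers = np.array([[0, 0], [10, 10], [20, 0]])

# 30 points scattered around each of the three centres.
x = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

ks = range(1, 7)
wss = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(x)
    wss.append(km.inertia_)   # within-cluster sum of squares

for k, w in zip(ks, wss):
    print(k, round(w, 1))
```

The printed values fall steeply up to k = 3 and barely change afterwards, which is the "elbow" a plot of the same numbers would show.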

from sklearn.metrics import silhouette_score


silhouette_score(tabpca, KMeans(n_clusters=n).fit_predict(tabpca)) : to calculate silhouette score
(higher the score, better the clusters)

from sklearn.cluster import MeanShift


ms = MeanShift(bandwidth = 2, bin_seeding = True)
ms.fit(x) : to get Mean Shift clusters
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
gscv = GridSearchCV(classifier, param_grid = dict)
gscv.fit(x_train,y_train)
print(gscv.best_params_) : to get the best parameters for a classifier
print(gscv.best_estimator_) : to get the best estimators for a classifier
gscv.predict(x_test)

rmcv = RandomizedSearchCV(classifier, param_distributions=dict)
rmcv.fit(x_train, y_train)
print(rmcv.best_params_) : to get the best parameters for a classifier
print(rmcv.best_estimator_) : to get the best estimator for a classifier
rmcv.predict(x_test)
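
A minimal grid search sketch. The parameter grid is a dictionary that maps each parameter name to the list of values to try; every combination is cross-validated. The iris dataset, the SVC estimator and the specific grid are choices made for this example only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# 2 kernels x 3 values of C = 6 candidate settings.
param_grid = {"kernel": ["rbf", "linear"], "C": [0.1, 1, 10]}

gscv = GridSearchCV(SVC(), param_grid=param_grid, cv=5)
gscv.fit(X, y)

print(gscv.best_params_)   # best combination found
print(gscv.best_score_)    # its mean cross-validated accuracy
```

RandomizedSearchCV takes the same kind of dictionary through param_distributions but samples a fixed number of combinations instead of trying all of them.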

from sklearn.model_selection import LeaveOneOut, KFold, StratifiedKFold


n = np.array([1, 2, 6, 3, 2, 6, 8, 7, 3, 5, 8, 4, 3, 7, 3, 7])
km = KFold(n_splits=4, shuffle=False)
for train, test in km.split(n):
    print("Train: ", n[train], " Test: ", n[test])
n = np.array([1, 2, 6, 3, 2, 6, 8, 7, 3, 5, 8, 4, 3, 7, 3, 7])
y = np.array([1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0])
sk = StratifiedKFold(n_splits=4, shuffle=False)
for train, test in sk.split(n, y):
    print("Train: ", n[train], " Test: ", n[test])

lt = LeaveOneOut()
for train, test in lt.split(n):
    print("Train: ", n[train], " Test: ", n[test])
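
The difference between KFold and StratifiedKFold shows up when the labels are ordered. In this sketch the data is made up: the first eight samples are class 0 and the last eight are class 1. The code counts how many class 1 samples land in each test fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

n = np.arange(16)
y = np.array([0] * 8 + [1] * 8)

# Plain KFold cuts the data in order, so folds can contain a single class.
kf = KFold(n_splits=4, shuffle=False)
kf_pos = [int(y[test].sum()) for _, test in kf.split(n)]

# StratifiedKFold keeps the class ratio the same in every fold.
sk = StratifiedKFold(n_splits=4, shuffle=False)
sk_pos = [int(y[test].sum()) for _, test in sk.split(n, y)]

print(kf_pos, sk_pos)
```

Stratification needs the labels, which is why split() takes y as a second argument there.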
