Python DataScience Cheat-Sheet

This document provides a summary of key Python concepts for data science including: 1) Python basics like variables, data types, calculations and type conversions. 2) Common data structures like lists and NumPy arrays - how to access elements, slice, operate on them and common methods. 3) Popular Python libraries for data analysis, machine learning and scientific computing like NumPy, Pandas and Matplotlib.

Uploaded by

Zain UL ABIDIN
Copyright
© All Rights Reserved

Python For Data Science Cheat Sheet Lists Also see NumPy Arrays Libraries

>>> a = 'is' Import libraries


Python Basics >>> b = 'nice' >>> import numpy Data analysis Machine learning
Learn More Python for Data Science Interactively at www.datacamp.com >>> my_list = ['my', 'list', a, b] >>> import numpy as np
>>> my_list2 = [[4,5,6,7], [3,4,5,6]] Selective import
>>> from math import pi Scientific computing 2D plotting
Variables and Data Types Selecting List Elements Index starts at 0
Subset Install Python
Variable Assignment
>>> my_list[1] Select item at index 1
>>> x=5 Select 3rd last item
>>> my_list[-3]
>>> x
Slice
5 >>> my_list[1:3] Select items at index 1 and 2
Calculations With Variables >>> my_list[1:] Select items after index 0
>>> my_list[:3] Select items before index 3 Leading open data science platform Free IDE that is included Create and share
>>> x+2 Sum of two variables powered by Python with Anaconda documents with live code,
>>> my_list[:] Copy my_list
7 visualizations, text, ...
>>> x-2 Subtraction of two variables
Subset Lists of Lists
>>> my_list2[1][0] my_list[list][itemOfList]
3
>>> my_list2[1][:2] Numpy Arrays Also see Lists
>>> x*2 Multiplication of two variables
>>> my_list = [1, 2, 3, 4]
10 List Operations
>>> x**2 Exponentiation of a variable >>> my_array = np.array(my_list)
25 >>> my_list + my_list >>> my_2darray = np.array([[1,2,3],[4,5,6]])
>>> x%2 Remainder of a variable ['my', 'list', 'is', 'nice', 'my', 'list', 'is', 'nice']
Selecting Numpy Array Elements Index starts at 0
1 >>> my_list * 2
>>> x/float(2) Division of a variable ['my', 'list', 'is', 'nice', 'my', 'list', 'is', 'nice'] Subset
2.5 >>> my_list2 > 4 >>> my_array[1] Select item at index 1
True 2
Types and Type Conversion Slice
List Methods >>> my_array[0:2] Select items at index 0 and 1
str() '5', '3.45', 'True' Variables to strings
>>> my_list.index(a) Get the index of an item array([1, 2])

int() 5, 3, 1 Variables to integers >>> my_list.count(a) Count an item Subset 2D Numpy arrays
>>> my_list.append('!') Append an item at a time >>> my_2darray[:,0] my_2darray[rows, columns]
float() 5.0, 1.0 Variables to floats >>> my_list.remove('!') Remove an item array([1, 4])
>>> del(my_list[0:1]) Remove an item Numpy Array Operations
bool() True, True, True Variables to booleans >>> my_list.reverse() Reverse the list
>>> my_array > 3
>>> my_list.extend('!') Append an item array([False, False, False, True], dtype=bool)
>>> my_list.pop(-1) Remove an item
Asking For Help >>> my_list.insert(0,'!') Insert an item
>>> my_array * 2
array([2, 4, 6, 8])
>>> help(str) >>> my_list.sort() Sort the list >>> my_array + np.array([5, 6, 7, 8])
array([6, 8, 10, 12])
Strings
>>> my_string = 'thisStringIsAwesome' Numpy Array Functions
String Operations Index starts at 0
>>> my_string >>> my_array.shape Get the dimensions of the array
'thisStringIsAwesome' >>> my_string[3] >>> np.append(my_array, other_array) Append items to an array
>>> my_string[4:9] >>> np.insert(my_array, 1, 5) Insert items in an array
String Operations >>> np.delete(my_array,[1]) Delete items in an array
String Methods >>> np.mean(my_array) Mean of the array
>>> my_string * 2
'thisStringIsAwesomethisStringIsAwesome' >>> my_string.upper() String to uppercase >>> np.median(my_array) Median of the array
>>> my_string + 'Innit' >>> my_string.lower() String to lowercase >>> np.corrcoef(my_array) Correlation coefficient
'thisStringIsAwesomeInnit' >>> my_string.count('w') Count String elements >>> np.std(my_array) Standard deviation
>>> 'm' in my_string >>> my_string.replace('e', 'i') Replace String elements
True >>> my_string.strip() Strip whitespaces DataCamp
Learn Python for Data Science Interactively
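The list and string operations on this page can be collected into one short script. This is a minimal sketch that reuses the sheet's own names (`my_list`, `my_list2`, `my_string`); the other variable names are made up for illustration, and the values in the comments follow from standard Python 3 semantics.

```python
# Lists: indexing, slicing, and the difference between append and extend.
a, b = 'is', 'nice'
my_list = ['my', 'list', a, b]
my_list2 = [[4, 5, 6, 7], [3, 4, 5, 6]]

second_item = my_list[1]         # 'list' (index starts at 0)
third_last = my_list[-3]         # 'list'
middle = my_list[1:3]            # ['list', 'is']
shallow_copy = my_list[:]        # new list object holding the same items
nested_item = my_list2[1][0]     # 3

doubled = my_list * 2            # repeats the list, does not multiply items

shallow_copy.append('!')         # adds exactly one item
shallow_copy.extend(['a', 'b'])  # adds each item of the iterable
last = shallow_copy.pop(-1)      # removes and returns 'b'

# Strings support the same indexing and slicing.
my_string = 'thisStringIsAwesome'
fourth_char = my_string[3]       # 's'
piece = my_string[4:9]           # 'Strin'
w_count = my_string.count('w')   # 1
```

Because `my_list[:]` makes a copy, the `append`, `extend` and `pop` calls above leave `my_list` itself unchanged.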
Python For Data Science Cheat Sheet Inspecting Your Array Subsetting, Slicing, Indexing Also see Lists
>>> a.shape Array dimensions Subsetting
NumPy Basics >>> len(a) Length of array
>>> b.ndim Number of array dimensions
>>> a[2]
3
1 2 3 Select the element at the 2nd index
Learn Python for Data Science Interactively at www.DataCamp.com >>> e.size Number of array elements >>> b[1,2] 1.5 2 3 Select the element at row 1 column 2
>>> b.dtype Data type of array elements 6.0 4 5 6 (equivalent to b[1][2])
>>> b.dtype.name Name of data type
>>> b.astype(int) Convert an array to a different type Slicing
NumPy >>> a[0:2] 1 2 3 Select items at index 0 and 1
2
Asking For Help
array([1, 2])
The NumPy library is the core library for scientific computing in >>> np.info(np.ndarray.dtype)
>>> b[0:2,1] 1.5 2 3 Select items at rows 0 and 1 in column 1
array([ 2., 5.]) 4 5 6
Python. It provides a high-performance multidimensional array
Array Mathematics
1.5 2 3
object, and tools for working with these arrays. >>> b[:1] Select all items at row 0
array([[1.5, 2., 3.]]) 4 5 6 (equivalent to b[0:1, :])
Arithmetic Operations >>> c[1,...] Same as [1,:,:]
Use the following import convention: array([[[ 3., 2., 1.],
>>> import numpy as np [ 4., 5., 6.]]])
>>> g = a - b Subtraction
array([[-0.5, 0. , 0. ], >>> a[ : :-1] Reversed array a
NumPy Arrays [-3. , -3. , -3. ]])
array([3, 2, 1])
Boolean Indexing
1D array 2D array 3D array >>> np.subtract(a,b) Subtraction
>>> a[a<2] Select elements from a less than 2
>>> b + a Addition 1 2 3

axis 1 axis 2 array([[ 2.5, 4. , 6. ], array([1])


1 2 3 axis 1 [ 5. , 7. , 9. ]]) Fancy Indexing
1.5 2 3 >>> np.add(b,a) Addition >>> b[[1, 0, 1, 0],[0, 1, 2, 0]] Select elements (1,0),(0,1),(1,2) and (0,0)
axis 0 axis 0
4 5 6 >>> a / b Division array([ 4. , 2. , 6. , 1.5])
array([[ 0.66666667, 1. , 1. ], >>> b[[1, 0, 1, 0]][:,[0,1,2,0]] Select a subset of the matrix’s rows
[ 0.25 , 0.4 , 0.5 ]]) array([[ 4. ,5. , 6. , 4. ], and columns
>>> np.divide(a,b) Division [ 1.5, 2. , 3. , 1.5],
Creating Arrays >>> a * b Multiplication [ 4. , 5. , 6. , 4. ],
array([[ 1.5, 4. , 9. ], [ 1.5, 2. , 3. , 1.5]])

>>> a = np.array([1,2,3]) [ 4. , 10. , 18. ]])


>>> b = np.array([(1.5,2,3), (4,5,6)], dtype = float) >>> np.multiply(a,b) Multiplication Array Manipulation
>>> c = np.array([[(1.5,2,3), (4,5,6)], [(3,2,1), (4,5,6)]], >>> np.exp(b) Exponentiation
dtype = float) >>> np.sqrt(b) Square root Transposing Array
>>> np.sin(a) Print sines of an array >>> i = np.transpose(b) Permute array dimensions
Initial Placeholders >>> np.cos(b) Element-wise cosine >>> i.T Permute array dimensions
>>> np.log(a) Element-wise natural logarithm
>>> np.zeros((3,4)) Create an array of zeros >>> e.dot(f) Dot product
Changing Array Shape
>>> np.ones((2,3,4),dtype=np.int16) Create an array of ones array([[ 7., 7.], >>> b.ravel() Flatten the array
>>> d = np.arange(10,25,5) Create an array of evenly [ 7., 7.]]) >>> g.reshape(3,-1) Reshape, but don’t change data
spaced values (step value)
>>> np.linspace(0,2,9) Create an array of evenly Comparison Adding/Removing Elements
spaced values (number of samples) >>> h.resize((2,6)) Return a new array with shape (2,6)
>>> e = np.full((2,2),7) Create a constant array >>> a == b Element-wise comparison >>> np.append(h,g) Append items to an array
>>> f = np.eye(2) Create a 2X2 identity matrix array([[False, True, True], >>> np.insert(a, 1, 5) Insert items in an array
>>> np.random.random((2,2)) Create an array with random values [False, False, False]], dtype=bool) >>> np.delete(a,[1]) Delete items from an array
>>> np.empty((3,2)) Create an empty array >>> a < 2 Element-wise comparison Combining Arrays
array([True, False, False], dtype=bool)
>>> np.array_equal(a, b) Array-wise comparison >>> np.concatenate((a,d),axis=0) Concatenate arrays
I/O array([ 1, 2, 3, 10, 15, 20])
Stack arrays vertically (row-wise)
Aggregate Functions >>> np.vstack((a,b))
Saving & Loading On Disk array([[ 1. , 2. , 3. ],
[ 1.5, 2. , 3. ],
>>> a.sum() Array-wise sum [ 4. , 5. , 6. ]])
>>> np.save('my_array', a) >>> a.min() Array-wise minimum value >>> np.r_[e,f] Stack arrays vertically (row-wise)
>>> np.savez('array.npz', a, b) >>> b.max(axis=0) Maximum value of an array row >>> np.hstack((e,f)) Stack arrays horizontally (column-wise)
>>> np.load('my_array.npy') >>> b.cumsum(axis=1) Cumulative sum of the elements array([[ 7., 7., 1., 0.],
>>> a.mean() Mean [ 7., 7., 0., 1.]])
Saving & Loading Text Files >>> np.median(b) Median >>> np.column_stack((a,d)) Create stacked column-wise arrays
>>> np.loadtxt("myfile.txt") >>> np.corrcoef(a) Correlation coefficient array([[ 1, 10],
>>> np.genfromtxt("my_file.csv", delimiter=',') >>> np.std(b) Standard deviation [ 2, 15],
[ 3, 20]])
>>> np.savetxt("myarray.txt", a, delimiter=" ") >>> np.c_[a,d] Create stacked column-wise arrays
Copying Arrays Splitting Arrays
Data Types >>> h = a.view() Create a view of the array with the same data >>> np.hsplit(a,3) Split the array horizontally at the 3rd
>>> np.copy(a) Create a copy of the array [array([1]),array([2]),array([3])] index
>>> np.int64 Signed 64-bit integer types >>> np.vsplit(c,2) Split the array vertically at the 2nd index
>>> np.float32 Standard single-precision floating point >>> h = a.copy() Create a deep copy of the array [array([[[ 1.5, 2. , 3. ],
>>> np.complex128 Complex numbers represented by two 64-bit floats [ 4. , 5. , 6. ]]]),
array([[[ 3., 2., 1.],
>>> np.bool_ Boolean type storing TRUE and FALSE values
>>> np.object_ Python object type Sorting Arrays [ 4., 5., 6.]]])]

>>> np.bytes_ Fixed-length byte string type Sort an array
>>> np.str_ Fixed-length unicode type
>>> a.sort()
>>> c.sort(axis=0) Sort the elements of an array's axis DataCamp
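The NumPy entries above can be checked with a short script. This is a sketch that reuses the sheet's arrays `a` and `b`; the result variable names are illustrative. It also shows that median and correlation are module-level functions (`np.median`, `np.corrcoef`) and not array methods, and that a view shares memory with its source while a copy does not.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([(1.5, 2, 3), (4, 5, 6)], dtype=float)

# Broadcasting: the 1D array a is applied to every row of b.
diff = a - b              # [[-0.5, 0., 0.], [-3., -3., -3.]]
total = b + a             # [[2.5, 4., 6.], [5., 7., 9.]]

# Indexing and slicing.
element = b[1, 2]         # 6.0, same as b[1][2]
column = b[0:2, 1]        # array([2., 5.])
small = a[a < 2]          # array([1])
reversed_a = a[::-1].copy()  # array([3, 2, 1]); copied so later edits to a do not show

# Aggregates: some are methods, others are module-level functions.
col_max = b.max(axis=0)   # maximum of each column: [4., 5., 6.]
row_cumsum = b.cumsum(axis=1)
med = np.median(b)        # 3.5
corr = np.corrcoef(a, a)  # 2x2 correlation matrix

# Views share memory with the original; copies do not.
view = a.view()
copy = a.copy()
view[0] = 99              # also changes a[0], but not copy[0]
```

Note that `axis=0` reduces down the rows, so `b.max(axis=0)` gives one maximum per column.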
Python For Data Science Cheat Sheet Asking For Help Dropping
>>> help(pd.Series.loc)
Drop values from rows (axis=0)
Pandas Basics Selection Also see NumPy Arrays
>>> s.drop(['a', 'c'])
>>> df.drop('Country', axis=1) Drop values from columns(axis=1)
Learn Python for Data Science Interactively at www.DataCamp.com

>>> s['b'] Get one element Sort & Rank


Pandas
-5
>>> df.sort_index() Sort by labels along an axis
>>> df[1:] Get subset of a DataFrame >>> df.sort_values(by='Country') Sort by the values along an axis
The Pandas library is built on NumPy and provides easy-to-use Country Capital Population >>> df.rank() Assign ranks to entries
data structures and data analysis tools for the Python 1 India New Delhi 1303171035

programming language.
2 Brazil Brasília 207847528
Retrieving Series/DataFrame Information
Basic Information
Use the following import convention: By Position >>> df.shape (rows,columns)
>>> import pandas as pd >>> df.iloc[0, 0] Select single value by row & >>> df.index Describe index
'Belgium' column >>> df.columns Describe DataFrame columns
Pandas Data Structures >>> df.iat[0, 0] >>> df.info() Info on DataFrame
Series 'Belgium' >>> df.count() Number of non-NA values
Summary
A one-dimensional labeled array a 3 By Label
>>> df.loc[0, 'Country'] Select single value by row & >>> df.sum() Sum of values
capable of holding any data type b -5 'Belgium' column labels >>> df.cumsum() Cumulative sum of values
>>> df.min()/df.max() Minimum/maximum values
c 7
Index >>> df.at[0, 'Country'] >>> df.idxmin()/df.idxmax() Minimum/Maximum index value
d 4 'Belgium' >>> df.describe() Summary statistics
>>> df.mean() Mean of values
>>> s = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])
By Label/Position >>> df.median() Median of values
>>> df.iloc[2] Select single row of subset of rows
DataFrame Country Brazil
Capital Brasília
Population 207847528 Applying Functions
Columns >>> f = lambda x: x*2
Country Capital Population A two-dimensional labeled >>> df.loc[:,'Capital'] Select a single column of >>> df.apply(f) Apply function
>>> df.applymap(f) Apply function element-wise
data structure with columns 0 Brussels subset of columns
0 Belgium Brussels 11190846 1 New Delhi
of potentially different types 2 Brasília Data Alignment
1 India New Delhi 1303171035
Index Select rows and columns
2 Brazil Brasília 207847528
>>> df.loc[1,'Capital']
Internal Data Alignment
'New Delhi'
NA values are introduced in the indices that don’t overlap:
Boolean Indexing
>>> data = {'Country': ['Belgium', 'India', 'Brazil'], >>> s3 = pd.Series([7, -2, 3], index=['a', 'c', 'd'])
>>> s[~(s > 1)] Series s where value is not >1
'Capital': ['Brussels', 'New Delhi', 'Brasília'], >>> s[(s < -1) | (s > 2)] s where value is <-1 or >2 >>> s + s3
'Population': [11190846, 1303171035, 207847528]} >>> df[df['Population']>1200000000] Use filter to adjust DataFrame a 10.0
b NaN
>>> df = pd.DataFrame(data,
c 5.0
columns=['Country', 'Capital', 'Population']) >>> s['a'] = 6 Set index a of Series s to 6
d 7.0

I/O Arithmetic Operations with Fill Methods


You can also do the internal data alignment yourself with
Read and Write to CSV Read and Write to SQL Query or Database Table
the help of the fill methods:
>>> pd.read_csv( , header=None, nrows=5) >>> from sqlalchemy import create_engine
>>> df.to_csv('myDataFrame.csv') >>> engine = create_engine('sqlite:///:memory:') a 10.0
>>> pd.read_sql("SELECT * FROM my_table;", engine) b -5.0
Read and Write to Excel c 5.0
>>> pd.read_sql_table('my_table', engine) d 7.0
>>> pd.read_excel( ) >>> pd.read_sql_query("SELECT * FROM my_table;", engine)
>>> df.to_excel('dir/myDataFrame.xlsx', sheet_name='Sheet1')
read_sql()is a convenience wrapper around read_sql_table() and
Read multiple sheets from the same file
read_sql_query()
>>> xlsx = pd.ExcelFile( )
>>> df = pd.read_excel(xlsx, 'Sheet1') >>> df.to_sql('myDf', engine) DataCamp
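The selection and data-alignment entries above can be tried end to end. The sketch below rebuilds the sheet's `df`, `s` and `s3`; the result variable names are illustrative. The `Series.add(..., fill_value=0)` call is one way to get the "fill methods" output shown on the sheet (`b -5.0`). The call itself did not survive in this copy of the sheet, so treat it as a plausible reconstruction.

```python
import pandas as pd

data = {'Country': ['Belgium', 'India', 'Brazil'],
        'Capital': ['Brussels', 'New Delhi', 'Brasília'],
        'Population': [11190846, 1303171035, 207847528]}
df = pd.DataFrame(data, columns=['Country', 'Capital', 'Population'])

# Scalar access: .iloc/.iat take integer positions, .loc/.at take labels.
by_position = df.iloc[0, 0]         # 'Belgium'
by_label = df.loc[1, 'Capital']     # 'New Delhi'
fast_scalar = df.at[2, 'Country']   # 'Brazil'

# Boolean indexing keeps only the rows where the condition is True.
large = df[df['Population'] > 1200000000]

# Alignment: labels missing from either Series produce NaN...
s = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])
s3 = pd.Series([7, -2, 3], index=['a', 'c', 'd'])
aligned = s + s3                    # a 10.0, b NaN, c 5.0, d 7.0

# ...unless a fill value stands in for the missing entry.
filled = s.add(s3, fill_value=0)    # a 10.0, b -5.0, c 5.0, d 7.0
```

Square brackets are required for all four accessors; `df.iat(0, 0)` with parentheses raises a `TypeError`.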
Python For Data Science Cheat Sheet Plot Anatomy & Workflow
Plot Anatomy Workflow
Matplotlib Axes/Subplot The basic steps to creating plots with matplotlib are:
Learn Python Interactively at www.DataCamp.com 1 Prepare data 2 Create plot 3 Plot 4 Customize plot 5 Save plot 6 Show plot
>>> import matplotlib.pyplot as plt
>>> x = [1,2,3,4] Step 1
>>> y = [10,20,25,30]
>>> fig = plt.figure() Step 2
Matplotlib Y-axis Figure >>> ax = fig.add_subplot(111) Step 3
>>> ax.plot(x, y, color='lightblue', linewidth=3) Step 3, 4
Matplotlib is a Python 2D plotting library which produces
>>> ax.scatter([2,4,6],
publication-quality figures in a variety of hardcopy formats [5,15,25],
color='darkgreen',
and interactive environments across marker='^')
platforms. >>> ax.set_xlim(1, 6.5)
X-axis
>>> plt.savefig('foo.png')

1 Prepare The Data Also see Lists & NumPy


>>> plt.show() Step 6

1D Data 4 Customize Plot


>>> import numpy as np
>>> x = np.linspace(0, 10, 100)
Colors, Color Bars & Color Maps Mathtext
>>> y = np.cos(x) >>> plt.plot(x, x, x, x**2, x, x**3) >>> plt.title(r'$\sigma_i=15$', fontsize=20)
>>> z = np.sin(x) >>> ax.plot(x, y, alpha = 0.4)
>>> ax.plot(x, y, c='k') Limits, Legends & Layouts
2D Data or Images >>> fig.colorbar(im, orientation='horizontal')
>>> im = ax.imshow(img, Limits & Autoscaling
>>> data = 2 * np.random.random((10, 10))
>>> data2 = 3 * np.random.random((10, 10))
cmap='seismic') >>> ax.margins(x=0.0,y=0.1) Add padding to a plot
>>> Y, X = np.mgrid[-3:3:100j, -3:3:100j] >>> ax.axis('equal') Set the aspect ratio of the plot to 1
Markers >>> ax.set(xlim=[0,10.5],ylim=[-1.5,1.5]) Set limits for x-and y-axis
>>> U = -1 - X**2 + Y
>>> V = 1 + X - Y**2 >>> fig, ax = plt.subplots() >>> ax.set_xlim(0,10.5) Set limits for x-axis
>>> from matplotlib.cbook import get_sample_data >>> ax.scatter(x,y,marker=".") Legends
>>> img = np.load(get_sample_data('axes_grid/bivariate_normal.npy')) >>> ax.plot(x,y,marker="o") >>> ax.set(title='An Example Axes', Set a title and x-and y-axis labels
ylabel='Y-Axis',
Linestyles xlabel='X-Axis')
2 Create Plot >>> plt.plot(x,y,linewidth=4.0)
>>> ax.legend(loc='best')
Ticks
No overlapping plot elements

>>> import matplotlib.pyplot as plt >>> plt.plot(x,y,ls='solid') >>> ax.xaxis.set(ticks=range(1,5), Manually set x-ticks
>>> plt.plot(x,y,ls='--') ticklabels=[3,100,-12,"foo"])
Figure >>> plt.plot(x,y,'--',x**2,y**2,'-.')
>>> plt.setp(lines,color='r',linewidth=4.0)
>>> ax.tick_params(axis='y', Make y-ticks longer and go in and out
direction='inout',
>>> fig = plt.figure() length=10)
>>> fig2 = plt.figure(figsize=plt.figaspect(2.0)) Text & Annotations
Subplot Spacing
Axes >>> ax.text(1, >>> fig3.subplots_adjust(wspace=0.5, Adjust the spacing between subplots
-2.1, hspace=0.3,
All plotting is done with respect to an Axes. In most cases, a 'Example Graph', left=0.125,
subplot will fit your needs. A subplot is an axes on a grid system.
style='italic') right=0.9,
>>> ax.annotate("Sine", top=0.9,
>>> fig.add_axes() xy=(8, 0), bottom=0.1)
>>> ax1 = fig.add_subplot(221) # row-col-num xycoords='data', >>> fig.tight_layout() Fit subplot(s) in to the figure area
xytext=(10.5, 0),
>>> ax3 = fig.add_subplot(212) textcoords='data', Axis Spines
>>> fig3, axes = plt.subplots(nrows=2,ncols=2) arrowprops=dict(arrowstyle="->", >>> ax1.spines['top'].set_visible(False) Make the top axis line for a plot invisible
>>> fig4, axes2 = plt.subplots(ncols=3) connectionstyle="arc3"),) >>> ax1.spines['bottom'].set_position(('outward',10)) Move the bottom axis line outward

3 Plotting Routines 5 Save Plot


Save figures
1D Data Vector Fields >>> plt.savefig('foo.png')
>>> lines = ax.plot(x,y) Draw points with lines or markers connecting them >>> axes[0,1].arrow(0,0,0.5,0.5) Add an arrow to the axes Save transparent figures
>>> ax.scatter(x,y) Draw unconnected points, scaled or colored >>> axes[1,1].quiver(y,z) Plot a 2D field of arrows >>> plt.savefig('foo.png', transparent=True)
>>> axes[0,0].bar([1,2,3],[3,4,5]) Plot vertical rectangles (constant width) >>> axes[0,1].streamplot(X,Y,U,V) Plot 2D vector fields
>>> axes[1,0].barh([0.5,1,2.5],[0,1,2]) Plot horizontal rectangles (constant height)
>>> axes[1,1].axhline(0.45) Draw a horizontal line across axes Data Distributions 6 Show Plot
>>> axes[0,1].axvline(0.65) Draw a vertical line across axes >>> ax1.hist(y) Plot a histogram
>>> ax.fill(x,y,color='blue') Draw filled polygons >>> ax3.boxplot(y) Make a box and whisker plot >>> plt.show()
>>> ax.fill_between(x,y,color='yellow') Fill between y-values and 0 >>> ax3.violinplot(z) Make a violin plot
2D Data or Images Close & Clear
>>> fig, ax = plt.subplots() >>> plt.cla() Clear an axis
>>> axes2[0].pcolor(data2) Pseudocolor plot of 2D array Clear the entire figure
>>> im = ax.imshow(img, Colormapped or RGB arrays >>> axes2[0].pcolormesh(data) Pseudocolor plot of 2D array
>>> plt.clf()
cmap='gist_earth', >>> plt.close() Close a window
interpolation='nearest', >>> CS = plt.contour(Y,X,U) Plot contours
vmin=-2, >>> axes2[2].contourf(data1) Plot filled contours
vmax=2) >>> axes2[2]= ax.clabel(CS) Label a contour plot DataCamp
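The six workflow steps (prepare data, create plot, plot, customize, save, show) can be run as one script. This is a minimal sketch, not the sheet's exact example. It assumes the non-interactive `Agg` backend and writes the PNG to an in-memory buffer instead of `foo.png`, so it runs without a display or a writable directory.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, no window needed
import matplotlib.pyplot as plt
import numpy as np

# 1. Prepare data
x = np.linspace(0, 10, 100)
y = np.cos(x)

# 2. Create plot: one figure holding one axes
fig, ax = plt.subplots()

# 3. Plot
lines = ax.plot(x, y, color='lightblue', linewidth=3, label='cos(x)')

# 4. Customize
ax.set(title='An Example Axes', xlabel='X-Axis', ylabel='Y-Axis',
       xlim=[0, 10.5], ylim=[-1.5, 1.5])
ax.legend(loc='best')

# 5. Save (pass a file name such as 'foo.png' to write to disk instead)
buffer = io.BytesIO()
fig.savefig(buffer, format='png', transparent=True)

# 6. Show: in an interactive session call plt.show() here; then release the figure
plt.close(fig)
```

Saving before showing matters in interactive sessions, because closing the window can leave `savefig` with an empty figure.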
Python For Data Science Cheat Sheet 3 Plotting With Seaborn
Seaborn Axis Grids
Learn Data Science Interactively at www.DataCamp.com >>> g = sns.FacetGrid(titanic, Subplot grid for plotting conditional >>> h = sns.PairGrid(iris) Subplot grid for plotting pairwise
col="survived", relationships >>> h = h.map(plt.scatter) relationships
row="sex") >>> sns.pairplot(iris) Plot pairwise bivariate distributions
>>> g = g.map(plt.hist,"age") >>> i = sns.JointGrid(x="x", Grid for bivariate plot with marginal
>>> sns.factorplot(x="pclass", Draw a categorical plot onto a y="y", univariate plots
y="survived", Facetgrid data=data)
Statistical Data Visualization With Seaborn hue="sex",
data=titanic)
>>> i = i.plot(sns.regplot,
sns.distplot)
The Python visualization library Seaborn is based on >>> sns.lmplot(x="sepal_width", Plot data and regression model fits >>> sns.jointplot("sepal_length", Plot bivariate distribution
y="sepal_length", across a FacetGrid "sepal_width",
matplotlib and provides a high-level interface for drawing hue="species", data=iris,
attractive statistical graphics. data=iris) kind='kde')

Categorical Plots Regression Plots


Make use of the following aliases to import the libraries: >>> sns.regplot(x="sepal_width", Plot data and a linear regression
Scatterplot
>>> import matplotlib.pyplot as plt >>> sns.stripplot(x="species", Scatterplot with one y="sepal_length", model fit
data=iris,
>>> import seaborn as sns y="petal_length", categorical variable
data=iris) ax=ax)
The basic steps to creating plots with Seaborn are: >>> sns.swarmplot(x="species", Categorical scatterplot with Distribution Plots
y="petal_length", non-overlapping points
1. Prepare some data data=iris) >>> plot = sns.distplot(data.y, Plot univariate distribution
2. Control figure aesthetics Bar Chart kde=False,
color="b")
3. Plot with Seaborn >>> sns.barplot(x="sex", Show point estimates and
y="survived", confidence intervals as Matrix Plots
4. Further customize your plot hue="class", rectangular bars
>>> sns.heatmap(uniform_data,vmin=0,vmax=1) Heatmap
data=titanic)
>>> import matplotlib.pyplot as plt Count Plot
>>> import seaborn as sns
>>> tips = sns.load_dataset("tips") Step 1
>>> sns.set_style("whitegrid") Step 2
>>> sns.countplot(x="deck", Show count of observations
data=titanic,
4 Further Customizations Also see Matplotlib

>>> g = sns.lmplot(x="tip", Step 3


Point Plot
palette="Greens_d")
Axisgrid Objects
y="total_bill",
data=tips, >>> sns.pointplot(x="class", Show point estimates and >>> g.despine(left=True) Remove left spine
aspect=2) y="survived", confidence intervals with >>> g.set_ylabels("Survived") Set the labels of the y-axis
>>> g = (g.set_axis_labels("Tip","Total bill(USD)"). hue="sex", scatterplot glyphs >>> g.set_xticklabels(rotation=45) Set the tick labels for x
set(xlim=(0,10),ylim=(0,100))) data=titanic, >>> g.set_axis_labels("Survived", Set the axis labels
Step 4 palette={"male":"g", "Sex")
>>> plt.title("title")
>>> plt.show() Step 5 "female":"m"}, >>> h.set(xlim=(0,5), Set the limit and ticks of the
markers=["^","o"], ylim=(0,5), x-and y-axis
linestyles=["-","--"]) xticks=[0,2.5,5],

1
Boxplot yticks=[0,2.5,5])
Data Also see Lists, NumPy & Pandas >>> sns.boxplot(x="alive", Boxplot
Plot
y="age",
>>> import pandas as pd hue="adult_male",
>>> import numpy as np >>> plt.title("A Title") Add plot title
data=titanic)
>>> plt.ylabel("Survived") Adjust the label of the y-axis
>>> uniform_data = np.random.rand(10, 12) >>> sns.boxplot(data=iris,orient="h") Boxplot with wide-form data >>> plt.xlabel("Sex") Adjust the label of the x-axis
>>> data = pd.DataFrame({'x':np.arange(1,101),
'y':np.random.normal(0,4,100)}) Violinplot >>> plt.ylim(0,100) Adjust the limits of the y-axis
>>> sns.violinplot(x="age", Violin plot >>> plt.xlim(0,10) Adjust the limits of the x-axis
Seaborn also offers built-in data sets: y="sex", >>> plt.setp(ax,yticks=[0,5]) Adjust a plot property
>>> titanic = sns.load_dataset("titanic") hue="survived", >>> plt.tight_layout() Adjust subplot params
>>> iris = sns.load_dataset("iris") data=titanic)

2 Figure Aesthetics Also see Matplotlib


5 Show or Save Plot Also see Matplotlib
>>> plt.show() Show the plot
Context Functions >>> plt.savefig("foo.png") Save the plot as a figure
>>> f, ax = plt.subplots(figsize=(5,6)) Create a figure and one subplot >>> plt.savefig("foo.png", Save transparent figure
>>> sns.set_context("talk") Set context to "talk" transparent=True)
>>> sns.set_context("notebook", Set context to "notebook",
Seaborn styles font_scale=1.5, Scale font elements and
>>> sns.set() (Re)set the seaborn default
rc={"lines.linewidth":2.5}) override param mapping Close & Clear Also see Matplotlib
>>> sns.set_style("whitegrid") Set the matplotlib parameters Color Palette >>> plt.cla() Clear an axis
>>> sns.set_style("ticks", Set the matplotlib parameters >>> plt.clf() Clear an entire figure
{"xtick.major.size":8, >>> sns.set_palette("husl",3) Define the color palette >>> plt.close() Close a window
"ytick.major.size":8}) >>> sns.color_palette("husl") Use with with to temporarily set palette
Return a dict of params or use with >>> flatui = ["#9b59b6","#3498db","#95a5a6","#e74c3c","#34495e","#2ecc71"]
>>> sns.axes_style("whitegrid")
with to temporarily set the style >>> sns.set_palette(flatui) Set your own color palette DataCamp
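The five Seaborn steps (prepare data, set aesthetics, plot, customize, show or save) can be sketched without downloading a built-in data set. The example below assumes a synthetic DataFrame shaped like the sheet's `data` (columns `x` and `y`), the off-screen `Agg` backend, and an in-memory buffer in place of `foo.png`. The seeded generator `rng` is an illustrative choice, not part of the sheet.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, no window needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Step 1: prepare data (synthetic, so nothing is downloaded)
rng = np.random.default_rng(0)
data = pd.DataFrame({"x": np.arange(1, 101),
                     "y": rng.normal(0, 4, 100)})

# Step 2: control figure aesthetics
sns.set_style("whitegrid")
f, ax = plt.subplots(figsize=(5, 6))

# Step 3: plot data and a linear regression fit onto the existing axes
returned_ax = sns.regplot(x="x", y="y", data=data, ax=ax)

# Step 4: further customize through the matplotlib axes
ax.set(xlabel="x", ylabel="y", title="A Title")
plt.tight_layout()

# Step 5: save (or call plt.show() in an interactive session)
buffer = io.BytesIO()
f.savefig(buffer, format="png")
plt.close(f)
```

Axes-level functions such as `regplot` draw onto the `ax` they are given, which is why plain matplotlib calls can customize the result afterwards.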
Python For Data Science Cheat Sheet Create Your Model Evaluate Your Model’s Performance
Supervised Learning Estimators Classification Metrics
Scikit-Learn
Learn Python for data science Interactively at www.DataCamp.com Linear Regression Accuracy Score
>>> from sklearn.linear_model import LinearRegression >>> knn.score(X_test, y_test) Estimator score method
>>> lr = LinearRegression() >>> from sklearn.metrics import accuracy_score Metric scoring functions
Support Vector Machines (SVM) >>> accuracy_score(y_test, y_pred)

Scikit-learn >>> from sklearn.svm import SVC Classification Report


>>> svc = SVC(kernel='linear') >>> from sklearn.metrics import classification_report Precision, recall, f1-score
Scikit-learn is an open source Python library that Naive Bayes >>> print(classification_report(y_test, y_pred)) and support
implements a range of machine learning, >>> from sklearn.naive_bayes import GaussianNB Confusion Matrix
>>> from sklearn.metrics import confusion_matrix
preprocessing, cross-validation and visualization >>> gnb = GaussianNB() >>> print(confusion_matrix(y_test, y_pred))
algorithms using a unified interface. KNN
>>> from sklearn import neighbors Regression Metrics
A Basic Example >>> knn = neighbors.KNeighborsClassifier(n_neighbors=5)
Mean Absolute Error
Unsupervised Learning Estimators
>>> from sklearn import neighbors, datasets, preprocessing
>>> from sklearn.model_selection import train_test_split >>> from sklearn.metrics import mean_absolute_error
>>> from sklearn.metrics import accuracy_score >>> y_true = [3, -0.5, 2]
>>> iris = datasets.load_iris() Principal Component Analysis (PCA) >>> mean_absolute_error(y_true, y_pred)
>>> X, y = iris.data[:, :2], iris.target >>> from sklearn.decomposition import PCA Mean Squared Error
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33) >>> pca = PCA(n_components=0.95) >>> from sklearn.metrics import mean_squared_error
>>> scaler = preprocessing.StandardScaler().fit(X_train)
K Means >>> mean_squared_error(y_test, y_pred)
>>> X_train = scaler.transform(X_train)
>>> X_test = scaler.transform(X_test) >>> from sklearn.cluster import KMeans R² Score
>>> knn = neighbors.KNeighborsClassifier(n_neighbors=5) >>> k_means = KMeans(n_clusters=3, random_state=0) >>> from sklearn.metrics import r2_score
>>> r2_score(y_true, y_pred)
>>> knn.fit(X_train, y_train)
>>> y_pred = knn.predict(X_test)
>>> accuracy_score(y_test, y_pred) Model Fitting Clustering Metrics
Adjusted Rand Index
Supervised learning
Loading The Data Also see NumPy & Pandas >>> lr.fit(X, y) Fit the model to the data >>> from sklearn.metrics import adjusted_rand_score
>>> adjusted_rand_score(y_true, y_pred)
>>> knn.fit(X_train, y_train)
Your data needs to be numeric and stored as NumPy arrays or SciPy sparse >>> svc.fit(X_train, y_train) Homogeneity
matrices. Other types that are convertible to numeric arrays, such as Pandas Unsupervised Learning >>> from sklearn.metrics import homogeneity_score
>>> homogeneity_score(y_true, y_pred)
DataFrame, are also acceptable. >>> k_means.fit(X_train) Fit the model to the data
>>> pca_model = pca.fit_transform(X_train) Fit to data, then transform it V-measure
>>> import numpy as np >>> from sklearn.metrics import v_measure_score
>>> X = np.random.random((10,5)) >>> v_measure_score(y_true, y_pred)
>>> y = np.array(['M','M','F','F','M','F','M','M','F','F'])
>>> X[X < 0.7] = 0 Prediction Cross-Validation
Supervised Estimators >>> from sklearn.model_selection import cross_val_score
Training And Test Data >>> y_pred = svc.predict(np.random.random((2,5))) Predict labels
>>> print(cross_val_score(knn, X_train, y_train, cv=4))
>>> print(cross_val_score(lr, X, y, cv=2))
>>> y_pred = lr.predict(X_test) Predict labels
>>> from sklearn.model_selection import train_test_split >>> y_pred = knn.predict_proba(X_test) Estimate probability of a label
>>> X_train, X_test, y_train, y_test = train_test_split(X,
y, Unsupervised Estimators Tune Your Model
random_state=0) >>> y_pred = k_means.predict(X_test) Predict labels in clustering algos Grid Search
>>> from sklearn.model_selection import GridSearchCV
Preprocessing The Data
>>> params = {"n_neighbors": np.arange(1,3),
"metric": ["euclidean", "cityblock"]}
>>> grid = GridSearchCV(estimator=knn,
Standardization Encoding Categorical Features param_grid=params)
>>> grid.fit(X_train, y_train)
>>> from sklearn.preprocessing import StandardScaler >>> from sklearn.preprocessing import LabelEncoder >>> print(grid.best_score_)
>>> scaler = StandardScaler().fit(X_train) >>> enc = LabelEncoder()
>>> print(grid.best_estimator_.n_neighbors)
>>> standardized_X = scaler.transform(X_train) >>> y = enc.fit_transform(y)
>>> standardized_X_test = scaler.transform(X_test) Randomized Parameter Optimization
Normalization Imputing Missing Values >>> from sklearn.model_selection import RandomizedSearchCV
>>> params = {"n_neighbors": range(1,5),
>>> from sklearn.preprocessing import Normalizer "weights": ["uniform", "distance"]}
>>> from sklearn.impute import SimpleImputer >>> rsearch = RandomizedSearchCV(estimator=knn,
>>> scaler = Normalizer().fit(X_train) >>> imp = SimpleImputer(missing_values=0, strategy='mean') param_distributions=params,
>>> normalized_X = scaler.transform(X_train) >>> imp.fit_transform(X_train) cv=4,
>>> normalized_X_test = scaler.transform(X_test) n_iter=8,
random_state=5)
Binarization Generating Polynomial Features >>> rsearch.fit(X_train, y_train)
>>> print(rsearch.best_score_)
>>> from sklearn.preprocessing import Binarizer >>> from sklearn.preprocessing import PolynomialFeatures
>>> binarizer = Binarizer(threshold=0.0).fit(X) >>> poly = PolynomialFeatures(5)
>>> binary_X = binarizer.transform(X) >>> poly.fit_transform(X) DataCamp
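The basic example, cross-validation and grid-search entries above fit together into one workflow. The sketch below assumes a recent scikit-learn release, where `train_test_split`, `cross_val_score` and `GridSearchCV` all live in `sklearn.model_selection`. It uses the bundled iris data so nothing is downloaded; the result variable names are illustrative.

```python
from sklearn import datasets, neighbors
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler

# Load data and keep the first two features, as in the sheet's basic example.
iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Fit, predict, and score a k-nearest-neighbours classifier.
knn = neighbors.KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)

# 4-fold cross-validation on the training split.
cv_scores = cross_val_score(knn, X_train, y_train, cv=4)

# Grid search over the same parameter grid the sheet uses.
params = {"n_neighbors": [1, 2], "metric": ["euclidean", "cityblock"]}
grid = GridSearchCV(estimator=knn, param_grid=params)
grid.fit(X_train, y_train)
best_k = grid.best_estimator_.n_neighbors
```

Fitting the scaler on `X_train` alone keeps information from the test split out of the preprocessing step.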
Python For Data Science Cheat Sheet Linear Algebra Also see NumPy
You’ll use the linalg and sparse modules. Note that scipy.linalg contains and expands on numpy.linalg.
SciPy - Linear Algebra >>> from scipy import linalg, sparse Matrix Functions
Learn More Python for Data Science Interactively at www.datacamp.com
Creating Matrices Addition
>>> np.add(A,D) Addition
>>> A = np.matrix(np.random.random((2,2)))
Subtraction
SciPy >>> B = np.asmatrix(b)
>>> C = np.mat(np.random.random((10,5))) >>> np.subtract(A,D) Subtraction
The SciPy library is one of the core packages for >>> D = np.mat([[3,4], [5,6]]) Division
>>> np.divide(A,D) Division
scientific computing that provides mathematical Basic Matrix Routines Multiplication
algorithms and convenience functions built on the Multiplication operator
Inverse >>> A @ D
NumPy extension of Python. >>> A.I Inverse (Python 3)
>>> linalg.inv(A) Inverse >>> np.multiply(D,A) Multiplication
Interacting With NumPy                         Also see NumPy
>>> import numpy as np
>>> a = np.array([1,2,3])
>>> b = np.array([(1+5j,2j,3j), (4j,5j,6j)])
>>> c = np.array([[(1.5,2,3), (4,5,6)], [(3,2,1), (4,5,6)]])

Basic Matrix Routines (continued)
Transposition
>>> A.T                 Transpose matrix
>>> A.H                 Conjugate transposition
Trace
>>> np.trace(A)         Trace

Matrix Functions: Multiplication (continued)
>>> np.dot(A,D)         Dot product
>>> np.vdot(A,D)        Vector dot product
>>> np.inner(A,D)       Inner product
>>> np.outer(A,D)       Outer product
>>> np.tensordot(A,D)   Tensor dot product
>>> np.kron(A,D)        Kronecker product
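The product functions above return results of quite different shapes. The sketch below compares them on two small `ndarray` inputs; the values of `A` are an illustrative assumption, and `D` has the same entries as in the sheet.

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])   # illustrative values
D = np.array([[3.0, 4.0], [5.0, 6.0]])

dot = np.dot(A, D)            # matrix product, same as A @ D
vdot = np.vdot(A, D)          # flattens both inputs and returns one scalar
inner = np.inner(A, D)        # sums over the last axis of each: A @ D.T
outer = np.outer(A, D)        # flattens both inputs, result is 4 x 4
tensor = np.tensordot(A, D)   # default axes=2 sums over both axes
kron = np.kron(A, D)          # 4 x 4 block matrix with blocks A[i, j] * D

for name, value in [("dot", dot), ("vdot", vdot), ("inner", inner),
                    ("outer", outer), ("tensordot", tensor), ("kron", kron)]:
    print(name, np.shape(value))
```

For real inputs, `np.vdot` and the default `np.tensordot` agree; `np.vdot` additionally conjugates its first argument when the data are complex.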
Index Tricks
>>> np.mgrid[0:5,0:5]           Create a dense meshgrid
>>> np.ogrid[0:2,0:2]           Create an open meshgrid
>>> np.r_[3,[0]*5,-1:1:10j]     Stack arrays vertically (row-wise)
>>> np.c_[b,c]                  Create stacked column-wise arrays

Basic Matrix Routines (continued)
Norm
>>> linalg.norm(A)              Frobenius norm
>>> linalg.norm(A,1)            L1 norm (max column sum)
>>> linalg.norm(A,np.inf)       L inf norm (max row sum)
Rank
>>> np.linalg.matrix_rank(C)    Matrix rank

Matrix Functions (continued)
Exponential Functions
>>> linalg.expm(A)              Matrix exponential
>>> linalg.expm2(A)             Matrix exponential (eigenvalue decomposition); removed in SciPy 1.0, use expm
>>> linalg.expm3(D)             Matrix exponential (Taylor series); removed in SciPy 1.0, use expm
Logarithm Function
>>> linalg.logm(A)              Matrix logarithm
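To make the norm definitions concrete, here is a minimal sketch with hand-checkable numbers, followed by a rank check and an `expm`/`logm` round trip. All matrices are illustrative assumptions rather than the sheet's random `A` and `C`.

```python
import numpy as np
from scipy import linalg

A = np.array([[1.0, -2.0], [3.0, 4.0]])

fro = linalg.norm(A)             # sqrt(1 + 4 + 9 + 16) = sqrt(30)
one = linalg.norm(A, 1)          # largest absolute column sum: max(4, 6)
inf = linalg.norm(A, np.inf)     # largest absolute row sum: max(3, 7)

# Second row is twice the first, so only one independent row remains.
R = np.array([[1.0, 2.0], [2.0, 4.0]])
rank = np.linalg.matrix_rank(R)

# For a matrix with small real eigenvalues, logm undoes expm.
S = np.array([[0.1, 0.2], [0.3, 0.4]])
round_trip = linalg.logm(linalg.expm(S))

print(fro, one, inf, rank)
print(np.allclose(round_trip, S))
```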
Shape Manipulation
>>> np.transpose(b)       Permute array dimensions
>>> b.flatten()           Flatten the array
>>> np.hstack((b,c))      Stack arrays horizontally (column-wise)
>>> np.vstack((a,b))      Stack arrays vertically (row-wise)
>>> np.hsplit(c,2)        Split the array horizontally into 2 equal parts
>>> np.vsplit(c,2)        Split the array vertically into 2 equal parts

Polynomials
>>> from numpy import poly1d
>>> p = poly1d([3,4,5])   Create a polynomial object

Basic Matrix Routines (continued)
Determinant
>>> linalg.det(A)         Determinant
Solving linear problems
>>> linalg.solve(A,b)     Solver for dense matrices
>>> E = np.asmatrix(a).T  Column vector built from a
>>> linalg.lstsq(F,E)     Least-squares solution to linear matrix equation (F is defined under Creating Sparse Matrices)
Generalized inverse
>>> linalg.pinv(C)        Compute the pseudo-inverse of a matrix (least-squares solver)
>>> linalg.pinv2(C)       Compute the pseudo-inverse of a matrix (SVD); removed in SciPy 1.9, use pinv

Matrix Functions (continued)
Trigonometric Functions
>>> linalg.sinm(D)        Matrix sine
>>> linalg.cosm(D)        Matrix cosine
>>> linalg.tanm(A)        Matrix tangent
Hyperbolic Trigonometric Functions
>>> linalg.sinhm(D)       Hyperbolic matrix sine
>>> linalg.coshm(D)       Hyperbolic matrix cosine
>>> linalg.tanhm(A)       Hyperbolic matrix tangent
Matrix Sign Function
>>> linalg.signm(A)       Matrix sign function
Matrix Square Root
>>> linalg.sqrtm(A)       Matrix square root
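The solver, least-squares and pseudo-inverse entries fit together as sketched below. The matrices and right-hand sides are small illustrative assumptions chosen so the answers can be checked by hand; they are not the sheet's `A`, `F`, `E` or `C`.

```python
import numpy as np
from scipy import linalg

# Square system: 3x + y = 9 and x + 2y = 8 has the solution x = 2, y = 3.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
rhs = np.array([9.0, 8.0])
x = linalg.solve(A, rhs)

# Overdetermined system: fit y = c0 + c1 * t through (0, 1), (1, 3), (2, 5).
M = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
coef, residues, rank, sing_vals = linalg.lstsq(M, y)

# The pseudo-inverse gives the same least-squares coefficients.
coef_pinv = linalg.pinv(M) @ y

# Matrix square root: root @ root reproduces A (A is symmetric positive definite).
root = linalg.sqrtm(A)

print(x)
print(coef, coef_pinv)
print(np.allclose(root @ root, A))
```

`linalg.solve` expects a square, non-singular matrix; `linalg.lstsq` and `linalg.pinv` also accept rectangular ones.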
Vectorizing Functions
>>> def myfunc(a):
        if a < 0:
            return a*2
        else:
            return a/2
>>> np.vectorize(myfunc)                Vectorize functions

Type Handling
>>> np.real(b)                          Return the real part of the array elements
>>> np.imag(b)                          Return the imaginary part of the array elements
>>> np.real_if_close(c,tol=1000)        Return a real array if complex parts close to 0
>>> np.asarray(np.pi, dtype='f')        Cast object to a data type (np.cast['f'] was removed in NumPy 2.0)

Other Useful Functions
>>> np.angle(b,deg=True)                Return the angle of the complex argument
>>> g = np.linspace(0,np.pi,num=5)      Create an array of evenly spaced values (number of samples)
>>> g[3:] += np.pi
>>> np.unwrap(g)                        Unwrap
>>> np.logspace(0,10,3)                 Create an array of evenly spaced values (log scale)
>>> np.select([c<4],[c*2])              Return values from a list of arrays depending on conditions
>>> from scipy import special           factorial and comb moved from scipy.misc to scipy.special
>>> special.factorial(a)                Factorial
>>> special.comb(10,3,exact=True)       Number of combinations of N things taken k at a time
>>> misc.central_diff_weights(3)        Weights for Np-point central derivative (scipy.misc; deprecated in SciPy 1.10 and dropped in later releases)
>>> misc.derivative(myfunc,1.0)         Find the n-th derivative of a function at a point (scipy.misc; deprecated in SciPy 1.10 and dropped in later releases)

Creating Sparse Matrices
>>> F = np.eye(3, k=1)                  Create a 3x3 matrix with ones on the first superdiagonal
>>> G = np.asmatrix(np.identity(2))     Create a 2x2 identity matrix
>>> C[C > 0.5] = 0
>>> H = sparse.csr_matrix(C)            Compressed Sparse Row matrix
>>> I = sparse.csc_matrix(D)            Compressed Sparse Column matrix
>>> J = sparse.dok_matrix(A)            Dictionary Of Keys matrix
>>> H.todense()                         Sparse matrix to full matrix
>>> sparse.isspmatrix_csc(A)            Identify sparse matrix

Sparse Matrix Routines
Inverse
>>> sparse.linalg.inv(I)                Inverse
Norm
>>> sparse.linalg.norm(I)               Norm
Solving linear problems
>>> sparse.linalg.spsolve(H,I)          Solver for sparse matrices

Sparse Matrix Functions
>>> sparse.linalg.expm(I)               Sparse matrix exponential

Matrix Functions (continued)
Arbitrary Functions
>>> linalg.funm(A, lambda x: x*x)       Evaluate matrix function

Decompositions
Eigenvalues and Eigenvectors
>>> la, v = linalg.eig(A)               Solve ordinary or generalized eigenvalue problem for square matrix
>>> l1, l2 = la                         Unpack eigenvalues
>>> v[:,0]                              First eigenvector
>>> v[:,1]                              Second eigenvector
>>> linalg.eigvals(A)                   Compute eigenvalues only
Singular Value Decomposition
>>> U,s,Vh = linalg.svd(B)              Singular Value Decomposition (SVD)
>>> M,N = B.shape
>>> Sig = linalg.diagsvd(s,M,N)         Construct sigma matrix in SVD
LU Decomposition
>>> P,L,U = linalg.lu(C)                LU Decomposition

Sparse Matrix Decompositions
>>> la, v = sparse.linalg.eigs(F,1)     Eigenvalues and eigenvectors
>>> sparse.linalg.svds(H, 2)            SVD

Asking For Help
>>> help(scipy.linalg.diagsvd)
>>> np.info(np.matrix)
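Each decomposition above can be sanity-checked by multiplying its factors back together. The sketch below does that for `eig`, `svd` and `lu`, and closes with a small sparse solve. Every matrix here is an illustrative assumption (the sheet's `A`, `B`, `C` and `H` are random or derived from earlier snippets), and `Umat` is a hypothetical name used so the LU factor does not overwrite the SVD's `U`.

```python
import numpy as np
from scipy import linalg, sparse
from scipy.sparse.linalg import spsolve

# Eigen-decomposition: each column v[:, j] satisfies A v_j = la[j] * v_j.
A = np.array([[2.0, 1.0], [1.0, 2.0]])   # eigenvalues 1 and 3
la, v = linalg.eig(A)
eig_ok = np.allclose(A @ v, v * la)

# SVD: rebuild B from U, the rectangular sigma matrix, and Vh.
B = np.arange(6.0).reshape(2, 3)
U, s, Vh = linalg.svd(B)
M, N = B.shape
Sig = linalg.diagsvd(s, M, N)
svd_ok = np.allclose(U @ Sig @ Vh, B)

# LU with partial pivoting: C = P @ L @ Umat.
C = np.array([[4.0, 3.0], [6.0, 3.0]])
P, L, Umat = linalg.lu(C)
lu_ok = np.allclose(P @ L @ Umat, C)

# Sparse solve: a diagonal system stored in Compressed Sparse Row format.
H = sparse.csr_matrix(np.array([[4.0, 0.0], [0.0, 2.0]]))
x = spsolve(H, np.array([8.0, 2.0]))

print(eig_ok, svd_ok, lu_ok)
print(x)
```

`linalg.eig` returns the eigenvalues as a complex array even when they are real, so taking `la.real` is a common follow-up for symmetric inputs.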
