Python Datasci Slides

Teaching Data Science using Python II

The Python Data Science Ecosystem


Teaching Data Science using Python

Sandbox option: Berkeley's Data 8 course


• Uses the datascience package

Real-world option: uses a set of Python packages:
• Standard Python libraries
• NumPy
• pandas
• Matplotlib
• Also: seaborn, statsmodels, scikit-learn

Python Data Science packages

We will give a basic overview of some of the main Python data science packages

We will redo the avocado analyses using some of these packages


NumPy is a library that adds support for large, multi-dimensional arrays and
matrices, along with a large collection of high-level mathematical functions
to operate on these arrays.
• i.e., it is similar to MATLAB

The core data structure of NumPy is its "ndarray".

Ndarrays are similar to Python lists, except that all elements in an ndarray
must be of the same type
• E.g., all elements are numbers, or all elements are strings, etc.
import numpy as np

x = np.array([1, 2, 3])
2 * x

# the numbers 0 to 9
x = np.arange(10)

# 3 x 3 matrix
M = np.array([[1, 2, 3], [3, 4, 6.7], [5, 9.0, 5]])

SciPy contains modules for optimization, linear algebra, integration,
interpolation, FFT, signal and image processing, etc.
• Uses ndarrays as main data structure

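The single-type rule and the element-wise arithmetic described above can be seen in a small sketch. It is only illustrative: the variable names (mixed, doubled, repeated) are made up here and are not part of the slides.

```python
import numpy as np

# Mixing ints and a float: NumPy stores every element with one common dtype.
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)

# Arithmetic on an ndarray is applied element by element...
x = np.array([1, 2, 3])
doubled = 2 * x

# ...whereas multiplying a plain Python list repeats the list.
repeated = 2 * [1, 2, 3]
print(doubled, repeated)

# A 2-D array (matrix) reports its dimensions through .shape.
M = np.array([[1, 2, 3], [3, 4, 6.7], [5, 9.0, 5]])
print(M.shape, M.dtype)
```

Because M contains 6.7 and 9.0, all of its entries, including the integers, are stored as floats.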
pandas is a library for data manipulation and analysis that has two main
data structures:

1. Series: One-dimensional ndarray with an index for each value
• Similar to a named vector in R

2. DataFrame: Two-dimensional, size-mutable, potentially heterogeneous tabular data.
• Similar to an R data frame
• (or multiple Series of the same length with the same index)
import pandas as pd
avocado = pd.read_csv("avocado.csv")
avocado.head(3) # show the first 3 rows

avocado["AveragePrice"] # returns a series

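# Get the average value of all numerical columns,
# separately for each type of avocado
# (numeric_only=True skips text columns, which pandas 2.x would otherwise reject)
avocado.groupby("type").mean(numeric_only=True).reset_index()
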
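For readers without avocado.csv at hand, the same Series/DataFrame ideas can be tried on a tiny made-up table. This is a sketch under assumptions: the column names mimic the avocado data, but the values and the names toy, prices and by_type are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data; values are made up, not taken from avocado.csv.
toy = pd.DataFrame({
    "type": ["conventional", "organic", "conventional", "organic"],
    "region": ["Albany", "Albany", "Boston", "Boston"],
    "AveragePrice": [1.0, 1.5, 1.2, 1.9],
})

# Selecting one column of a DataFrame returns a Series.
prices = toy["AveragePrice"]

# Average the numeric columns separately for each avocado type.
by_type = (
    toy.groupby("type")
       .mean(numeric_only=True)   # ignore the text column "region"
       .reset_index()             # turn the "type" index back into a column
)
print(by_type)
```

The reset_index() call is what makes "type" an ordinary column again, so the result looks like a regular table rather than one indexed by group.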
Matplotlib is a plotting library. Each plot has a figure and a number of different
subplots (axes).
• somewhat similar to base R graphics

It has two interfaces for plotting:

1. A "pylab"-style procedural interface (matplotlib.pyplot) based on a state machine that closely resembles MATLAB
• Updates are made to the most recent axis plotted on

2. An object-oriented API
• Updates are made to the axis that is selected

The object-oriented interface is preferred (not a big difference)


import matplotlib.pyplot as plt

# pylab interface (like MATLAB)
plt.plot([1,3,10]);

# object oriented interface
fig, ax = plt.subplots()
ax.plot([1,3,10]);

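The difference between the two interfaces shows up most clearly with more than one subplot. The sketch below is illustrative only; the names ax_left and ax_right are made up, and the Agg backend is selected just so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for running headless
import matplotlib.pyplot as plt

fig, (ax_left, ax_right) = plt.subplots(1, 2)

# Procedural interface: plt.plot draws on the "current" axes, which after
# plt.subplots is the most recently created one (the right-hand subplot).
plt.plot([1, 3, 10])

# Object-oriented interface: the target axes is named explicitly.
ax_left.plot([10, 3, 1])
ax_left.set_title("chosen explicitly")

print(len(ax_left.lines), len(ax_right.lines))
```

Naming the axes explicitly avoids having to keep track of which subplot is current, which is one reason the object-oriented style tends to scale better to multi-panel figures.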
seaborn is a visualization library built on Matplotlib that provides a
higher-level interface that works with pandas DataFrames
• somewhat similar to ggplot

Figure level plots

There are "axes-level" functions that plot on a single axis and "figure-level"
functions that plot across multiple axes

Figure-level plots are grouped based on the types of variables being plotted
• E.g., a single quantitative variable, two quantitative variables, etc.

import seaborn as sns
penguins = sns.load_dataset("penguins")

# figure-level plot
sns.displot(data=penguins,
            x="flipper_length_mm",
            hue="species",
            multiple="stack",
            kind="kde");

Translation between Tables and DataFrames

Translation between datascience Tables and pandas DataFrames

Translation between datascience Tables and babypandas DataFrames


Let’s try it ourselves!
