Data Science using Python: Introduction
What is Pandas?
Pandas is an open-source Python library used for high-performance data manipulation and
data analysis, built around its powerful data structures. Python with Pandas is used in a variety of
academic and commercial domains, including Finance, Economics, Statistics, Advertising,
Web Analytics, and more.
Using Pandas, we can accomplish five typical steps in the processing and analysis of data,
regardless of the origin of the data: load, organize, manipulate, model, and analyse.
Key Features of Pandas
Fast and efficient DataFrame object with default and customized indexing.
Tools for loading data into in-memory data objects from different file formats.
Data alignment and integrated handling of missing data.
Reshaping and pivoting of data sets.
Label-based slicing, indexing and subsetting of large data sets.
Columns from a data structure can be deleted or inserted.
Group by data for aggregation and transformations.
High performance merging and joining of data.
Time Series functionality.
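Several of the features listed above can be sketched in a few lines. This is a minimal illustration, not a reference workflow; the table contents, column names, and index labels are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical sales records; the column names and values are invented.
sales = pd.DataFrame(
    {
        "region": ["North", "South", "North", "South"],
        "units": [10, 4, np.nan, 6],   # one missing value
        "price": [2.5, 3.0, 2.5, 3.0],
    },
    index=["a", "b", "c", "d"],        # customized index labels
)

# Integrated handling of missing data: replace the gap with 0.
sales["units"] = sales["units"].fillna(0)

# Insert a new column, then delete an existing one.
sales["revenue"] = sales["units"] * sales["price"]
sales = sales.drop(columns="price")

# Label-based slicing: rows "a" through "c" (inclusive), one column.
subset = sales.loc["a":"c", "revenue"]

# Group by a key and aggregate.
totals = sales.groupby("region")["revenue"].sum()

# Merge with a second table on a shared key.
managers = pd.DataFrame({"region": ["North", "South"], "manager": ["Ann", "Bob"]})
merged = sales.merge(managers, on="region")

print(totals)
print(merged)
```

Note that label-based slices with loc include both end labels, unlike ordinary Python slices.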
Pandas deals with the following three data structures −
Series
DataFrame
Panel (deprecated in pandas 0.20 and removed in pandas 0.25)
These data structures are built on top of NumPy arrays, making them fast and efficient.
Dimension & Description
The best way to think of these data structures is that the higher dimensional data structure is a
container of its lower dimensional data structure. For example, DataFrame is a container of
Series, Panel is a container of DataFrame.
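Data Structure    Dimensions    Description
Series            1             One-dimensional labeled array holding a single data type.
DataFrame         2             Two-dimensional labeled table whose columns can have different types.
Panel             3             Three-dimensional labeled container of DataFrames (removed in pandas 0.25).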
For example, a DataFrame might hold columns of the following types:
Column    Type
Name      String
Age       Integer
Gender    String
Rating    Float
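The container relationship can be sketched with a small DataFrame that follows the Name/Age/Gender/Rating layout above. The records themselves are made up for illustration.

```python
import pandas as pd

# Invented records that follow the Name/Age/Gender/Rating layout.
students = pd.DataFrame(
    {
        "Name": ["Steve", "Lia", "Vin"],
        "Age": [32, 28, 45],
        "Gender": ["Male", "Female", "Male"],
        "Rating": [3.45, 4.6, 3.9],
    }
)

# Selecting one column of a DataFrame returns a Series.
ages = students["Age"]
print(type(ages).__name__)   # Series

# Each column carries its own data type.
print(students.dtypes)
```

Because every column is a Series, Series methods such as ages.mean() apply directly to a selected column.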
What is NumPy?
NumPy, short for 'Numerical Python', is a Python package that provides multidimensional
array objects and a collection of routines for processing those arrays.
Operations using NumPy
Using NumPy, a developer can perform the following operations −
Mathematical and logical operations on arrays.
Fourier transforms and routines for shape manipulation.
Operations related to linear algebra. NumPy has built-in functions for linear
algebra and random number generation.
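The operations listed above might look like this in practice. The arrays are arbitrary sample values chosen so the results are easy to check by hand.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])

# Mathematical and logical operations work element by element.
squared = a ** 2                 # [1, 4, 9, 16]
mask = (a > 1) & (a < 4)         # [False, True, True, False]

# Shape manipulation: view the same four values as a 2x2 matrix.
m = a.reshape(2, 2)

# Fourier transform of a constant signal: all energy lands in bin 0.
spectrum = np.fft.fft(np.ones(4))

# Linear algebra: solve m @ x = b for x.
b = np.array([5.0, 11.0])
x = np.linalg.solve(m, b)        # [1, 2]

# Random number generation with a seeded generator for repeatability.
rng = np.random.default_rng(seed=0)
samples = rng.random(3)          # three floats in [0, 1)
```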
NumPy – A Replacement for MATLAB
NumPy is often used along with packages like SciPy (Scientific Python)
and Matplotlib (plotting library). This combination is widely used as a replacement for
MATLAB, a popular platform for technical computing. Compared with MATLAB's own language,
however, Python is now seen as a more modern and complete programming language.
NumPy is also open source, which is an added advantage.
ndarray Object
The most important object defined in NumPy is an N-dimensional array type called ndarray.
It describes a collection of items of the same type. Items in the collection can be accessed
using a zero-based index. Every item in an ndarray occupies a block of memory of the same
size. The type of each element is described by a data-type object (called dtype). Any single
item extracted from an ndarray (by indexing) is represented by a Python object of one of the
array scalar types.
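These properties can be inspected directly on an array. The sketch below uses an arbitrary 2x3 array with an explicitly chosen int32 dtype so that the element size is predictable.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int32)

print(arr.ndim)      # 2 dimensions
print(arr.shape)     # (2, 3)
print(arr.dtype)     # int32
print(arr.itemsize)  # 4 bytes for every element

# Zero-based indexing: row 0, column 2.
item = arr[0, 2]
print(type(item).__name__)   # int32, a NumPy array scalar type

# Slicing returns another ndarray rather than a scalar.
row = arr[1, :]
```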
What is SciPy?
The SciPy library of Python is built to work with NumPy arrays and provides many
user-friendly and efficient numerical routines, such as routines for numerical integration and
optimization. Together, they run on all popular operating systems, are quick to install and are
free of charge. NumPy and SciPy are easy to use, yet powerful enough to be relied upon by some
of the world's leading scientists and engineers.
SciPy Sub-packages
SciPy is organized into sub-packages covering different scientific computing domains. These
are summarized in the following table −
Sub-package          Description
scipy.constants      Physical and mathematical constants
scipy.interpolate    Interpolation
scipy.optimize       Optimization
scipy.stats          Statistics
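One call from each of the sub-packages listed above gives a feel for how they are used. This is a small sketch with made-up sample points and a toy objective function; the specific routines shown are only one choice among many in each sub-package.

```python
import numpy as np
from scipy import constants, interpolate, optimize, stats

# scipy.constants: physical and mathematical constants.
c = constants.speed_of_light          # metres per second

# scipy.interpolate: fit a cubic spline through known points.
xs = np.array([0.0, 1.0, 2.0])
ys = np.array([0.0, 10.0, 20.0])
spline = interpolate.CubicSpline(xs, ys)
y_mid = float(spline(1.5))            # value between the second and third points

# scipy.optimize: minimize a one-variable function (toy objective).
result = optimize.minimize_scalar(lambda x: (x - 2.0) ** 2)

# scipy.stats: cumulative probability of the standard normal at 0.
p = stats.norm.cdf(0.0)

print(c, y_mid, result.x, p)
```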
Data Structure
The basic data structure used by SciPy is a multidimensional array provided by the NumPy
module. NumPy provides some functions for Linear Algebra, Fourier Transforms and
Random Number Generation, but not with the generality of the equivalent functions in SciPy.
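The overlap between the two libraries can be illustrated with linear algebra. In this sketch, both libraries solve the same small system on a NumPy array, and scipy.linalg additionally supplies the matrix exponential, which numpy.linalg does not provide. The matrix values are arbitrary.

```python
import numpy as np
from scipy import linalg

a = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])

# Both libraries solve the linear system a @ x = b on the same NumPy array.
x_np = np.linalg.solve(a, b)
x_sp = linalg.solve(a, b)

# scipy.linalg also offers routines that numpy.linalg lacks,
# such as the matrix exponential.
identity = linalg.expm(np.zeros((2, 2)))   # exp of the zero matrix

print(x_np, x_sp)
print(identity)
```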
What is Matplotlib?
Matplotlib is a Python library used to create 2D graphs and plots from Python scripts. Its
pyplot module makes plotting easy by providing features to control line styles, font
properties, axis formatting, and so on. It supports a wide variety of graphs and plots,
including histograms, bar charts, power spectra, and error charts. Used along with NumPy,
it provides an effective open-source alternative to MATLAB. It can also be used with
graphics toolkits such as PyQt and wxPython.
Conventionally, the package is imported into the Python script by adding the following
statement −
from matplotlib import pyplot as plt
Matplotlib Example
The following script produces the sine wave plot using matplotlib.
Example
import numpy as np
import matplotlib.pyplot as plt