Seminar Report
ON
PYTHON LIBRARIES FOR DATA SCIENCE
1. Introduction
Python features a dynamic type system and automatic memory management. It supports multiple programming paradigms, including object-oriented, imperative, functional, and procedural, and has a large and comprehensive standard library.
Python interpreters are available for many operating systems. CPython, the reference
implementation of Python, is open source software and has a community-based
development model, as do nearly all of Python's other implementations. Python and
CPython are managed by the non-profit Python Software Foundation.
Python has a simple, easy-to-learn syntax that emphasizes readability and therefore reduces the cost of program maintenance.
Also, Python supports modules and packages, which encourages program modularity and
code reuse.
1.1 Advantages of Using Python
The diverse application of the Python language is a result of the combination of features
which give this language an edge over others. Some of the benefits of programming in
Python include:
The Python Package Index (PyPI) contains numerous third-party modules that make Python capable of interacting with most of the other languages and platforms.
Python provides a large standard library which includes areas like internet protocols, string operations, web services tools and operating system interfaces. Many frequently used programming tasks have already been scripted into the standard library, which significantly reduces the amount of code that must be written.
The Python language is developed under an OSI-approved open source license, which makes it free to use and distribute, including for commercial purposes. Further, its development is driven by the community, which collaborates on its code through conferences and mailing lists and contributes its numerous modules.
Python offers excellent readability and an uncluttered, simple-to-learn syntax, which helps beginners utilize this programming language. The code style guidelines, PEP 8, provide a set of rules to facilitate the formatting of code. Additionally, the wide base of users and active developers has resulted in a rich internet resource bank to encourage development and the continued adoption of the language.
User-friendly data structures: Python has built-in list and dictionary data structures which can be used to construct fast runtime data structures. Further, Python also provides the option of dynamic high-level data typing, which reduces the length of support code that is needed, as the short sketch below illustrates.
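For instance, the built-in dict and list types can model a small runtime lookup structure with no declarations or helper classes (the names and values below are made up for the example):

# Built-in containers: a dict mapping names to lists of marks
scores = {"alice": [88, 92], "bob": [75]}
scores.setdefault("carol", []).append(90)    # grow the structure at runtime
best = {name: max(marks) for name, marks in scores.items()}
print(best)                                  # {'alice': 92, 'bob': 75, 'carol': 90}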
Python has a clean object-oriented design, provides enhanced process control capabilities, and possesses strong integration and text-processing capabilities as well as its own unit testing framework, all of which contribute to its speed and productivity. Python is considered a viable option for building complex multi-protocol network applications.
2. DATA SCIENCE:-
“Data science” is just about as broad a term as they come. It may be easiest to describe what it is by listing its more concrete components:
1) Data manipulation and analysis:- Included here: Pandas; NumPy; SciPy; a helping hand from Python’s Standard Library.
2) Data visualization:- A pretty self-explanatory name. Taking data and turning it into something colorful.
3) Machine learning:- Teaching systems to learn patterns from data rather than following hand-written rules; in the Python ecosystem this is chiefly the territory of Scikit-Learn.
4) Deep learning:- This is a subset of machine learning that is seeing a renaissance, and
is commonly implemented with Keras, among other libraries. It has seen monumental
improvements over the last ~5 years, such as AlexNet in 2012, which was the first
design to incorporate consecutive convolutional layers.
5) Data storage and big data frameworks:- Big data is best defined as data that is
either literally too large to reside on a single machine, or can’t be processed in the
absence of a distributed environment. The Python bindings to Apache technologies
play heavily here.
3.1 NumPy
NumPy is the fundamental package for scientific computing with Python. It contains among other things:
a powerful N-dimensional array object
sophisticated (broadcasting) functions
tools for integrating C/C++ and Fortran code
useful linear algebra, Fourier transform, and random number capabilities
Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional
container of generic data. Arbitrary data-types can be defined. This allows NumPy to
seamlessly and speedily integrate with a wide variety of databases.
NumPy is licensed under the BSD license, enabling reuse with few restrictions. The core functionality of NumPy is its ndarray (n-dimensional array) data structure. These arrays are strided views on memory. In contrast to Python's built-in list data structure (which, despite the name, is a dynamic array), these arrays are homogeneously typed: all elements of a single array must be of the same type. NumPy also has built-in support for memory-mapped arrays. Commonly used array-creation routines include:
1. zeros(shape[, dtype, order]) - Return a new array of given shape and type, filled with zeros.
2. array(object[, dtype, copy, order, subok, ndmin]) - Create an array.
3. asarray(a[, dtype, order]) - Convert the input to an array.
4. asanyarray(a[, dtype, order]) - Convert the input to an ndarray, but pass ndarray subclasses through.
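As a quick illustration of these four routines (a minimal sketch; the array values and the use of np.matrix as an example subclass are arbitrary choices):

import numpy as np

z = np.zeros((2, 3))                     # 2x3 array of zeros
a = np.array([[1, 2, 3], [4, 5, 6]])     # array built from a nested list
b = np.asarray(a)                        # no copy made: 'a' is already an ndarray
m = np.matrix([[1, 2], [3, 4]])          # np.matrix is an ndarray subclass
print(type(np.asarray(m)))               # <class 'numpy.ndarray'> - subclass converted
print(type(np.asanyarray(m)))            # <class 'numpy.matrix'> - subclass passed through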
NumPy will help you to manage multi-dimensional arrays very efficiently. Maybe you won’t do that directly, but since the concept is a crucial part of data science, many other libraries (well, almost all of them) are built on NumPy. Simply put: without NumPy you won’t be able to use Pandas, Matplotlib, SciPy or Scikit-Learn. That’s why you need it first.
3.2 Pandas
The name Pandas is derived from “panel data”, an econometrics term for multidimensional data sets. In 2008, developer Wes McKinney started developing Pandas when in need of a high-performance, flexible tool for data analysis.
Prior to Pandas, Python was mainly used for data munging and preparation; it had very little to offer for data analysis. Pandas solved this problem. Using Pandas, we can accomplish five typical steps in the processing and analysis of data, regardless of the origin of the data: load, prepare, manipulate, model, and analyze.
Python with Pandas is used in a wide range of academic and commercial domains, including finance, economics, statistics, analytics, etc. The key features of Pandas include (a short sketch follows the list):
Fast and efficient DataFrame object with default and customized indexing.
Tools for loading data into in-memory data objects from different file formats.
Data alignment and integrated handling of missing data.
Reshaping and pivoting of data sets.
Label-based slicing, indexing and subsetting of large data sets.
Columns from a data structure can be deleted or inserted.
Group by data for aggregation and transformations.
High-performance merging and joining of data.
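As a hedged illustration of several of these features (the column names and values below are made up for the example):

import pandas as pd
import numpy as np

# The "load" step simulated with an in-memory DataFrame
df = pd.DataFrame({
    "city": ["Pune", "Mumbai", "Pune", "Delhi"],
    "sales": [100, 250, np.nan, 300],
})
print(df.isna().sum())                         # integrated missing-data handling
df["sales"] = df["sales"].fillna(0)            # prepare: fill in missing values
print(df.loc[df["city"] == "Pune", "sales"])   # label-based subsetting
print(df.groupby("city")["sales"].sum())       # group by for aggregation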
3.3 Matplotlib
For simple plotting, the pyplot module provides a MATLAB-like interface, particularly when combined with IPython. For the power user, you have full control of line styles, font properties, axes properties, etc., via an object-oriented interface or via a set of functions familiar to MATLAB users.
The best and most well-known Python data visualization library is Matplotlib. I wouldn’t say
it’s easy to use… But usually if you save for yourself the 4 or 5 most commonly used code
blocks for basic line charts and scatter plots, you can create your charts pretty fast.
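A minimal sketch of two such commonly reused blocks, a basic line chart and a scatter plot (the plotted data is arbitrary):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x), label="sin(x)")            # basic line chart
plt.scatter(x[::10], np.sin(x[::10]),
            color="red", label="sampled points")  # scatter plot
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()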
3.4 SciPy
SciPy is a machine learning library for application developers and engineers. However, you
still need to know the difference between SciPy library and SciPy stack. SciPy library
contains modules for optimization, linear algebra, integration, and statistics.
3.4.1 Features of SciPy:-
The main feature of the SciPy library is that it is developed on top of NumPy and makes heavy use of NumPy arrays. In addition, SciPy provides efficient numerical routines such as optimization, numerical integration, and many others through its specific submodules.
SciPy is a library that uses NumPy to solve mathematical and scientific problems. It uses NumPy arrays as the basic data structure and comes with modules for various commonly used tasks in scientific programming. Tasks including linear algebra, integration (calculus), ordinary differential equation solving, and signal processing are handled easily by SciPy.
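A minimal sketch of two such tasks, assuming only SciPy and NumPy are installed:

import numpy as np
from scipy import integrate, optimize

# Numerical integration: integral of sin(x) from 0 to pi (exact answer is 2)
value, abserr = integrate.quad(np.sin, 0, np.pi)
print(value)

# Optimization: minimize (x - 3)^2, whose minimum is at x = 3
result = optimize.minimize(lambda x: (x - 3.0) ** 2, x0=0.0)
print(result.x)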
3.5 Scikit-Learn
Scikit-learn (formerly scikits.learn) is a free software machine learning library for the Python
programming language. It features various classification, regression and clustering algorithms
including support vector machines, random forests, gradient boosting, k-means and
DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries
NumPy and SciPy.
The scikit-learn project started as scikits.learn, a Google Summer of Code project by David
Cournapeau. Its name stems from the notion that it is a "SciKit" (SciPy Toolkit), a separately-
developed and distributed third-party extension to SciPy. The original codebase was later
rewritten by other developers. In 2010, Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort and Vincent Michel, all from INRIA, took leadership of the project and made the first public release on February 1st, 2010. Of the various scikits, scikit-learn as well as
scikit-image were described as "well-maintained and popular" in November 2012.
As of 2018, scikit-learn is under active development.
Scikit-learn is largely written in Python, with some core algorithms written in Cython to
achieve performance. Support vector machines are implemented by a Cython wrapper around
LIBSVM; logistic regression and linear support vector machines by a similar wrapper around
LIBLINEAR. [10]
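A minimal sketch of the typical fit/predict workflow, using the SVC estimator that wraps LIBSVM (the dataset and parameters are arbitrary choices for the example):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load a bundled toy dataset and hold out a test split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf", gamma="scale")   # support vector classifier (LIBSVM-backed)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))         # accuracy on the held-out data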
3.6 Keras
Primary Intent: Developing and training deep learning models, deep learning research
Considered to be one of the coolest machine learning Python libraries, Keras offers an easier
mechanism for expressing neural networks. It also features great utilities for compiling
models, processing datasets, visualizing graphs, and much more.
Written in Python, Keras has the ability to run on top of CNTK, TensorFlow, and Theano.
The Python machine learning library is developed with a primary focus on allowing fast
experimentation. All Keras models are portable.
Compared to other Python machine learning libraries, Keras is relatively slow. This is because it first creates a computational graph using the backend infrastructure and then uses it to perform operations. Nevertheless, Keras is very expressive and flexible for doing innovative research.
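A minimal sketch of the Keras Sequential API (the layer sizes and the 10-feature input shape are arbitrary, and a TensorFlow backend is assumed):

from keras.models import Sequential
from keras.layers import Dense

# A tiny fully connected network for binary classification
model = Sequential([
    Dense(32, activation="relu", input_shape=(10,)),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()   # prints the compiled architecture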
3.7 Seaborn
Basically a data visualization library for Python, Seaborn is built on top of the Matplotlib
library. Also, it is closely integrated with Pandas data structures. The Python data
visualization library offers a high-level interface for drawing attractive as well as informative
statistical graphs.
The main aim of Seaborn is to make visualization a vital part of exploring and understanding
data. Its dataset-oriented plotting functions operate on arrays and data-frames containing
whole datasets. The library is ideal for examining relationships among multiple variables.
Seaborn internally performs all the important semantic mapping and statistical aggregation
for producing informative plots. The Python data visualization library also has tools for
choosing among color palettes that aid in revealing patterns in a dataset.
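A minimal sketch of the dataset-oriented interface, using the small “tips” example dataset that ships with Seaborn:

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")   # a Pandas DataFrame bundled with Seaborn
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="time")
plt.show()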
3.8 TensorFlow
Anybody involved in machine learning projects using Python must have, at least, heard of
TensorFlow. Developed by Google, it is an open source symbolic math library for numerical
computation using data flow graphs.
The mathematical operations in a typical TensorFlow data flow graph are represented by the
graph nodes. The graph edges, on the other hand, represent the multidimensional data arrays,
a.k.a. tensors, that flow between the graph nodes.
Widely used Google products like Google Photos and Google Voice Search are built using TensorFlow. The library has a complicated front end for Python: the Python code gets compiled and then executed on the TensorFlow distributed execution engine.
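A minimal sketch of such a graph, assuming the TensorFlow 1.x graph API that this description refers to (TensorFlow 2.x executes eagerly by default):

import tensorflow as tf

# Graph nodes are operations; the edges carry tensors
a = tf.constant([[1.0, 2.0]])      # 1x2 tensor
b = tf.constant([[3.0], [4.0]])    # 2x1 tensor
product = tf.matmul(a, b)          # a node in the data flow graph

# In TF 1.x the graph is handed to the execution engine via a Session
with tf.Session() as sess:
    print(sess.run(product))       # [[11.]]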
3.8.1 Highlights:
Allows training multiple neural networks on multiple GPUs, making models very efficient for large-scale systems
Easily trainable on CPU and GPU for distributed computing
Flexibility in its operability, meaning TensorFlow offers the option of taking out the parts that you want and leaving the parts that you don’t
Great level of community and developer support
Unlike other data science Python libraries, TensorFlow simplifies the process of visualizing each and every part of the graph