Guide Python Data Science
INTRODUCTION: THE BIG DATA DILEMMA

In recent years, the amount of data available to companies has skyrocketed. According to IBM, 2.5 billion gigabytes (GB) of data are created every day.1 With this massive influx come new opportunities for companies to deliver better customer experiences and get an edge on their competition. To this end, enterprises have been investing in big data platforms such as Hadoop, Spark and NoSQL databases. The greater challenge, though, is not only collecting and storing data, but deriving meaningful insights from it and operationalizing those insights to create business value.

Behind these efforts are the programming languages used by data science teams to clean up and prepare data, write and test algorithms, build statistical models, and translate models into consumable applications or visualizations. In this regard, Python stands out as the language best suited for all areas of the data science and machine learning workflow.

Designed as a flexible, general-purpose language, Python is widely used by programmers and easily learned by statisticians. Its extensive libraries make it a powerful tool for statistical analysis, and it is routinely used to integrate models into web applications and production databases. Beyond conventional data analysis, Python is the leading language for machine learning, changing how businesses operate in every industry.

This guide provides a summary of Python's attributes in each of these areas, as well as considerations for implementing Python to drive new insights and innovation from big data.

In practice, data science teams use a combination of languages to play to the strengths of each one, with Python and R used in varying degrees. Below is a brief comparison table highlighting each language in the context of data science.
1. Matthew Wall, "Big Data: Are you ready for blast-off?", BBC News: http://www.bbc.com/news/business-26383058
UNLOCKING THE POWER OF
DATA SCIENCE & MACHINE
LEARNING WITH PYTHON
|  | Python | R | Java | Go |
| --- | --- | --- | --- | --- |
| Core Strengths | Easy to use, multi-purpose, large community | Made for statistics, large community | High performance, wide enterprise adoption | Modern architecture; clean, reliable code; lightweight; fast |
| Ideal Use Cases | Data analysis, visualization, exploratory analysis, data engineering, rapid prototyping, machine learning, web applications, workflow integration | Data analysis, visualization, exploratory analysis | Production systems | Production systems; data analysis, data engineering and machine learning (emerging); web applications; microservices |
| Types of Users | Data scientists, data engineers, machine learning engineers, web app developers | Data scientists, data engineers | Systems developers | Data scientists, data engineers and machine learning engineers (emerging); web app developers; systems developers |
| Learning Curve | Accessible, easy to learn | Easy for statisticians; steep but learnable for programmers | Difficult; only for professional programmers | Steep; can be learned by programmers |
| Data Science Ecosystem | Robust, growing (PyPI) | Robust, mature (CRAN) | Lacks coverage | Early, growing |
| Deployment | Many deployment tools/integrations | Difficult; requires compiler | Simple; Java Virtual Machine (JVM) ubiquity | Simple (compiled executable) |
R

Developed by and for statisticians, R is best suited for exploratory data analysis and visualization. Statisticians can use R to express their thoughts and ideas naturally without having a programming background. R is supported by a large and active community, and the CRAN repository contains thousands of packages and readily usable tests to perform almost any type of data analysis.

Because R is designed mainly for standalone computing, however, it is slower than Python and other languages, and is limited to working with datasets small enough to fit into memory. It also has a steep learning curve for those who are not trained statisticians. Data scientists will often use R for desktop prototyping and then use a more flexible language like Python or Java to deploy to production.

Java

Known for its performance and scalability, Java is often the preferred choice for enterprise infrastructure. It has a vast ecosystem and developer base, owing to its widespread enterprise adoption. As a compiled language, it is generally faster than interpreted languages, but for data science tasks it is often slower than Python, whose specialized libraries are optimized for performance.

Compared to Python and R, Java is the least suited for statistical analysis and visualization. Although there are packages that add some of these functions, they are not as well supported as those available for Python or R. Java's highly object-oriented language structure also makes it extremely difficult to learn for non-professional programmers.

Scala, which runs on the JVM, is increasingly being used for machine learning and building high-level algorithms. However, like Java, it is not easily accessible to most data scientists due to its programmer-focused structure and lack of supporting libraries for data analysis.

Go

Recently, Go has emerged as an alternative to Python and R as a solution to issues around deployment and maintenance of data science code in production. This is because Go promotes more efficient, better-quality, error-free code that is easily integrated into a company's existing architecture.2 Popular tools like Jupyter Notebooks have also now been extended to support Go.

The main drawback to using Go for data science right now is that its ecosystem is underdeveloped compared to Python and R, missing essential tools for arrays and visualization. As a relatively new language, Go is quickly gaining traction for microservices and web applications. Keep your eye on this language for its potential productivity gains in data science.

DATA ANALYSIS WITH PYTHON

Aside from its flexibility and ease of use, Python's extensive libraries make it a powerful tool for data preparation, analysis and visualization compared to other languages. Along with the foundational packages NumPy (multi-dimensional arrays), SciPy (numerical algorithms) and Matplotlib (plotting and visualization), the following libraries give Python enhanced productivity and integration with big data sources.
2. Daniel Whitenack, "Data Science Gophers", O'Reilly: https://www.oreilly.com/ideas/data-science-gophers
Complementing Your R Workforce With Python: rpy2

Python and R are often used together for their complementary capabilities. Data scientists may use R for its statistical functions and then wrap their model in a Python application that has a variety of additional features. Or they may use Python for analysis and call specialized packages only found in R. rpy2 provides an interface for accessing R within Python and is helpful for these use cases.

Those familiar with R can use rpy2 to learn some Python while thinking about their problem in R terms, and then express it in Python easily. Those who are not familiar with either can use rpy2 to learn R and Python at the same time, gaining the power of both languages.3

However, since rpy2 calls the R libraries underneath, it is limited to working with data that fits into desktop or server memory. Analyzing larger datasets requires frameworks such as Hadoop and HDF5, which can bring data in as you need it and push it out when you don't.

Connecting to Big Data Platforms

Python has libraries to connect to the various types of databases most organizations now use. This includes traditional SQL databases (e.g. MySQL, Microsoft SQL Server, Oracle), NoSQL databases (e.g. MongoDB, Cassandra, Redis), file systems (e.g. Hadoop) and streaming data (e.g. Kafka).

Aside from stored data, one of the big challenges companies face is how to deal with huge amounts of streaming data coming in from sources such as sensors, web feeds and market transactions. Analysis of streaming data can take a few forms. On one hand, companies can process streams of data and store them on disk for analysis and reporting as needed. Alternatively, companies may need to process and respond to data in real time, since that is when the data is most valuable. Examples include monitoring for service outages, making website recommendations and doing real-time price calculations.

For these purposes, Python connects to platforms that handle real-time data feeds, such as Kafka. Kafka has a variety of use cases for managing high-volume activities. For example, LinkedIn uses Kafka for activity stream data and operational metrics, while Netflix employs it for real-time monitoring and event processing.4 Kafka also works well alongside Kubernetes, since log aggregation is crucial when deploying thousands of application instances at scale.

Python's ability to integrate and pull data from disparate sources and formats, both static and streamed, makes it extremely valuable as a means of generating return on investment (ROI) on an organization's big data infrastructure.
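As a toy sketch of the real-time pattern described above: in production the events would arrive from a platform like Kafka (for example via the kafka-python client), but the windowed logic is the same. The function name, feed values and threshold below are all invented for illustration:

```python
from collections import deque

def rolling_alerts(readings, window=5, threshold=90.0):
    """Flag readings whose rolling-window average exceeds a threshold.

    In a real deployment `readings` would be an iterator over a Kafka
    topic; here it is any iterable of numeric sensor values.
    """
    window_vals = deque(maxlen=window)
    alerts = []
    for i, value in enumerate(readings):
        window_vals.append(value)
        avg = sum(window_vals) / len(window_vals)
        if avg > threshold:
            alerts.append((i, round(avg, 1)))
    return alerts

# Simulated sensor feed: a spike pushes the rolling average over 90
feed = [85, 87, 88, 86, 95, 99, 102, 98, 84, 83]
alerts = rolling_alerts(feed)
print(alerts)
```

Because the consuming logic is decoupled from the transport, the same function can be tested offline and then pointed at a live stream.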
3. Tom Radcliffe, "R vs. Python: A False Dichotomy", ActiveState Blog: https://www.activestate.com/blog/2016/02/r-vs-python-false-dichotomy
4. Kafka website: https://kafka.apache.org/powered-by
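The database connectivity described above can be sketched with Python's built-in sqlite3 module, which follows the same DB-API 2.0 pattern as the drivers for the production databases named (MySQL, PostgreSQL, and so on); the table and values here are invented for illustration:

```python
import sqlite3

# sqlite3 ships with Python; swapping in another DB-API driver
# changes only the connect() call, not the query pattern
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE events (source TEXT, value REAL)")
cur.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("sensor", 21.5), ("web", 1.0), ("sensor", 22.1)],
)
conn.commit()

# Query back an aggregate, as an analysis step would
cur.execute("SELECT AVG(value) FROM events WHERE source = 'sensor'")
avg = cur.fetchone()[0]
print(f"average sensor value: {avg}")
conn.close()
```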
Figure 1. Sample of Python's tools and libraries for the data science workflow, including Theano, Keras, Lasagne, NLTK (natural language) and Intel MKL (speed optimization).
Preparing Messy, Missing and Unlabelled Data

Before any meaningful data analysis can take place, data must be cleaned up and organized into a usable format. Data preparation often involves labelling, filling in missing values and filtering outliers. Although it is essential for accurate and reliable analysis, data preparation is a time-consuming process, accounting for up to 80% of the work of data scientists. Python has a number of packages that facilitate data preparation, helping data scientists spend more time on high-value work.

One of these packages is Pandas, which brings the fundamental concept of data frames to Python. Data frames, previously unique to the R language, carry entity labels along as row and column headers in a matrix format. This allows data scientists to focus on analysis while the data frames automate the "bookkeeping" of the metadata.5 In addition, Pandas provides features such as missing data estimation, adding and deleting columns and rows, and handling time series. Pandas is useful for those who prefer to work in Python but like the R syntax, and offers the advantage of being able to call out to large datasets.

When it comes to working with data from various sources, packages such as Luigi, Airflow and Dask help with building data pipelines, managing workflows and
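A minimal sketch of the preparation steps described above (missing-value filling and outlier filtering) with Pandas; the column names and values are invented for illustration:

```python
import pandas as pd

# A small, messy dataset: one missing reading and one outlier
df = pd.DataFrame({
    "sensor": ["a", "b", "c", "d", "e"],
    "reading": [10.0, None, 12.0, 11.0, 500.0],
})

# Fill the missing value with the column median
df["reading"] = df["reading"].fillna(df["reading"].median())

# Filter the outlier: keep readings below the 90th percentile
clean = df[df["reading"] < df["reading"].quantile(0.9)]
print(clean)
```

The row and column labels travel with the data through every step, which is the "bookkeeping" Pandas automates.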
5. Tom Radcliffe, "Pandas: Framing the Data", ActiveState Blog: https://www.activestate.com/blog/2017/05/pandas-framing-data
scaling up and scaling out analytics. These tools handle much of the complexity that arises as data science teams grow, move faster and build on an evolving data infrastructure, allowing them to keep their momentum on projects.

More Minds are Better than One: Jupyter (Formerly IPython) Notebook

Part of what makes Python great for data science is that it supports exploratory and interactive programming. Data scientists can easily share their code and annotations with colleagues, as well as experiment with code and see the results as they go along.

To start with, IPython adds an interactive command shell to Python which provides a number of development enhancements. One of these enhancements is Jupyter Notebook, a browser-based tool for authoring documents (notebooks) which combine code with explanatory text, mathematics, computations, diagrams and other media. Data scientists can use notebooks to keep their analysis and observations in one place and share them with colleagues or the community.

Notebooks are useful as a way to troubleshoot problems, since they allow others to easily see the steps one went through to try to solve them. They have also become popular as a medium for knowledge sharing. Many notebooks for data science are published and maintained on GitHub.

The other aspect of Jupyter Notebook is that it provides a browser-based REPL (Read-Eval-Print Loop). This allows users to enter an expression and evaluate the result immediately. For example, a data scientist can enter "import pandas", create variables and get output without a separate compile or run step. The availability of a REPL within Python allows for instantaneous mathematical calculations, algorithm exploration and fast prototyping.

MACHINE LEARNING WITH PYTHON

The rise of big data has led to significant advances in artificial intelligence. Machine learning, the practice of using algorithms to train programs not only to recognize patterns in data but to learn and take action when exposed to new data, has existed for many years as an approach to AI. Recently, though, thanks to the convergence of practically infinite data, storage, processing power and GPUs, we are now able to feed massive amounts of data through a system to train it. This has enabled the development of complex, multi-layered learning systems called neural networks, creating the field of "deep learning".

Deep learning has led to the growth of a number of AI capabilities, including image recognition (e.g. recognizing objects or patterns in photographs) and natural language processing (e.g. summarizing text or recognizing speech). Its applications are far-reaching, from automatic stock trading and customer sentiment analysis to autonomous vehicles, optimized equipment usage, and new discoveries in healthcare, pharmaceuticals and other sciences.

Deep learning initiatives are primarily driven by open source languages. Machine learning workflows typically involve pre-processing to clean up the data, learning stages in which libraries of data are fed through a system to hone its pattern recognition, and testing of the results
on independent data, followed by deployment if the tests are successful.

Python, in particular, is the most popular language for machine and deep learning. Python packages like Pandas and the Natural Language Toolkit (NLTK) help with the pre-processing. TensorFlow, Theano and Keras, as well as scikit-learn, provide the algorithms, additional libraries, computational power and user-friendly control to develop the learning stages, and deployment is simply a matter of packaging the model once it's running well in testing.

Here is a look at some of the major Python packages in the machine learning workflow.

Natural language processing (NLP), for example, helps companies gauge customer sentiment of their products or services, as well as indicators of what is driving positive or negative sentiment.

One of the most widely used NLP libraries is Python's Natural Language Toolkit (NLTK). NLTK provides essential functions such as tokenization (extracting key words and phrases from content), stemming (e.g. grouping words like happy, happiness and happier) and creating parse trees, which are tree diagrams that reveal linguistic structure and word dependencies. NLTK provides over 50 corpora6 (repositories of names, words and phrases) as well as numerous algorithms and cookbooks, which would all be exceedingly difficult to build in-house.
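As a small sketch of the tokenization and stemming functions described above (assuming NLTK is installed; the Porter stemmer needs no corpus download, and the regular-expression tokenizer here is a simplified stand-in for NLTK's richer tokenizers):

```python
import re
from nltk.stem import PorterStemmer

# Tokenize with a simple regular expression
text = "Happiness is contagious; happier customers write happy reviews."
tokens = re.findall(r"[a-z]+", text.lower())

# Stem each token so related word forms group together
stemmer = PorterStemmer()
stems = [stemmer.stem(tok) for tok in tokens]
print(stems)
```

After stemming, "happy" and "happiness" collapse to the same stem, which is what lets a sentiment model treat them as one feature.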
6. NLTK Documentation: http://www.nltk.org/
7. James Vincent, "Google has given its open-source machine learning software a big upgrade", The Verge: https://www.theverge.com/2016/4/13/11420144/google-machine-learning-tensorflow-upgrade
TensorFlow users include companies such as Airbnb, Airbus, Snapchat and Qualcomm.8 One notable use case is UK online supermarket Ocado, which uses TensorFlow to route robots around its warehouses, improve demand forecasting and recommend items to add to customers' online shopping carts.9 Ocado built their system in six months using Python, C++ and Kubernetes with TensorFlow, a project initiated when weather storms highlighted the need to prioritize emails in their contact centers based on their content rather than on a first-come, first-served basis.

Intel MKL provides multi-threaded and vectorized functions that maximize the performance of NumPy, SciPy, Theano and other computational libraries when running on Intel or compatible multi-core processors. Estimates range from two to ten times speedups on individual workstations, and much higher as more cores are applied to the model.10 Since MKL utilizes C and Fortran, it is compatible with many existing linear algebra libraries and offers Python performance comparable to C or C++.

Going from Idea to Result Faster with Keras
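As a brief illustration of the high-level API behind Keras's quick idea-to-result cycle (assuming TensorFlow with its bundled Keras is installed; the layer sizes here are arbitrary), a small classifier can be defined in a few lines:

```python
from tensorflow import keras

# A small feed-forward network for 10-class classification:
# one line per layer is what makes iteration fast
model = keras.Sequential([
    keras.layers.Input(shape=(20,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

From here, training is a single `model.fit(...)` call on prepared arrays, with TensorFlow handling the underlying computation.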
8. TensorFlow website: https://www.tensorflow.org/
9. ComputerWorldUK, "What is TensorFlow, and how are businesses using it?": http://www.computerworlduk.com/open-source/what-is-tensorflow-how-are-businesses-using-it-3658374/
10. Performance benchmarks available on the Intel MKL website: https://software.intel.com/en-us/mkl/features/benchmarks
11. Keras Documentation: https://keras.io/
RECOMMENDATIONS
The combination of flexibility and extensive libraries makes Python the ideal language for data science and machine
learning. So how do you get started with Python for your data science initiatives? You can download the default
Python implementation (CPython), install the core packages for numerical and scientific computing—NumPy, SciPy
and Matplotlib—and start exploring. Or, to make life easier, you can try an alternative Python implementation such as
ActivePython with these libraries and many more pre-packaged.
In either case, implementing Python beyond a few internal machines or on production systems brings up a number
of considerations. These include which Python distribution to standardize on, setup and configuration time, staffing,
support and security requirements. Each of these factors depends on your organization’s specific needs.
One of Python's great strengths is its open source ecosystem. Python libraries are constantly emerging and advancing based on the contributions of the community, which drives more innovation than any one company can provide. While some commercial Python-based platforms offer easy-to-use collaboration or visualization tools, customers can find themselves locked into a vendor-specific toolset. In contrast, the open source Python universe gives data scientists the flexibility to grab the right tool for the job at any time and run with it.

One example of open source innovation is TensorFlow, which has exploded in community contributions since its release as an open source library by Google in 2015. PyTables offers a separate example of open source synergy. Released in 2003 as a way to manage large amounts of data, PyTables has grown in tandem with the HDF5 format, anticipating the big data boom that came much later. Both examples demonstrate the value of open source in evolving with constantly changing business needs.

Data scientists are notoriously hard to find and expensive to hire. As part of a multi-pronged approach to building data science teams, companies are retooling and training data scientists from within the organization. Implementing Python provides benefits for training and recruitment, such as employee growth opportunities, faster results from ramping up data science efforts in conjunction with training, and easier hiring of additional staff.

Data analysts familiar with R can learn Python with relative ease due to its low learning curve and frameworks like rpy2. Since Python can connect to all the data sources organizations use, data scientists-in-training can start mining big data for insights while learning on the job.

In addition, aligning with the open source Python ecosystem allows organizations to recruit skilled staff from a larger candidate pool. By bringing in people who are already experienced with Python, organizations can
benefit from faster onboarding and, consequently, faster time to market.

Although open source Python offers a wide selection of tools and libraries, setting up individual user environments can take a significant amount of time and resources. High-value staff can end up wasting days on the low-value work of installing and configuring packages before they are able to start writing algorithms.

To solve this challenge, specialized Python distributions come precompiled with the most popular open source packages for data science, including the SciPy stack and machine learning libraries. By using a precompiled distribution, data science and application development teams can stay focused on productivity, rather than having to hack together and maintain all the components they need.

Technical Support

Solving technical issues for open source Python implementations is a challenge. Aside from troubleshooting issues internally, organizations must resort to posting issues on public forums such as Stack Overflow, where it can take days or weeks to get a response, if they get one at all. This can be impractical for time-sensitive or critical issues where downtime is not an option.

On top of that, many organizations are hesitant to reveal their intellectual property in a public forum, where questions on specific algorithms or machine learning packages could easily expose competitive advantages. Based on these factors, commercial support could be a worthwhile or necessary investment.

License compliance risks are surprisingly common in commercial applications. According to a recent Black Duck report, up to 85% of audited code bases were found to be out of compliance with open source license terms,12 exposing organizations to potentially costly legal challenges. To address this problem, certain commercial Python providers offer full license reviews of the packages included in their distributions, as well as legal indemnification to protect against potential IP infringement lawsuits arising from the use of third-party software.

Oftentimes, open source components are added directly to code bases with security vulnerabilities present. According to Black Duck, more than 60% of audited applications contained open source vulnerabilities. With hundreds of open source packages in various ecosystems, and organizations' lack of oversight over these components, it is easy for data engineers or data scientists to accidentally download vulnerabilities, unbeknownst to their IT departments.

Commercial Python distributions can provide greater security, since the packages are generally reviewed and maintained by the commercial provider. When using a precompiled distribution, you can check with the provider to ensure that all included packages are vetted for security vulnerabilities, that the latest secure versions of packages are included, and that all packages are monitored for security updates on an ongoing basis.
12. Madison Moore, SD Times, "Black Duck audit highlights risk of open-source security vulnerabilities": http://sdtimes.com/black-duck-audit-highlights-risk-open-source-security-vulnerabilities/
13. Gartner, "Gartner Says It's Not Just About Big Data; It's What You Do With It: Welcome to the Algorithmic Economy": http://www.gartner.com/newsroom/id/3142917
As companies continue to invest in big data, the issue is becoming less about the data itself and more about how the data is used to create competitive products and services. According to Gartner, "Companies will be valued not just on their big data, but on the algorithms that turn that data into actions and impact customers."13
Python is the fundamental tool for this purpose, serving as a common language for the multi-disciplinary field of
data science. It allows data scientists to interrogate data from disparate sources, developers to turn those insights
into applications, and systems engineers to deploy on any infrastructure, whether on-premise or in the cloud. With
Python, companies are able to get the most ROI out of their existing investments in big data.
Companies are not only maximizing their use of data, but transforming into “algorithmic businesses” with Python as
the leading language for machine learning. Whether it’s automatic stock trading, discoveries of new drug treatments,
optimized resource production or any number of applications involving speech, text or image recognition, machine
and deep learning are becoming the primary competitive advantage in every industry.
The time is now for companies to get started on data science initiatives if they have not already. Introducing
Python into their technology stack is an important step, but companies should consider factors such as support
requirements, staffing plans, licensing compliance and security. By addressing these needs early on, data science
teams can focus on unlocking the power of their data and driving innovation forward.
ABOUT ACTIVEPYTHON
ActivePython is a leading Python distribution used by large enterprises, government and community developers. With
over 300 of the top open source packages included for data science, machine learning, web application and general
Python development, ActivePython delivers proven open source software with enterprise-level security and support.
ActivePython is made by ActiveState, a founding member of the Python Software Foundation, trusted by millions of
developers and 97% of Fortune-1000 companies.
Getting started with ActivePython for data science is easy. Your team can start writing algorithms for free with
Community Edition, and learn more about commercial options for use in production at www.activestate.com
ActiveState Software Inc.
Phone: +1.778.786.1100
Fax: +1.778.786.1133
Toll-free in North America: 1.866.631.4581
business-solutions@activestate.com
ABOUT ACTIVESTATE
ActiveState, the Open Source Languages Company, believes that enterprises gain a competitive advantage when they are able to quickly create, deploy, and efficiently manage software solutions
that immediately create business value, but they face many challenges that prevent them from doing so. The Company is uniquely positioned to help address these challenges through our
experience with enterprises, people and technology. ActiveState is proven for the enterprise: More than two million developers and 97% of Fortune-1000 companies use ActiveState’s end-to-end
solutions to develop, distribute, and manage their software applications. Global customers like Bank of America, CA, Cisco, HP, Lockheed Martin and Siemens trust ActiveState to save time, save
money, minimize risk, ensure compliance, and reduce time to market.
© 2017 ActiveState Software Inc. All rights reserved. ActiveState®, ActivePerl®, ActiveTcl®, ActivePython®, Komodo®, ActiveState Perl Dev Kit®, ActiveState Tcl Dev Kit®, ActiveGo™, ActiveRuby™,
ActiveNode™, ActiveLua™ and The Open Source Languages Company™ are all trademarks of ActiveState.