Guide Python Data Science
INTRODUCTION: THE BIG DATA DILEMMA

In recent years, the amount of data available to companies has skyrocketed. According to IBM, 2.5 billion gigabytes (GB) of data are created every day.1 With this massive influx come new opportunities for companies to deliver better customer experiences and get an edge on their competition. To this end, enterprises have been investing in big data platforms such as Hadoop, Spark and NoSQL databases. The greater challenge, though, is not only collecting and storing data, but deriving meaningful insights from it and operationalizing those insights to create business value.

Behind these efforts are the programming languages used by data science teams to clean up and prepare data, write and test algorithms, build statistical models, and translate models into consumable applications or visualizations. In this regard, Python stands out as the language best suited for all areas of the data science and machine learning workflow.

Designed as a flexible, general-purpose language, Python is widely used by programmers and easily learned by statisticians. Its extensive libraries make it a powerful tool for statistical analysis, and it is routinely used to integrate models into web applications and production databases. Beyond conventional data analysis, Python is the leading language for machine learning, changing how businesses operate in every industry.

This guide provides a summary of Python's attributes in each of these areas, as well as considerations for implementing Python to drive new insights and innovation from big data.

In practice, data science teams use a combination of languages to play to the strengths of each one, with Python and R used in varying degrees. Below is a brief comparison table highlighting each language in the context of data science.
1. Matthew Wall, "Big Data: Are you ready for blast-off?", BBC News: http://www.bbc.com/news/business-26383058
UNLOCKING THE POWER OF
DATA SCIENCE & MACHINE
LEARNING WITH PYTHON
|  | Python | R | Java | Go |
| --- | --- | --- | --- | --- |
| Core Strengths | Easy to use, multi-purpose, large community | Made for statistics, large community | High performance, wide enterprise adoption | Modern architecture; clean, reliable code; lightweight; fast |
| Ideal Use Cases | Data analysis, visualization, exploratory analysis, data engineering, rapid prototyping, machine learning, web applications, workflow integration | Data analysis, visualization, exploratory analysis | Production systems | Production systems; data analysis, data engineering and machine learning (emerging); web applications; microservices |
| Types of Users | Data scientists, data engineers, machine learning engineers, web app developers | Data scientists, data engineers | Systems developers | Data scientists, data engineers and machine learning engineers (emerging); web app developers; systems developers |
| Learning Curve | Accessible, easy to learn | Easy for statisticians; steep but learnable for programmers | Difficult; only for professional programmers | Steep; can be learned by programmers |
| Data Science Ecosystem | Robust, growing (PyPI) | Robust, mature (CRAN) | Lacks coverage | Early, growing |
| Deployment | Many deployment tools/integrations | Difficult; requires compiler | Simple; Java Virtual Machine (JVM) ubiquity | Simple (compiled executable) |
R

Developed by and for statisticians, R is best suited for exploratory data analysis and visualization. Statisticians can use R to express their thoughts and ideas naturally without having a programming background. R is supported by a large and active community, and the CRAN repository contains thousands of packages and readily usable tests to perform almost any type of data analysis.

Because R is designed mainly for standalone computing, however, it is slower than Python and other languages, and is limited to working with datasets small enough to fit into memory. It also has a steep learning curve for those who are not trained statisticians. Data scientists will often use R for desktop prototyping and then use a more flexible language like Python or Java to deploy to production.

Java

Known for its performance and scalability, Java is often the preferred choice for enterprise infrastructure. It has a vast ecosystem and developer base, owing to its widespread enterprise adoption. As a compiled language, it is generally faster than interpreted languages, but for data science tasks it is often slower than Python, whose specialized libraries are optimized for performance.

Compared to Python and R, Java is the least suited for statistical analysis and visualization. Although there are packages that add some of these functions, they are not as well supported as those available for Python or R. Java's highly object-oriented language structure also makes it extremely difficult to learn for non-professional programmers.

Scala, which runs on the JVM, is increasingly being used for machine learning and building high-level algorithms. However, like Java, it is not easily accessible to most data scientists due to its programmer-focused structure and lack of supporting libraries for data analysis.

Go

Recently, Go has emerged as an alternative to Python and R as a solution to issues around deployment and maintenance of data science code in production. This is because Go promotes more efficient, better-quality, error-free code that is easily integrated into a company's existing architecture.2 Popular tools like Jupyter Notebooks have also now been extended to support Go.

The main drawback to using Go for data science right now is that its ecosystem is underdeveloped compared to Python and R, missing essential tools for arrays and visualization. As a relatively new language, Go is quickly gaining traction for microservices and web applications. Keep your eye on this language for its potential productivity gains in data science.

DATA ANALYSIS WITH PYTHON

Aside from its flexibility and ease of use, Python's extensive libraries make it a powerful tool for data preparation, analysis and visualization compared to other languages. Along with the foundational packages NumPy (multi-dimensional arrays), SciPy (numerical algorithms) and Matplotlib (plotting and visualization), the following libraries give Python enhanced productivity and integration with big data sources.
2. Daniel Whitenack, "Data Science Gophers", O'Reilly: https://www.oreilly.com/ideas/data-science-gophers
Complementing Your R Workforce With Python: rpy2

Python and R are often used together for their complementary capabilities. Data scientists may use R for its statistical functions and then wrap their model in a Python application that has a variety of additional features. Or they may use Python for analysis and call specialized packages only found in R. rpy2 provides an interface for accessing R within Python and is helpful for these use cases.

Those familiar with R can use rpy2 to learn some Python while thinking about their problem in R terms, and then express it in Python easily. Those who are not familiar with either can use rpy2 to learn R and Python at the same time, gaining the power of both languages.3

However, since rpy2 calls the R libraries underneath, it is limited to working with data that fits into desktop or server memory. Analyzing larger datasets requires frameworks such as Hadoop and HDF5, which can bring data in as you need it and push it out when you don't.

Connecting to Big Data Platforms

Python has libraries to connect to the various types of databases most organizations now use. This includes traditional SQL databases (e.g. MySQL, Microsoft SQL Server, Oracle), NoSQL databases (e.g. MongoDB, Cassandra, Redis), file systems (e.g. Hadoop) and streaming data (e.g. Kafka).

Aside from stored data, one of the big challenges companies face is how to deal with huge amounts of streaming data coming in from sources such as sensors, web feeds and market transactions. Analysis of streaming data can take a few forms. On one hand, companies can process streams of data and store them on disk for analysis and reporting as needed. Alternatively, companies may need to process and respond to data in real time, since that is when the data is most valuable. Examples include monitoring for service outages, making website recommendations and doing real-time price calculations.

For these purposes, Python connects to platforms that handle real-time data feeds, such as Kafka. Kafka has a variety of use cases for managing high-volume activities. For example, LinkedIn uses Kafka for activity stream data and operational metrics, while Netflix employs it for real-time monitoring and event processing.4 Kafka also works well alongside Kubernetes, since log aggregation is crucial when deploying thousands of application instances at scale.

Python's ability to integrate and pull data from disparate sources and formats, both static and streamed, makes it extremely valuable as a means of generating return on investment (ROI) on an organization's big data infrastructure.
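As a toy sketch of the real-time pattern described above: in production the events would arrive from a platform like Kafka (for example via the kafka-python client), but the windowed logic is the same. The function name, feed values and threshold below are all invented for illustration:

```python
from collections import deque

def rolling_alerts(readings, window=5, threshold=90.0):
    """Flag readings whose rolling-window average exceeds a threshold.

    In a real deployment `readings` would be an iterator over a Kafka
    topic; here it is any iterable of numeric sensor values.
    """
    window_vals = deque(maxlen=window)
    alerts = []
    for i, value in enumerate(readings):
        window_vals.append(value)
        avg = sum(window_vals) / len(window_vals)
        if avg > threshold:
            alerts.append((i, round(avg, 1)))
    return alerts

# Simulated sensor feed: a spike pushes the rolling average over 90
feed = [85, 87, 88, 86, 95, 99, 102, 98, 84, 83]
alerts = rolling_alerts(feed)
print(alerts)
```

Because the consuming logic is decoupled from the transport, the same function can be tested offline and then pointed at a live stream.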
3. Tom Radcliffe, "R vs. Python: A False Dichotomy", ActiveState Blog: https://www.activestate.com/blog/2016/02/r-vs-python-false-dichotomy
4. Kafka website: https://kafka.apache.org/powered-by
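The database connectivity described above can be sketched with Python's built-in sqlite3 module, which follows the same DB-API 2.0 pattern as the drivers for the production databases named (MySQL, PostgreSQL, and so on); the table and values here are invented for illustration:

```python
import sqlite3

# sqlite3 ships with Python; swapping in another DB-API driver
# changes only the connect() call, not the query pattern
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE events (source TEXT, value REAL)")
cur.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("sensor", 21.5), ("web", 1.0), ("sensor", 22.1)],
)
conn.commit()

# Query back an aggregate, as an analysis step would
cur.execute("SELECT AVG(value) FROM events WHERE source = 'sensor'")
avg = cur.fetchone()[0]
print(f"average sensor value: {avg}")
conn.close()
```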
Figure 1. Sample of Python's tools and libraries for the data science workflow, including Theano, Keras, Lasagne, NLTK (natural language) and Intel MKL (speed optimization).
Preparing Messy, Missing and Unlabelled Data

Before any meaningful data analysis can take place, data must be cleaned up and organized into a usable format. Data preparation often involves labelling, filling in missing values and filtering outliers. Although it is essential for accurate and reliable analysis, data preparation is a time-consuming process, accounting for up to 80% of the work of data scientists. Python has a number of packages that facilitate data preparation, helping data scientists spend more time on high-value work.

One of these packages is Pandas, which brings the fundamental concept of data frames to Python. Data frames, previously unique to the R language, carry entity labels along as row and column headers in a matrix format. This allows data scientists to focus on analysis while the data frames automate the "bookkeeping" of the metadata.5 In addition, Pandas provides features such as missing data estimation, adding and deleting columns and rows, and handling time series. Pandas is useful for those who prefer to work in Python but like the R syntax, and offers the advantage of being able to call out to large datasets.

When it comes to working with data from various sources, packages such as Luigi, Airflow and Dask help with building data pipelines, managing workflows and
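A minimal sketch of the preparation steps described above (missing-value filling and outlier filtering) with Pandas; the column names and values are invented for illustration:

```python
import pandas as pd

# A small, messy dataset: one missing reading and one outlier
df = pd.DataFrame({
    "sensor": ["a", "b", "c", "d", "e"],
    "reading": [10.0, None, 12.0, 11.0, 500.0],
})

# Fill the missing value with the column median
df["reading"] = df["reading"].fillna(df["reading"].median())

# Filter the outlier: keep readings below the 90th percentile
clean = df[df["reading"] < df["reading"].quantile(0.9)]
print(clean)
```

The row and column labels travel with the data through every step, which is the "bookkeeping" Pandas automates.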
5. Tom Radcliffe, "Pandas: Framing the Data", ActiveState Blog: https://www.activestate.com/blog/2017/05/pandas-framing-data
scaling up and scaling out analytics. These tools handle much of the complexity that arises as data science teams grow, move faster and build on an evolving data infrastructure, allowing them to keep their momentum on projects.

More Minds are Better than One: Jupyter (Formerly IPython) Notebook

Part of what makes Python great for data science is that it supports exploratory and interactive programming. Data scientists can easily share their code and annotations with colleagues, as well as experiment with code and see the results as they go along.

To start with, IPython adds an interactive command shell to Python which provides a number of development enhancements. One of these enhancements is Jupyter Notebook, a browser-based tool for authoring documents (notebooks) which combine code with explanatory text, mathematics, computations, diagrams and other media. Data scientists can use notebooks to keep their analysis and observations in one place and share them with colleagues or the community.

Notebooks are useful as a way to troubleshoot problems, since they allow others to easily see the steps one went through to try to solve them. They have also become popular as a medium for knowledge sharing. Many notebooks for data science are published and maintained on GitHub.

The other aspect of Jupyter Notebook is that it provides a browser-based REPL (Read-Eval-Print Loop). This allows users to enter an expression and evaluate the result immediately. For example, a data scientist can enter "import pandas", create variables and get output without a separate compile or run step. The availability of a REPL within Python allows for instantaneous mathematical calculations, algorithm exploration and fast prototyping.

MACHINE LEARNING WITH PYTHON

The rise of big data has led to significant advances in artificial intelligence. Machine learning, the practice of using algorithms to train programs not only to recognize patterns in data but to learn and take action when exposed to new data, has existed for many years as an approach to AI. Recently, though, thanks to the convergence of practically infinite data, storage, processing power and GPUs, we are now able to feed massive amounts of data through a system to train it. This has enabled the development of complex, multi-layered learning systems called neural networks, creating the field of "deep learning".

Deep learning has led to the growth of a number of AI capabilities, including image recognition (e.g. recognizing objects or patterns in photographs) and natural language processing (e.g. summarizing text or recognizing speech). Its applications are far-reaching, from automatic stock trading and customer sentiment analysis to autonomous vehicles, optimized equipment usage, and new discoveries in healthcare, pharmaceuticals and other sciences.

Deep learning initiatives are primarily driven by open source languages. Machine learning workflows typically involve pre-processing to clean up the data, learning stages in which libraries of data are fed through a system to hone its pattern recognition, and testing of the results
on independent data, followed by deployment if the tests are successful.

Python, in particular, is the most popular language for machine and deep learning. Python packages like Pandas and the Natural Language Toolkit (NLTK) help with the pre-processing. TensorFlow, Theano and Keras, as well as scikit-learn, provide the algorithms, additional libraries, computational power and user-friendly control to develop the learning stages, and deployment is simply a matter of packaging the model once it's running well in testing.

Here is a look at some of the major Python packages in the machine learning workflow.

Natural language processing (NLP), for example, helps companies gauge customer sentiment of their products or services, as well as indicators of what is driving positive or negative sentiment.

One of the most widely used NLP libraries is Python's Natural Language Toolkit (NLTK). NLTK provides essential functions such as tokenization (extracting key words and phrases from content), stemming (e.g. grouping words like happy, happiness and happier) and creating parse trees, which are tree diagrams that reveal linguistic structure and word dependencies. NLTK provides over 50 corpora6 (repositories of names, words and phrases) as well as numerous algorithms and cookbooks, which would all be exceedingly difficult to build in-house.
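As a small sketch of the tokenization and stemming functions described above (assuming NLTK is installed; the Porter stemmer needs no corpus download, and the regular-expression tokenizer here is a simplified stand-in for NLTK's richer tokenizers):

```python
import re
from nltk.stem import PorterStemmer

# Tokenize with a simple regular expression
text = "Happiness is contagious; happier customers write happy reviews."
tokens = re.findall(r"[a-z]+", text.lower())

# Stem each token so related word forms group together
stemmer = PorterStemmer()
stems = [stemmer.stem(tok) for tok in tokens]
print(stems)
```

After stemming, "happy" and "happiness" collapse to the same stem, which is what lets a sentiment model treat them as one feature.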
6. NLTK Documentation: http://www.nltk.org/
7. James Vincent, "Google has given its open-source machine learning software a big upgrade", The Verge: https://www.theverge.com/2016/4/13/11420144/google-machine-learning-tensorflow-upgrade
TensorFlow users include companies such as Airbnb, Airbus, Snapchat and Qualcomm.8 One notable use case is UK online supermarket Ocado, which uses TensorFlow to route robots around its warehouses, improve demand forecasting and recommend items to add to customers' online shopping carts.9 Ocado built their system in six months using Python, C++ and Kubernetes with TensorFlow, a project initiated when weather storms highlighted the need to prioritize emails in their contact centers based on their content rather than on a first-come, first-served basis.

Intel MKL provides multi-threaded and vectorized functions that maximize the performance of NumPy, SciPy, Theano and other computational libraries when running on Intel or compatible multi-core processors. Estimates range from two to ten times speedups on individual workstations, and much higher as more cores are applied to the model.10 Since MKL utilizes C and Fortran, it is compatible with many existing linear algebra libraries and offers Python performance comparable to C or C++.

Going from Idea to Result Faster with Keras
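As a brief illustration of the high-level API behind Keras's quick idea-to-result cycle (assuming TensorFlow with its bundled Keras is installed; the layer sizes here are arbitrary), a small classifier can be defined in a few lines:

```python
from tensorflow import keras

# A small feed-forward network for 10-class classification:
# one line per layer is what makes iteration fast
model = keras.Sequential([
    keras.layers.Input(shape=(20,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

From here, training is a single `model.fit(...)` call on prepared arrays, with TensorFlow handling the underlying computation.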
8. TensorFlow website: https://www.tensorflow.org/
9. ComputerWorldUK, "What is TensorFlow, and how are businesses using it?": http://www.computerworlduk.com/open-source/what-is-tensorflow-how-are-businesses-using-it-3658374/
10. Performance benchmarks available on the Intel MKL website: https://software.intel.com/en-us/mkl/features/benchmarks
11. Keras Documentation: https://keras.io/
RECOMMENDATIONS
The combination of flexibility and extensive libraries makes Python the ideal language for data science and machine
learning. So how do you get started with Python for your data science initiatives? You can download the default
Python implementation (CPython), install the core packages for numerical and scientific computing—NumPy, SciPy
and Matplotlib—and start exploring. Or, to make life easier, you can try an alternative Python implementation such as
ActivePython with these libraries and many more pre-packaged.
In either case, implementing Python beyond a few internal machines or on production systems brings up a number
of considerations. These include which Python distribution to standardize on, setup and configuration time, staffing,
support and security requirements. Each of these factors depends on your organization’s specific needs.
One of Python's great strengths is its open source ecosystem. Python libraries are constantly emerging and advancing based on the contributions of the community, which drives more innovation than any one company can provide. While some commercial Python-based platforms offer easy-to-use collaboration or visualization tools, customers can find themselves locked into a vendor-specific toolset. In contrast, the open source Python universe gives data scientists the flexibility to grab the right tool for the job at any time and run with it.

One example of open source innovation is TensorFlow, which has exploded in community contributions since its release as an open source library by Google in 2015. PyTables offers a separate example of open source synergy. Released in 2003 as a way to manage large amounts of data, PyTables has grown in tandem with the HDF5 format, anticipating the big data boom that came much later. Both examples demonstrate the value of open source in evolving with constantly changing business needs.

Data scientists are notoriously hard to find and expensive to hire. As part of a multi-pronged approach to building data science teams, companies are retooling and training data scientists from within the organization. Implementing Python provides benefits for training and recruitment, such as employee growth opportunities, faster results from ramping up data science efforts in conjunction with training, and easier hiring of additional staff.

Data analysts familiar with R can learn Python with relative ease due to its low learning curve and frameworks like rpy2. Since Python can connect to all the data sources organizations use, data scientists-in-training can start mining big data for insights while learning on the job.

In addition, aligning with the open source Python ecosystem allows organizations to recruit skilled staff from a larger candidate pool. By bringing in people who are already experienced with Python, organizations can
benefit from faster onboarding and, consequently, faster time to market.

Although open source Python offers a wide selection of tools and libraries, setting up individual user environments can take a significant amount of time and resources. High-value staff can end up wasting days on the low-value work of installing and configuring packages before they are able to start writing algorithms.

To solve this challenge, specialized Python distributions come precompiled with the most popular open source packages for data science, including the SciPy stack and machine learning libraries. By using a precompiled distribution, data science and application development teams can stay focused on productivity, rather than having to hack together and maintain all the components they need.

Technical Support

Solving technical issues for open source Python implementations is a challenge. Aside from troubleshooting issues internally, organizations must resort to posting issues on public forums such as Stack Overflow, where it can take days or weeks to get a response, if they get one at all. This can be impractical for time-sensitive or critical issues where downtime is not an option.

On top of that, many organizations are hesitant to reveal their intellectual property in a public forum, where questions on specific algorithms or machine learning packages could easily expose competitive advantages. Based on these factors, commercial support could be a worthwhile or necessary investment.

License compliance risks are surprisingly common in commercial applications. According to a recent Black Duck report, up to 85% of audited code bases were found to be out of compliance with open source license terms,12 exposing organizations to potentially costly legal challenges. To address this problem, certain commercial Python providers offer full license reviews of the packages included in their distributions, as well as legal indemnification to protect against potential IP infringement lawsuits arising from the use of third-party software.

Oftentimes, open source components are added directly to code bases with security vulnerabilities present. According to Black Duck, more than 60% of audited applications contained open source vulnerabilities. With hundreds of open source packages in various ecosystems, and organizations' lack of oversight over these components, it is easy for data engineers or data scientists to accidentally download vulnerabilities, unbeknownst to their IT departments.

Commercial Python distributions can provide greater security, since the packages are generally reviewed and maintained by the commercial provider. When using a precompiled distribution, you can check with the provider to ensure that all included packages are vetted for security vulnerabilities, that the latest secure versions of packages are included, and that all packages are monitored for security updates on an ongoing basis.
12. Madison Moore, SD Times, "Black Duck audit highlights risk of open-source security vulnerabilities": http://sdtimes.com/black-duck-audit-highlights-risk-open-source-security-vulnerabilities/
13. Gartner, "Gartner Says It's Not Just About Big Data; It's What You Do With It: Welcome to the Algorithmic Economy": http://www.gartner.com/newsroom/id/3142917
As companies continue to invest in big data, the issue is becoming less about the data itself and more about how the data is used to create competitive products and services. According to Gartner, "Companies will be valued not just on their big data, but on the algorithms that turn that data into actions and impact customers."13
Python is the fundamental tool for this purpose, serving as a common language for the multi-disciplinary field of
data science. It allows data scientists to interrogate data from disparate sources, developers to turn those insights
into applications, and systems engineers to deploy on any infrastructure, whether on-premise or in the cloud. With
Python, companies are able to get the most ROI out of their existing investments in big data.
Companies are not only maximizing their use of data, but transforming into “algorithmic businesses” with Python as
the leading language for machine learning. Whether it’s automatic stock trading, discoveries of new drug treatments,
optimized resource production or any number of applications involving speech, text or image recognition, machine
and deep learning are becoming the primary competitive advantage in every industry.
The time is now for companies to get started on data science initiatives if they have not already. Introducing
Python into their technology stack is an important step, but companies should consider factors such as support
requirements, staffing plans, licensing compliance and security. By addressing these needs early on, data science
teams can focus on unlocking the power of their data and driving innovation forward.
ABOUT ACTIVEPYTHON
ActivePython is a leading Python distribution used by large enterprises, government and community developers. With
over 300 of the top open source packages included for data science, machine learning, web application and general
Python development, ActivePython delivers proven open source software with enterprise-level security and support.
ActivePython is made by ActiveState, a founding member of the Python Software Foundation, trusted by millions of
developers and 97% of Fortune-1000 companies.
Getting started with ActivePython for data science is easy. Your team can start writing algorithms for free with
Community Edition, and learn more about commercial options for use in production at www.activestate.com
ActiveState Software Inc.
Phone: +1.778.786.1100
Fax: +1.778.786.1133
Toll-free in North America: 1.866.631.4581
business-solutions@activestate.com
ABOUT ACTIVESTATE
ActiveState, the Open Source Languages Company, believes that enterprises gain a competitive advantage when they are able to quickly create, deploy, and efficiently manage software solutions
that immediately create business value, but they face many challenges that prevent them from doing so. The Company is uniquely positioned to help address these challenges through our
experience with enterprises, people and technology. ActiveState is proven for the enterprise: More than two million developers and 97% of Fortune-1000 companies use ActiveState’s end-to-end
solutions to develop, distribute, and manage their software applications. Global customers like Bank of America, CA, Cisco, HP, Lockheed Martin and Siemens trust ActiveState to save time, save
money, minimize risk, ensure compliance, and reduce time to market.
© 2017 ActiveState Software Inc. All rights reserved. ActiveState®, ActivePerl®, ActiveTcl®, ActivePython®, Komodo®, ActiveState Perl Dev Kit®, ActiveState Tcl Dev Kit®, ActiveGo™, ActiveRuby™,
ActiveNode™, ActiveLua™ and The Open Source Languages Company™ are all trademarks of ActiveState.