R Vs Python For Data Science
In the battle of "best" data science tools, Python and R both have their pros and cons.
Selecting one over the other will depend on the use cases, the cost of learning, and the other
common tools required.
By Martijn Theuwissen, DataCamp.
At DataCamp, our students often ask us whether they should use R and/or Python for their
day-to-day data analysis tasks. Although we mainly offer interactive R tutorials, we always
answer that this choice depends on the type of data analytical challenge that they are facing.
Both Python and R are popular programming languages for statistics. While R's
functionality is developed with statisticians in mind (think of R's strong data visualization
capabilities!), Python is often praised for its easy-to-understand syntax.
In this post, we will highlight some of the differences between R and Python, and how they
both have a place in the data science and statistics world. If you prefer a visual
representation, make sure to check out the corresponding infographic Data Science Wars:
R vs Python.
Introducing R
Ross Ihaka and Robert Gentleman created the open-source language R in 1995 as an
implementation of the S programming language. The purpose was to develop a language
that focused on delivering a better and more user-friendly way to do data analysis, statistics
and graphical models. At first, R was primarily used in academia and research, but lately
the enterprise world is discovering R as well. This makes R one of the fastest growing
statistical languages in the corporate world.
One of the main strengths of R is its huge community that provides support through mailing
lists, user-contributed documentation and a very active Stack Overflow group. There is also
CRAN, a huge repository of curated R packages to which users can easily contribute.
These packages are a collection of R functions and data that make it easy to immediately
get access to the latest techniques and functionalities without needing to develop everything
from scratch yourself.
Finally, if you're an experienced programmer, you probably won't have a hard time getting
up to speed with R. As a beginner, however, you might find yourself struggling with the
steep learning curve. Luckily, there are many great learning resources you can consult
nowadays.
Introducing Python
Python was created by Guido van Rossum in 1991 and emphasizes productivity and code
readability. Programmers who want to delve into data analysis or apply statistical
techniques are some of the main users of Python for statistical purposes.
The closer you get to working in an engineering environment, the more likely it is you
might prefer Python. It's a flexible language that is great for doing something novel, and given
its focus on readability and simplicity, its learning curve is relatively gentle.
Similar to R, Python has packages as well. PyPI is the Python Package Index and consists
of libraries to which users can contribute. Just like R, Python has a great community, but it
is a bit more scattered, since it's a general purpose language. Nevertheless, Python for data
science is rapidly claiming a more dominant position in the Python universe: the
expectations are growing and more innovative data science applications will see their origin
here.
R and Python: The General Numbers
On the web, you can find many numbers comparing the adoption and popularity of R and
Python. While these figures often give a good indication of how these two languages are
evolving in the overall ecosystem of computer science, it's hard to compare them side by
side. The main reason for this is that you will find R only in a data science environment.
As a general purpose language, Python, on the other hand, is widely used in many fields,
such as web development. This often biases the ranking results in favor of Python, while
the salaries are affected somewhat negatively.
You can use Python when your data analysis tasks need to be integrated with web apps or if
statistics code needs to be incorporated into a production database. Being a fully fledged
programming language, it's a great tool for implementing algorithms for production use.
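As a rough illustration of what such integration can look like, the sketch below wraps a small analysis in a function that returns JSON, a format most web apps can consume. It uses only the standard library; the function names summarize and summary_as_json and the sample numbers are hypothetical and chosen purely for illustration.

```python
import json
import statistics


def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


def summary_as_json(values):
    """Serialize the summary so a web app or another service can consume it."""
    return json.dumps(summarize(values))


# Hypothetical response times in milliseconds, made up for illustration.
response_times = [120, 135, 150, 110, 160]
print(summary_as_json(response_times))
```

In a real application, a function like this would typically be called from a web framework's request handler rather than printed.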
While the immaturity of Python's data analysis packages was an issue in the past, this has
improved significantly over the years. Make sure to install NumPy/SciPy (scientific
computing) and pandas (data manipulation) to make Python usable for data analysis. Also
have a look at matplotlib to make graphics, and scikit-learn for machine learning.
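To give a feel for how these libraries divide the work, here is a minimal sketch that assumes NumPy and pandas are installed. The column names and values are invented for illustration; pandas handles the grouping and NumPy the array arithmetic.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements; the column names and values are made up.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "value": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# pandas: split the rows by group and compute the mean of each group.
group_means = df.groupby("group")["value"].mean()

# NumPy: vectorized arithmetic on the underlying array, here a z-score
# (using the population standard deviation, NumPy's default).
values = df["value"].to_numpy()
z_scores = (values - values.mean()) / values.std()

print(group_means)
print(np.round(z_scores, 2))
```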
Unlike R, Python has no clear winning IDE. We recommend having a look at
Spyder, IPython Notebook and Rodeo to see which one best fits your needs.
R and Python: The Data Science Numbers
If you look at recent polls that focus on programming languages used for data analysis, R
is often the clear winner. If you focus specifically on the Python and R data analysis
communities, a similar pattern appears.
Despite the above figures, there are signals that more people are switching from R to
Python. Furthermore, there is a growing group of individuals using a combination of both
languages when appropriate. This is exactly in line with what we recommend to our
students as well.
If you're planning to start a career in data science, you will do well with either language. Job
trends indicate an increasing demand for both skills, and wages are well above average.
R: Pros and Cons
Pro: Data visualization
Visualized data can often be understood more efficiently and effectively than the raw
numbers alone. R and visualization are a perfect match. Some must-see visualization
packages are ggplot2, ggvis, googleVis and rCharts.
Pro: R ecosystem
R has a rich ecosystem of cutting-edge packages and an active community. Packages are
available at CRAN, Bioconductor and GitHub. You can search through all R packages at
Rdocumentation.
Pro: R is the lingua franca of data science
R is developed by statisticians for statisticians. They can communicate ideas and concepts
through R code and packages, and you don't necessarily need a computer science background
to get started. Furthermore, it is increasingly adopted outside of academia.
Pro/Con: R is slow
R was developed to make the life of statisticians easier, not the life of your computer.
Although R can be experienced as slow due to poorly written code, there are multiple
projects that aim to improve R's performance: pqR, Renjin, FastR, Riposte and many more.
Con: R has a steep learning curve
R's learning curve is non-trivial, especially if you come from a GUI for your statistical
analysis. Even finding packages can be time consuming if you're not familiar with R.
Python: Pros and Cons
Pro: IPython Notebook
The IPython Notebook makes it easier to work with Python and data. You can easily share
notebooks with colleagues, without them having to install anything. This drastically
reduces the overhead of organizing code, output and notes files. This will allow you to
spend more time doing real work.
Pro: A general purpose language
Python is a general purpose language that is easy and intuitive. This gives it a relatively flat
learning curve, and it increases the speed at which you can write a program. In short, you
need less time to code and you have more time to play around with it!
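As one small example of that readability, the sketch below builds a frequency table from CSV data using only the standard library. The survey rows are invented for illustration; in practice they would come from a file.

```python
import csv
import io
from collections import Counter

# Hypothetical survey data, made up for illustration.
raw = io.StringIO(
    "respondent,language\n"
    "1,R\n"
    "2,Python\n"
    "3,Python\n"
    "4,R\n"
    "5,Python\n"
)

# Count how often each language appears in the "language" column.
counts = Counter(row["language"] for row in csv.DictReader(raw))

# Print the frequency table, most common answer first.
for language, n in counts.most_common():
    print(f"{language}: {n}")
```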
Con: Visualizations
Visualizations are an important criterion when choosing data analysis software. Although
Python has some nice visualization libraries, such as Seaborn, Bokeh and Pygal, there are
maybe too many options to choose from. Moreover, compared to R, visualizations are
usually more convoluted, and the results are not always so pleasing to the eye.
Con: Python is a challenger