Python Extra Tutorial
1. Type your Python command! It can be a multi-line command too – if you hit Return/Enter, it won't run; it will just start a new line in the same cell!
2. To actually run your command, hit SHIFT + ENTER!
3. Start typing and hit TAB! If it's possible, Jupyter will auto-complete your expression (e.g. for Python commands or for variables that you have already defined). If there is more than one possibility, you can choose from a drop-down menu.
Python Basics
Great! You have everything from the technical side to start coding in Python! Now this tutorial will start
off with the base concepts that you must learn before we go into how to use Python for Data Science.
The six base concepts will be:
1. Variables and data types
2. Data Structures in Python
3. Functions and methods
4. If statements
5. Loops
6. Python syntax essentials
Python Basics 1: Variables and Data types
In Python we like to assign values to variables. Why? Because it makes our code better — more flexible,
reusable and understandable. At the same time one of the trickiest things in coding is exactly this
“assignment concept.” When we refer to something, that refers to something, that refers to something…
well, understanding that needs some brain capacity. But don’t you worry, you will get used to it – and you
will love it!
Let’s see how it works!
Say we have a dog ('Freddie'), and we would like to store some of his attributes (name, age, is_vaccinated, birth_year, etc.) in Python variables. We will type this into a Jupyter Notebook cell:
dog_name = 'Freddie'
age = 9
is_vaccinated = True
height = 1.1
birth_year = 2001
Note: we could have done this one variable per cell. But this all-in-one solution was easier and more elegant.
From now on, if we type these variable names, the assigned values will be returned:

Variable        Value       Data type
dog_name        'Freddie'   str (short for string)
age             9           int (short for integer)
is_vaccinated   True        bool (short for Boolean)
height          1.1         float (floating-point number)

There are many more data types, but as a start, knowing these four will be good enough; the rest will come along the way.
It's important to know that in Python every variable is overwritable. E.g. if we now run:
dog_name = 'Eddie'
in our Jupyter Notebook, our dog won’t be Freddie any more…
What can we do with a and b? Well, first of all, a bunch of basic arithmetic operations! It's nothing special – you could have figured these out by common sense – but just in case, here's the list (remember: a is 3 and b is 4):

Operator   What does it do?       Result in our example
a + b      Adds a to b            7
a - b      Subtracts b from a     -1
a * b      Multiplies a by b      12
a / b      Divides a by b         0.75
Note: try it for yourself with your values in your Jupyter Notebook! It’s fun!
We can also use these variables with comparison operators. The results will always be Boolean values! (Remember? Booleans can only be True or False.) a and b are still 3 and 4.
Operator   What does it do?                   Result in our example
a == b     Is a equal to b?                   False
a != b     Is a not equal to b?               True
a < b      Is a less than b?                  True
a <= b     Is a less than or equal to b?      True
a > b      Is a greater than b?               False
a >= b     Is a greater than or equal to b?   False
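You can verify all of these in your notebook – a quick sketch, assuming a = 3 and b = 4 as before:

a = 3
b = 4
a == b   # False
a != b   # True
a < b    # True
a > b    # False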
This is easy and maybe less exciting, but again: just start to type this into your notebook, run your
commands and start to combine things – and it’s gonna be much more fun!
Speaking of which! Spice things up with some exercises!
Test yourself #1
Here are some new variables:
a=1
b=2
c=3
d = True
e = 'cool'
What will be the returned data type and the exact result of this operation?
a == e or d and c > b
Note: First try to find it out without typing it into Python – then check if you have guessed right!
.
.
.
The answer is: it’s gonna be a Boolean and it will be True.
Why? Because:
a == e is False – as 1 is not equal to ‘cool’
d is True by definition
c > b is True, because 3 is greater than 2
So a == e or d and c > b translates to: False or True and True, which is True.
Test yourself #2
Use the variables from the previous assignment:
a=1
b=2
c=3
d = True
e = 'cool'
But this time try to figure out the result of this slightly modified expression:
not a == e or d and not c > b
Uh-oh, wait a minute! There is a trick here! To give a proper answer you have to know one more rule!
The evaluation order of the logical operators is: 1. not 2. and 3. or
.
.
.
Here’s the solution: True.
Why?
Let’s see! Using the previous exercise’s logic, this is what we have:
not False or True and not True
As we have discussed, the first logical operator evaluated is the not. After firing all the nots, this is what we have:
True or True and False
Then comes the and: True and False is False, which leaves us with:
True or False
And finally the or makes the result True.
Conclusion
Done with episode 1!
Did you realize that you have just started to code in Python 3? Wasn’t it easy and fun?
Well, good news: the rest of Python is just as easy as this was. The difficulty will come from the
combination of these simple things… But that’s why learning the basics very well is so important!
So stay with me – in the next chapter of “Python for Data Science” I’ll introduce the most important Data
Structures in Python!
If you want to learn more about how to become a data scientist, take my 50-minute video
course: How to Become a Data Scientist. (It’s free!)
Also check out my 6-week online course: The Junior Data Scientist’s First Month video course.
Cheers,
Tomi Mester
Python Data Structures (Python for Data Science Basics #2)
Why care about Python Data Structures?
Imagine that you have a book on your desk. I have one on mine: P. & A. Bruce: Practical Statistics
for Data Scientists. If I want to store this info in Python, I can put it into a variable.
my_book = "Practical Statistics for Data Scientists"
Done!
But hey, I just spotted two more books on the other side of my desk: Dan Brown's Digital Fortress and George R. R. Martin's A Game of Thrones. How do I store these two new pieces of information? Maybe I can set up two new variables:
my_book2 = "Digital Fortress"
my_book3 = "A Game of Thrones"
Wait a minute! I’ve just realized I have a whole bookshelf behind me…
Do you see the problem? Sometimes in Python we need to store relevant information together in one object – instead of several small variables.
This is why we have Data Structures!
Python Data Structures #1: List
It's important to know that in Python, a list is an object – and generally speaking it's treated like any other data type (e.g. integers, strings, Booleans, etc.). This means that you can assign your list to a variable, so you can store it and access it more easily:
my_first_list = [3, 4, 1, 4, 5, 2, 7]
my_first_list
A list can hold any type of data, not just integers – strings, Booleans, even other lists.
Interesting, huh? Do you remember Freddie, the dog from the previous article?
You can store his attributes in one list instead of 5 different variables:
dog = ['Freddie', 9, True, 1.1, 2001]
Now let’s say that Freddie has two belongings: a bone and a little ball. We can store those
belongings as a list inside our first list.
dog = ['Freddie', 9, True, 1.1, 2001, ['bone', 'little ball']]
Actually we can do this list-in-a-list thingy infinite times – and believe it or not, this simple
concept (the official name is “nested lists,” by the way) will be essential when it comes to the
actual Data Science part of Python – e.g. when we create some multidimensional numpy arrays to
run correlation analyses… but let’s not get into it yet! The only thing you should remember is that
you can store lists in lists.
Or try this:
sample_matrix = [[1, 4, 9], [1, 8, 27], [1, 16, 81]]
Do you feel scientific? You should, because you have just created a 3-by-3 2D matrix.
How to access a specific element of a Python list?
Now that we have stored these values, it’s really essential to know how to access them in the
future. As you have already seen, you can get the whole Python list returned if you type the
right variable name.
E.g.
dog
But how do you call one particular item from your list? Firstly, think a bit about how you can
refer to a value in theory… The only thing that comes into play is the position of the value. E.g. if
you want to call the first element on the dog list, you have to type the name of the list and the
number of the element between brackets, like this: [1]. Try this:
dog[1]
What??? 9 was the second element on the list, not the first. Well, not in Python… Python uses
so-called “zero-based indexing”, which means that the first element’s number is [0], the second
is [1], the third is [2] and so on. This is something you have to keep in mind, when working with
Python Data Structures.
Note: “But why is that?” Hmm, tough topic! I don’t dare to say, “because of nerds…” So instead, I’ll just
link this nice open letter by Prof Dijkstra from
1982: http://www.cs.utexas.edu/users/EWD/transcriptions/EWD08xx/EWD831.html
Anyway, don’t think too much about this… Just accept and apply this strange rule!
But here’s a detailed example!
Freddie the dog:
dog = ['Freddie', 9, True, 1.1, 2001, ['bone', 'little ball']]
Try to print all the list elements one by one:
dog[0]
dog[1]
dog[2]
dog[3]
dog[4]
dog[5]
If this is not 100% clear yet, I suggest playing around a bit with the sample_matrix = [[1, 4, 9], [1,
8, 27], [1, 16, 81]] data set and you will learn the trick!
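To reference an element of a nested list, you simply chain the brackets – a quick sketch:

sample_matrix = [[1, 4, 9], [1, 8, 27], [1, 16, 81]]
sample_matrix[2]       # the third sub-list: [1, 16, 81]
sample_matrix[2][0]    # the first element of the third sub-list: 1

dog[5][1]              # 'little ball' – the second item of Freddie's belongings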
Python Data Structures #2: Tuples
What is a Python tuple? First of all: as a junior/aspiring Data Scientist, you don't have to care too much about tuples. If you want, you can even skip this section.
If you have stayed:
A Python tuple is almost the same as a Python list, with a few small differences.
1. Syntax-wise: when you set up a tuple, you won’t use brackets, but parentheses.
List:
book_list = ['A Game of Thrones', 'Digital Fortress', 'Practical Statistics for Data
Scientists']
Tuple:
book_tuple = ('A Game of Thrones', 'Digital Fortress', 'Practical Statistics for Data
Scientists')
2. A Python list is mutable – so you can add, remove and change items in it. On the other
hand, a Python tuple is immutable, so once it’s set up, it’s sort of “set in stone.” This
strictness can be handy in some cases to make your code safer.
3. Python tuples are slightly faster than Python lists with the same calculations.
Other than that, you can use a tuple pretty much the same way as a list. Even returning an item happens via the same bracket syntax. (Try book_tuple[1] for your freshly created tuple.)
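Here's a minimal sketch of that second difference, the immutability:

book_tuple = ('A Game of Thrones', 'Digital Fortress', 'Practical Statistics for Data Scientists')
book_tuple[1]                # 'Digital Fortress' – reading works just like with a list

# book_tuple[1] = 'Inferno'  # this would raise a TypeError:
                             # 'tuple' object does not support item assignment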
Again: none of the above will be your concern when you just start off with Python coding, but
it’s good to know at least this bit about tuples.
Python Data Structures #3: Dictionaries
Dictionaries are a whole different story. They are actually very different from lists – and very
commonly applied and useful in data science projects.
The main concept of dictionaries is that for every value you have a unique key. Take a look at
Freddie the dog again:
dog = ['Freddie', 9, True, 1.1, 2001, ['bone', 'little ball']]
These are the values that we want to store about a dog. In a dictionary you can assign a key to each of these values, so you can understand better what value stands for what.
dog_dict = {'name': 'Freddie', 'age': 9, 'is_vaccinated': True, 'height': 1.1, 'birth_year': 2001,
'belongings': ['bone', 'little ball']}
As you can see, Python already formats the output of a dictionary nicely. For better readability you can do the same yourself right from the start – let's put each key-value pair on a new line:
dog_dict = {'name': 'Freddie',
'age': 9,
'is_vaccinated': True,
'height': 1.1,
'birth_year': 2001,
'belongings': ['bone', 'little ball']}
As you can see, a nested list (belongings in this example) as a value in a dictionary is not a
problem.
And this is how you access a specific value in a Python dictionary – by referring to its key between brackets:
dog_dict['name']
Note 1: maybe you are wondering whether you can still use a number to call a dictionary value. You can't: a Python dictionary has no positional index, so none of the key-value pairs can be reached by a number – only by their key. (In older Python versions dictionaries didn't even keep their order; since Python 3.7 they preserve the insertion order, but they still can't be indexed by position.)
Note 2: maybe you are also wondering if you can return a key by inputting a value, and not just a value by inputting a key. Bad news: there is no direct way to do that – dictionaries are built for key-to-value lookups, not the other way around.
Test yourself!
Tadaaa! End of the article! It’s time to test yourself! You have learned a lot of important new
things about Python Data Structures today. If you haven’t been doing the “Test yourself”
sections in my articles, please make an exception for this one. Python Data Structures are
something that you will use all the time when you work as a Data Scientist, so do yourself a favor
and practice them now!
EXERCISE #4
test[0]['Arizona'] –» This is basically the next step of exercise #2 – we are calling the 'Phoenix' value with its key: 'Arizona'.
EXERCISE #5
test[4][2] –» And this one is related to exercise #3 – referring to 'jeans' by its number – don't forget the zero-based indexing.
EXERCISE #6
test[4][3]['socks2'] –» And one more step – calling the item of a dictionary of a nested list
within a list – by its key: 'socks2'.
Conclusion
Nice job! You are done with another Python tutorial! This is almost everything you have to know about Python Data Structures. Well, in fact, there will be a lot of small but important details (e.g. how to add, remove and change elements in a list or in a dictionary)… but to get there we need to talk a little bit about Python functions and methods, and some other exciting things first! Continue here:
Python Functions and Methods
If you want to learn more about how to become a data scientist, take my 50-minute
video course: How to Become a Data Scientist. (It’s free!)
Also check out my 6-week online course: The Junior Data Scientist’s First Month video
course.
Cheers,
Tomi Mester
Python Built-in Functions and Methods (Python for Data
Science Basics #3)
What are Python functions and methods?
Let’s start with the basics. Say we have a variable:
a = 'Hello!'
Here’s a simple example of a Python function:
len(a)
Result: 6
And an example for a Python method:
a.upper()
Result: 'HELLO!'
So what are Python functions and methods? In essence, they transform something into something else. In this case the input was 'Hello!' and the output was the length of this string (6), and then the capitalized version: 'HELLO!'. Of course, this function and this method are not the only ones you can use: there are plenty of them. Combining them will help you in every part of your data project – from data cleaning to machine learning. Everything.
print()
We have already used print(). It prints your stuff to the screen.
Example: print("Hello, World!")
abs()
returns the absolute value of a numeric value (e.g. integer or float). Obviously it can’t be a string.
It has to be a numeric value.
Example: abs(-4/3)
round()
returns the rounded value of a numeric value.
Example: round(-4/3)
min()
returns the smallest item of a list or of the typed-in arguments. It can even be a string.
Example 1: min(3,2,5)
Example 2: min('c','a','b')
max()
returns the largest item of a list or of the typed-in arguments. Just like min(), it works with strings too.
Example 1: max(3,2,5)
Example 2: max('c','a','b')
sum()
It sums a list. The list can have all types of numeric values, although it handles floats… well, not
smartly.
Example 1:
a = [3, 2, 1]
sum(a)
Example 2:
b = [4/3, 2/3, 1/3, 1/3, 1/3]
sum(b)
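Note: on most machines the second sum won't return exactly 3.0 but something like 2.9999999999999996 – that's standard floating-point imprecision, not a bug in your code.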
len()
returns the number of elements in a list or the number of characters in a string.
Example: len('Hello!')
type()
returns the type of the variable.
Example 1:
a = True
type(a)
Example 2:
b=2
type(b)
These are the built-in Python functions that you will use quite regularly. If you want to see all of
them, here’s the full list: https://docs.python.org/3/library/functions.html
But I’ll also show you more in my upcoming tutorials.
Methods for Python Strings
a.lower()
returns the lowercase version of a string.
Example:
a = 'MuG'
a.lower()
a.upper()
the opposite of lower()
a.strip()
if the string has whitespaces at the beginning or at the end, it removes them.
Example:
a = ' Mug '
a.strip()
a.replace('old', 'new')
replaces occurrences of a given substring with another string. Note that it's case sensitive.
Example:
a = 'muh'
a.replace('h','g')
a.split('delimiter')
splits your string into a list. Your argument specifies the delimiter.
Example:
a = 'Hello World'
a.split(' ')
Note: in this case the space is the delimiter.
'delimiter'.join(a)
It joins elements of a list into one string. You can specify the delimiter again.
Example:
a = ['Hello', 'World']
' '.join(a)
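These string methods can even be chained one after another – a quick sketch:

a = '  Hello World  '
a.strip().lower().replace('world', 'python').split(' ')
# result: ['hello', 'python']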
Methods for Python Lists
Do you remember the last article, where we went through the Python data structures? Let's talk a little bit about them again. Last time we discussed how to create a list and how to access its elements. But I haven't told you how to modify a list. Any guesses? Yes: you will need the Python list methods!
Let’s bring back our favorite Python Dog, Freddie:
dog = ['Freddie', 9, True, 1.1, 2001, ['bone', 'little ball']]
Let’s see how we can modify this list!
a.append(arg)
The .append() method adds an element to the end of our list. In this case, let’s say we want to
add the number of legs Freddie has (which is 4).
Example:
dog.append(4)
dog
a.remove(arg)
If we want to remove the birth year, we can do it using the .remove() method. We have to
specify the element that we want to remove and Python will remove the first item with that
value from the list.
dog.remove(2001)
dog
a.count(arg)
returns the number of occurrences of the specified value in a list.
Example:
dog.count('Freddie')
a.clear()
removes all elements of the list. It will basically delete Freddie. No worries, we will get him back.
Example:
dog.clear()
dog
By the way, here you can find the full list of list methods in
Python: https://docs.python.org/3/tutorial/datastructures.html
Methods for Python Dictionaries
As with lists, there are some important dictionary methods to learn about.
Here’s Freddie again (see, I told you he’d be back):
dog_dict = {'name': 'Freddie',
'age': 9,
'is_vaccinated': True,
'height': 1.1,
'birth_year': 2001,
'belongings': ['bone', 'little ball']}
dog_dict.keys()
will return all the keys from your dictionary.
dog_dict.values()
will return all the values from your dictionary.
dog_dict.clear()
will delete everything from your dictionary.
Note:
Adding an element to a dictionary doesn't require a method; you do it by simply defining a new key-value pair, like this:
dog_dict['key'] = 'value'
Eg.
dog_dict['name'] = 'Freddie'
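For instance, to add a brand-new key (a hypothetical 'favorite_toy' key, just for illustration):

dog_dict['favorite_toy'] = 'little ball'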
Okay, these are all the methods you should know for now! We went through string, list and
dictionary Python methods!
It’s time to test yourself!
Test yourself!
For this exercise you will have to use not only what you have learned today, but what you have
learned about Python Data Structures and variable types too! Okay, let’s see:
1. Take this list:
test_yourself = [1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 5]
2. Calculate the mean of the list elements – by using only those things that you have read in
this and the previous articles!
3. Calculate the median of the list elements – by using only those things that you have read
in this and the previous articles!
.
.
.
And the solutions are:
2) sum(test_yourself) / len(test_yourself)
Here sum() adds up the numbers and len() counts the elements; dividing the first by the second returns the mean. The result is: 2.909090909090909
3) test_yourself[round(len(test_yourself) / 2) - 1]
We are lucky to have a list with an odd number of elements.
Note: this formula won’t work for a list with an even number of elements.
len(test_yourself) / 2 basically tells us where in the list to look for our middle number – which will be the median. The result is 5.5; in fact, for a list with an odd number of elements, len() / 2 is always 0.5 less than the (1-based) position of the middle element. So let's round this 5.5 up to 6 by using round(len(test_yourself) / 2). That's right: we can put a function inside a function. (One caveat: Python 3's round() rounds halves to the nearest even number, so round(5.5) is 6 but round(2.5) is 2 – this trick therefore doesn't work for every odd list length, e.g. a 5-element list.) Then subtract one because of the zero-based indexing: round(len(test_yourself) / 2) - 1
And eventually use this result as the index of the list: test_yourself[round(len(test_yourself) / 2) -
1] or replace it with the exact number: test_yourself[5]. The result is: 3.
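For the even-length case the note above mentions, you'd average the two middle elements instead – just as a sketch:

test_even = [1, 2, 3, 4]
middle = int(len(test_even) / 2)                  # 2
(test_even[middle - 1] + test_even[middle]) / 2   # (2 + 3) / 2 = 2.5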
What’s the difference between Python functions and methods?
After reading this far in the article, I bet you have this question: “Why on Earth do we have both
functions and methods, when they practically do the same thing?”
I remember that when I started learning Python, I had a hard time answering this question. This
is still the most confusing topic for newcomers in the Python-world… The full answer is very
technical and you are not there yet. But here’s a little help for you to avoid confusion.
Firstly, start with the obvious. There is a clear difference in the syntax:
A function looks like this: function(something)
And a method looks like this: something.method()
(Look at the examples above!)
So why do we have both methods and functions in Python? The official answer is that there is a
small difference between them. Namely: a method always belongs to an object (e.g. in
the dog.append(4) method .append() needed the dog object to be applicable), while a function
doesn’t necessarily. To make this answer even more twisted: a method is in fact nothing else but
a specific function. Got it? All methods are functions, but not all functions are methods!
If this makes no sense to you (yet), don’t you worry. I promise, the idea will grow on you as you
use Python more and more – especially when you start to define your own functions and
methods.
But just in case, here’s a little extra advice from me:
In the beginning, learning Python functions and methods will be like learning the articles (der,
die, das) of the German language. You have to learn the syntax, use it the way you have learned
and that’s it.
Just like in German, there are some general rules of thumb that can help you recall things. The main one is that functions are usually applicable to multiple types of objects, while methods are not. E.g. sorted() is a function and it works with strings, lists, tuples, etc. – while .upper() is a method and it only works with strings.
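A quick sketch of this rule of thumb:

sorted('bca')         # ['a', 'b', 'c'] – the function accepts many types
sorted([3, 1, 2])     # [1, 2, 3]
'bca'.upper()         # 'BCA' – the method belongs to string objects
# [3, 1, 2].upper()   # this would raise an AttributeError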
But again: my general advice here is that you should not put too much effort into understanding
the difference between methods and functions at this point; just learn the ones I mentioned in
this article and you’ll be a happy Python user.
Conclusion
Great, you have learned 20+ Python methods and functions. This is a good start, but remember:
these are only the basics. In the next episodes, we will rapidly extend this list by importing new
data science Python libraries with new functions and new methods!
As a next step, let’s learn a bit about loops and if statements! Here is the link to continue: Python
If Statements (Explained).
If you want to learn more about how to become a data scientist, take my 50-minute
video course: How to Become a Data Scientist. (It’s free!)
Also check out my 6-week online course: The Junior Data Scientist’s First Month video
course.
Cheers,
Tomi Mester
Python If Statements Explained (Python for Data Science
Basics #4)
Written by Tomi Mester on January 8, 2018
Last updated on August 03, 2019
We use if statements in our everyday life all the time – even if our everyday life is not written in
Python. If the light is green then I’ll cross the road; otherwise I’ll wait. If the sun is up then I’ll get
out of bed; otherwise I’ll go back to sleep. Okay, maybe it’s not this direct, but when we take
actions based on conditions, our brain does what a computer would do: evaluate the conditions
and act upon the results. Well, a computer script doesn’t have a subconscious mind, so for
practicing data science we have to understand how an if statement works and how we can apply
it in Python!
Let’s say we have two values: a = 10 and b = 20. We compare these two values: a == b. This
comparison has either a True or a False output. (Test it in your Jupyter Notebook!)
Let's feed this comparison into an if statement:
a = 10
b = 20
if a == b:
    print('yes')
else:
    print('no')
Run this mini script in your Jupyter Notebook! The result will be (obviously): no.
Now, try the same – but set b to 10!
a = 10
b = 10
if a == b:
    print('yes')
else:
    print('no')
The returned message is yes.
Python if statement syntax
Let’s take a look at the syntax, because it has pretty strict rules.
The basics are simple:
You have:
1. an if keyword, then
2. a condition, then
3. a statement, then
4. an else keyword, then
5. another statement.
Two more details are easy to miss here: the colon (:) at the end of the if and else lines, and the indentation of the statements under them. If you miss either of these, an error message will be returned saying "invalid syntax" and your Python script will fail.
Of course, you can make this even more complex if you want, but the point is: having multiple
operators in an if statement is absolutely possible – in fact, it’s pretty common in real life
scenarios!
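For instance, a quick sketch with two conditions joined by and:

a = 10
b = 20
if a == 10 and b == 20:
    print('both conditions are true')
else:
    print('at least one condition is false')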
Another example:
a = 10
b = 11
c = 10
if a == b:
    print('first condition is true')
elif a == c:
    print('second condition is true')
else:
    print('nothing is true. existence is pain.')
Sure enough the result will be "second condition is true".
You can add as many elif branches as you like, building up a huge if-elif-elif-…-elif-else sequence!
Aaand… This was more or less everything you have to know about Python if statements. It’s time
to:
Test yourself!
Here’s a random integer: 918652728452151.
First, I’d like to know 2 things about this number:
Is it divisible by 17?
Does it have more than 12 digits?
If both of these conditions are true, then I want to print “super17“.
And if either of the conditions are false, then I’d like to run a second test on it:
Is it divisible by 13?
Does it have more than 10 digits?
If both of these two new conditions are true, then I want to print “awesome13“.
And if the original number is classified as neither "super17" nor "awesome13", then I'll just print:
"meh, this is just an average random number".
So: is 918652728452151 a super17, an awesome13 or just an average random number?
Okay! Ready. Set. Go!
The solution
918652728452151 is a super17 number!
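Here's a sketch of one possible solution, built directly from the task description:

x = 918652728452151
if x % 17 == 0 and len(str(x)) > 12:
    print("super17")
elif x % 13 == 0 and len(str(x)) > 10:
    print("awesome13")
else:
    print("meh, this is just an average random number")

The if line checks both conditions at once: x % 17 == 0 tests the divisibility by 17, and len(str(x)) > 12 turns the number into a string and counts its digits. Our number passes both tests, so super17 is printed.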
Had the first condition been False, the elif line would have run the same kind of check – only it would have tested the divisibility by 13 (and not 17), and the number of digits should have been greater than 10 (and not 12). If that weren't true either, the else statement would have run to print("meh, this is just an average random number").
That’s the solution! Wasn’t too difficult, was it?
Summary
If statements are widely used in every programming language. Now you know how to use them too! The logic is super clear, and on top of that, in Python the syntax almost reads like plain English…
Anyway. This was my introduction into Python If Statements. Next time we will continue
with Python for loops!
If you want to learn more about how to become a data scientist, take my 50-minute
video course: How to Become a Data Scientist. (It’s free!)
Also check out my 6-week online course: The Junior Data Scientist’s First Month video
course.
Cheers,
Tomi Mester
Python For Loops Explained (Python for Data Science Basics
#5)
Written by Tomi Mester on January 17, 2018
Last updated on August 03, 2019
Remember that I told you last time that Python if statements are similar to how our brain
processes conditions in our everyday life? That’s true for for loops too. You go through your
shopping list until you’ve collected every item from it. The dealer gives a card for each player
until everyone has five. The athlete does push-ups until reaching one-hundred… Loops
everywhere! As for for loops in Python: they are perfect for processing repetitive programming
tasks. In this article, I’ll show you everything you need to know about them: the syntax, the logic
and best practices too!
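For example, take Freddie's list from the Data Structures article and loop through it – a quick sketch:

dog = ['Freddie', 9, True, 1.1, 2001, ['bone', 'little ball']]
for i in dog:
    print(i)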
The result is the elements of the list one by one, in separate lines:
Freddie
9
True
1.1
2001
['bone', 'little ball']
Wonderful!
But how does this actually become useful? Take another example, a list of numbers:
numbers = [1, 5, 12, 91, 102]
Say that we want to square each of the numbers in this list!
Note: Unfortunately the numbers * numbers formula won't work… I know, it might sound logical at first, but as you get deeper into Python you will see that in fact it is not logical at all.
We have to do this:
for i in numbers:
    print(i * i)
The result will be:
1
25
144
8281
10404
Let’s break down this flowchart and study all the little details… I’ll guide you through it step by
step. As an example I’ll use the previous script with the numbers and their squared values:
numbers = [1, 5, 12, 91, 102]
for i in numbers:
    print(i * i)
1.) Define an iterable! (E.g. the list we defined earlier: numbers = [1, 5, 12, 91, 102]).
2.) When you set up a for loop, the first line will look pretty similar to this: for i in numbers:
Here, for and in are Python keywords and numbers is the name of our list… But what I want to talk
more about in this section is the i variable. It’s a “temporary” variable and its only role is to store
the given element of the list that we will work with in the given iteration of the loop. Even if this
variable is called i most of the time (in online tutorials or books for instance), it’s good to know
that the naming is totally arbitrary. It could not only be i (for i in numbers), but anything else,
like x (for x in numbers) or hello (for hello in numbers) or whatever you prefer… The point is, set
a variable and don’t forget that you have to refer to it when you want to use it inside the loop.
3.) Let's stick with our numbers example! We take the first element of our iterable (well, again – because of zero-based indexing – technically it's the 0th element of the list), and the first iteration of the loop runs. The 0th element of our list is 1, so the i variable is set to 1.
Note: More info about zero-based indexing: here.
4.) Python checks whether there is an element to process in the given iteration – there is, so the loop's body runs.
5.) The function itself, inside the loop, was print(i * i). As i = 1, the result of i * i will be 1, and 1 will be printed to our screen.
6.) The loop starts over.
7.) We take the next element and since there is an actual next element of the list, the second
iteration of the loop will run! The 1st element of the numbers list is 5.
8.) So i is 5. print(i * i) runs again and the result is printed to our screen: 25.
9.) The loop starts over. We take the next element.
10.) There is a next element. So here comes the third iteration. The 2nd element of the numbers
list is 12.
11.) print(i * i) is 144.
12.) The loop starts over. The next element exists. The iteration runs again.
13.) The 3rd element is 91. The squared value of it is 8281.
14.) The loop starts over. Next element exists. The iteration runs again.
15.) i is 102. The squared value of it is 10404.
16.) The loop starts over. But there is no more “next element.” So the loop ends.
This is a very, very detailed explanation for a 3-line script, right? Don't worry: it's enough if you crunch through this once. In the future, you can just go ahead and use those 3 simple lines, because the underlying logic will be in the back of your mind! I find it very important to write this down, though, because many junior data professionals do not have this logic in the back of their minds… and that reduces the quality of their Python scripts.
Iterating through strings
Okay, going forward!
As I mentioned earlier, you can use other sequences than lists too. Let’s try a string:
my_list = "Hello World!"
for i in my_list:
    print(i)
The result: the characters of the string, printed one by one, each in a new line.
Iterating through range() objects
range() is a built-in function in Python and we use it almost exclusively within for loops. What
does it do? In a nutshell: it generates a list of numbers. Let’s see how it works:
my_list = range(0,10)
for i in my_list:
    print(i)
Note: the first element and the step attributes are optional. If you don’t specify them, then the first
element will be 0 and the step will be 1 by default. Try this in your Jupyter Notebook and check the
result:
my_list = range(10)
for i in my_list:
    print(i)
When can range() be useful? Mostly in these two cases:
1.) You want to go through numbers. For instance, you want to cube the integers between 0 and 9? Not a problem:
my_list = range(0, 10)
for i in my_list:
    print(i * i * i)
2.) You want to go through a list but want to keep the indexes of the elements too.
my_list = [1, 5, 12, 91, 102]
my_list_length = len(my_list)
for i in range(0, my_list_length):
    print(i, my_list[i] * my_list[i])
In this case i will be the index and you can get the actual elements of the list with
the my_list[i] syntax – just as we have learned in the Python Data Structures article.
Anyway: use range() – it will make your job with Python for loops easier!
One more common pitfall: you can't concatenate strings and integers in one print() function by simply using the + sign. This is more a print-function thing than a for-loop thing, but most of the time you will meet this issue in for loops. If you run into the resulting TypeError, one of the good solutions is turning your integers into strings by using the str() function.
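A quick sketch with the numbers list from earlier – the commented-out version raises a TypeError, the second one works:

numbers = [1, 5, 12, 91, 102]

# for i in numbers:
#     print("The square of " + i + " is " + i * i)   # TypeError!

for i in numbers:
    print("The square of " + str(i) + " is " + str(i * i))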
While we're at it, here's the solution to one of this tutorial's exercises: printing a string character by character, first building it up, then tearing it back down.

my_string = "Hello World!"   # any sample string works here
x = 0

for i in my_string:
    x = x + 1
    print(my_string[0:x])

for i in my_string:
    x = x - 1
    print(my_string[0:x])

I think the solution is quite self-explanatory. The only trick is that I set a "counter variable" called x that always shows the number of characters I want to print to the screen in the given iteration. In the first for loop this goes up until I reach the maximum number of characters. After that, in the second for loop, it goes down until I have zero characters on the screen.
Note: If the my_string[0:x] syntax does not look familiar, check the Python Data Structures article –
and the “How to access multiple elements of a Python list?” section.
Conclusion
Python for loops are important, and they are used widely in data scripts. The syntax is simple, but as you have seen, fully understanding the logic behind it requires a little bit of brainwork. By reading this article you got through it, and now you have a solid foundation to build on. So all that's left to do is practice!
Python For Loops and If Statements Combined (Python for
Data Science Basics #6)
Written by Tomi Mester on April 11, 2018
Last updated on August 03, 2019
Last time I wrote about Python For Loops and If Statements. Today we will talk about how to
combine them. In this article, I’ll show you – through a few practical examples – how to combine
a for loop with another for loop and/or with an if statement!
Note: This is a hands-on tutorial. I highly recommend doing the coding part with me – and if you have
time, solving the exercises at the end of the article! If you haven’t done so yet, please work through
these articles first:
How to install Python, R, SQL and bash to practice data science!
Python for Data Science #1 – Tutorial for Beginners – Python Basics
Python for Data Science #2 – Data Structures
Python for Data Science #3 – Functions and methods
Python for Data Science #4 – If statements
Python for Data Science #5 – For loops
Note 2: On mobile the line breaks of the code snippets might look tricky. But if you copy-paste them
into your Jupyter Notebook, you will see the actual line breaks much clearer!
For loop within a for loop – aka the nested for loop
The more complicated the data project you are working on, the higher the chance that you will
bump into a situation where you have to use a nested for loop. This means that you will run an
iteration, then another iteration inside that iteration.
Let’s say you have nine TV show titles put into three categories: comedies, cartoons, dramas.
These are presented in a nested Python list (“lists in a list”):
my_movies = [['How I Met Your Mother', 'Friends', 'Silicon Valley'],
['Family Guy', 'South Park', 'Rick and Morty'],
['Breaking Bad', 'Game of Thrones', 'The Wire']]
You want to count the characters in all these titles and print the results one by one to your
screen, in this format:
"The title [movie_title] is [X] characters long."
How would you do that? Since you have three lists in your main list, to get the movie titles, you
have to iterate through your my_movies list — and inside that list, through every sublist, too:
for sublist in my_movies:
    for movie_name in sublist:
        char_num = len(movie_name)
        print("The title " + movie_name + " is " + str(char_num) + " characters long.")
Note: remember len() is a Python function that results in an integer. To put this integer into a
“printable” sentence, we have to turn it into a string first. I wrote about this in the previous Python For
Loops tutorial.
I know, Python for loops can be difficult to understand for the first time… Nested for loops are
even more difficult. If you have trouble understanding what exactly is happening above, get a
pen and a paper and try to simulate the whole script as if you were the computer — go through
your loop step by step and write down the results.
One more thing:
Syntax! The rules are the same ones you learned when we discussed simple for loops – the only thing that I'd like to emphasize, and that you should definitely watch out for, is the indentation. Using proper indentation is the only way to let Python know in which for loop (the inner or the outer) you want your block of code to run. Just test it out and try to find the differences between the three variants below:
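As a sketch, using the my_movies list from above – the only thing that changes is the indentation of the print() line:

# Example 1 – print() inside the inner loop: runs once per title
for sublist in my_movies:
    for movie_name in sublist:
        char_num = len(movie_name)
        print("The title " + movie_name + " is " + str(char_num) + " characters long.")

# Example 2 – print() in the outer loop: runs once per sublist,
# using only the last title of that sublist
for sublist in my_movies:
    for movie_name in sublist:
        char_num = len(movie_name)
    print("The title " + movie_name + " is " + str(char_num) + " characters long.")

# Example 3 – print() outside both loops: runs only once, after everything,
# using the very last title
for sublist in my_movies:
    for movie_name in sublist:
        char_num = len(movie_name)
print("The title " + movie_name + " is " + str(char_num) + " characters long.")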
If statement within a for loop
Inside a for loop, you can use if statements as well.
Let me use one of the most well-known examples of the exercises that you might be given as the
opening question in a junior data scientist job interview.
The task is:
Go through all the numbers up until 99. Print ‘fizz’ for every number that’s divisible by 3, print
‘buzz’ for every number divisible by 5, and print ‘fizzbuzz’ for every number divisible by 3 and by
5! If the number is not divisible either by 3 or 5, print a dash (‘-‘)!
Here’s the solution!
for i in range(100):
    if i % 3 == 0 and i % 5 == 0:
        print('fizzbuzz')
    elif i % 3 == 0:
        print('fizz')
    elif i % 5 == 0:
        print('buzz')
    else:
        print('-')
As you can see, an if statement within a for loop is perfect to evaluate a list of numbers in a
range (or elements in a list) and put them into different buckets, tag them, or apply functions on
them – or just simply print them.
Again: when you use an if statement within a for loop, be extremely careful with the indentation, because if you misplace it, you can get errors or incorrect results!
Break
There is a special control flow tool in Python that comes in handy pretty often when using if
statements within for loops. And this is the break statement.
Can you find the first 7-digit number that’s divisible by 137? (The first one and only the first one.)
Here’s one solution:
for i in range(0, 10000000, 137):
    if len(str(i)) == 7:
        print(i)
        break
This loop takes every 137th number (for i in range(0, 10000000, 137)) and checks during each iteration whether the number has 7 digits or not (if len(str(i)) == 7). Once it gets to the first 7-digit number, the if statement becomes True and two things happen:
1. print(i) –» The number is printed to the screen.
2. break breaks out of the for loop, so we can make sure that the first 7-digit number was
also the last 7-digit number that was printed on the screen.
Learn more about the break statement (and its twin brother: the continue statement) in the
original Python3 documentation: here.
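A quick sketch of the twin brother, too: continue only skips the rest of the current iteration instead of ending the whole loop.

for i in range(5):
    if i == 2:
        continue
    print(i)   # prints 0, 1, 3, 4 – the 2 is skipped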
Note: you can solve this task more elegantly with a while loop. However, I haven’t written a while loop
tutorial yet, which is why I went with the for loop + break solution!
Test Yourself!
It’s time to test whether you have managed to master the if statement, the for loops and the
combination of these two! Let’s try to solve this small test assignment!
Create a Python script that finds out your age in a maximum of 8 tries! The script can ask you
only one type of question: guessing your age! (e.g. “Are you 67 years old?”) And you can answer
only one of these three options:
less
more
correct
Based on your answer the computer can come up with another guess until it finds out your
exact age.
Note: to solve this task, you will have to learn a new function, too. That’s the input() function! More
info: here.
Ready? 3. 2. 1. Go!
Solution
Here’s my code.
Note 1: One can solve the task with a while loop, too. Again: since I haven’t written about while loops
yet, I’ll show you the for loop solution.
Note 2: If you have an alternative solution, please do not hesitate to share it with me and the other
readers in the comment section below!
down = 0
up = 100
for i in range(1, 10):
    guessed_age = int((up + down) / 2)
    answer = input('Are you ' + str(guessed_age) + ' years old?')
    if answer == 'correct':
        print('Nice')
        break
    elif answer == 'less':
        up = guessed_age
    elif answer == 'more':
        down = guessed_age
    else:
        print('wrong answer')
My logic goes:
STEP 1) I set a range between 0 and 100 and I assume that the age of the “player” will be
between these two values.
down = 0
up = 100
STEP 2) The script always asks the middle value of this range (for the first try it’s 50):
guessed_age = int((up + down) / 2)
answer = input('Are you ' + str(guessed_age) + " years old?")
STEP 3) Once we have the "player's" answer, there are four possible scenarios:

If the guessed age is correct, then the script prints a short answer and ends:
if answer == 'correct':
    print('Nice')
    break

If the answer is "less", then we start the iteration over – but before that, we set the maximum value of the age range to the guessed age. (So in the second iteration the script will guess the middle value of 0 and 50.)
elif answer == 'less':
    up = guessed_age

We do the same for the "more" answer – except that in this case we change the minimum (and not the maximum) value:
elif answer == 'more':
    down = guessed_age

And eventually we handle the wrong answers and the typos:
else:
    print('wrong answer')
Did you find a better solution?
Share it with me in the comment section below!
Conclusion
Now you’ve got the idea of:
Python nested for loops and
for loops and if statements combined.
They are not necessarily considered to be Python basics; this is more like a transition to the
intermediate level. Using them requires a solid understanding of Python3’s logic – and a lot of
practicing, too.
There are only two episodes left from the Python for Data Science Basics tutorial series! Keep it
going and continue with the Python syntax essentials!
If you want to learn more about how to become a data scientist, take my 50-minute
video course: How to Become a Data Scientist. (It’s free!)
Also check out my 6-week online course: The Junior Data Scientist’s First Month video
course.
Cheers,
Tomi
Python Syntax Essentials and Best Practices
Written by Tomi Mester on April 24, 2018
Last updated on August 03, 2019
In my Python workshops and online courses I see that one of the trickiest things for newcomers
is the syntax itself. It’s very strict and many things might seem inconsistent at first. In this article
I’ve collected the Python syntax essentials you should keep in mind as a data professional — and
I added some formatting best practices as well, to help you keep your code nice and clean.
These are the basics. If you want to go deep down the rabbit hole, I’ll link to some advanced
Python syntax and formatting tutorials at the end of this article!
This article is the part of my Python for Data Science article series. If you haven’t done so yet, please
start with these articles first:
How to install Python, R, SQL and bash to practice data science!
Python for Data Science #1 – Tutorial for Beginners – Python Basics
The 3 major things to keep in mind about Python syntax
#1 Line Breaks Matter
Unlike in SQL, in Python line breaks matter. This means that in 99% of cases, if you put a line break where you shouldn't, you will get an error message. Is it weird? Hey, at least you don't have to add semicolons at the end of every line.
So here's Python syntax rule #1: one statement per line.
There are some exceptions, though. Expressions
in parentheses (e.g. functions and methods),
in brackets (e.g. lists),
and in curly braces (e.g. dictionaries)
can actually be split into more lines. This is called implicit line joining, and it is a great help when working with bigger data structures.
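For example, this list definition is valid even though it spans three lines:

my_books = ['A Game of Thrones',
            'Digital Fortress',
            'Practical Statistics for Data Scientists']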
Explicit line joining – putting a backslash (\) at the end of a line – works too, but it's less common, and I'd recommend using it only when necessary (e.g. with really long, 80+ character statements).
#2 Indentation Matters
As you have seen with the if statements and for loops in this series, in Python blocks of code are defined by indentation. So here's Python syntax rule #2: indentation matters.
One more thing: if you watch the Silicon Valley TV show, you might have heard about the "tabs vs spaces" debate. Here's the hilarious scene:
So tabs or spaces? Here's what the original Style Guide for Python Code (PEP 8) says: "Spaces are the preferred indentation method. Tabs should be used solely to remain consistent with code that is already indented with tabs."
Pretty straightforward!
P.S. To be honest, in Jupyter Notebook, I use tabs.
#3 Case Sensitivity
Python is case sensitive. It makes a difference whether you type and (correct) or AND (won’t
work). As a rule of thumb, learn that most of the Python keywords have to be written with
lowercase letters. The most commonly used exceptions I have to mention here (because I see
many beginners have trouble with it) are the Boolean values. These are correctly spelled
as: True and False. (Not TRUE, nor true.)
There’s Python syntax rule #3: Python is case sensitive.
Other Python Best Practices for Nicer Formatting
Let me just list a few (non-mandatory but highly recommended) Python best practices that will
make your code much nicer, more readable and more reusable.
Python Best Practice #1: Use Comments
You can add comments to your Python code. Simply use the # character. Everything that comes
after the # won’t be executed.
# This is a comment before my for loop.
for i in range(0, 100, 2):
    print(i)
# use comments!
Python Best Practice #2: Variable Names
Conventionally, variable names should be written with lowercase letters, with the words separated by _ characters. Also, I generally do not recommend one-letter variable names in your code. Using meaningful and easy-to-distinguish variable names helps other programmers a lot when they want to understand your code.
my_meaningful_variable = 100
Python Best Practice #3: Use blank lines
If you want to separate code blocks visually (e.g. when you have a 100 line Python script in
which you have 10-12 blocks that belong together) you can use blank lines. Even multiple blank
lines. It won’t affect the result of your script.
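A quick sketch of the idea:

# block 1: set up the data
numbers = [1, 5, 12, 91, 102]


# block 2: process it
for i in numbers:
    print(i * i)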
Python Best Practice #4: Use white spaces around operators and assignments
For cleaner code it’s worth using spaces around your = signs and your mathematical and
comparison operators (>, <, +, -, etc.). If you don’t use white spaces, your code will run anyway,
but again: the cleaner the code, the easier to read it, the easier to reuse it.
number_x = 10
number_y = 100
number_mult = number_x * number_y
Python Best Practice #5: Max line length should be 79 characters
If you reach 79 characters in a line, it's recommended to break your code into more lines. One option is the above-mentioned \ character: if you put a \ at the end of a line, Python ignores the line break and reads your code as if it were one line. (Or, in some cases, you can take advantage of implicit line joining.)
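A sketch of both options:

# explicit line joining with a backslash
total = 1 + 2 + 3 + \
        4 + 5 + 6

# implicit line joining inside parentheses
total = (1 + 2 + 3 +
         4 + 5 + 6)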
Bonus: the Zen of Python. If you type import this into your Python environment, you get back a little poem that sums up many of these principles:
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
Python Import Statement and the Most Important Built-in
Modules for Data Scientists
Written by Tomi Mester on May 13, 2018
Last updated on February 01, 2020
So far we have worked with the most essential concepts of Python: variables, data structures,
built-in functions and methods, for loops, and if statements. These are all parts of the core
semantics of the language. But this is far from everything that Python knows… actually this is
just the very beginning and the exciting stuff is yet to come. Because Python also has tons of
modules and packages that we can import into our projects… What does this mean? In this
article I’ll give you an intro: I’ll show you the Python import statement and the most important
built-in modules that you have to know for data science!
Before we start
If you haven’t done so yet, I recommend going through these articles first:
1. How to install Python, R, SQL and bash to practice data science!
2. Python for Data Science – Basics #1 – Variables and basic operations
3. Python for Data Science – Basics #2 – Python Data Structures
4. Python for Data Science – Basics #3 – Python Built-in Functions
The Python Import Statement
Okay, so what is the import statement and why is it so important?
Think of it like LEGO:
Python core semantics
So far we have played around with the base elements of our LEGO playset. But if you want to
build something complex, you have to use more advanced tools.
If you use import, you can get access to the advanced Python “tools” (they are called modules).
New tools to access via the Python import statement
These are divided into three groups:
1. The modules of the Python Standard Library:
You can get these really easily because they come with Python3 by default. You simply
have to type import and the name of the module – and from that point on you can use
the given module in your code. In this article, I’ll show you exactly how to do that in
detail.
2. Other, even more advanced and more specialized modules:
There are modules that are not part of the standard library. For these, you have to install
new packages to your data server first. You will see that for data science we are using
many of these “external” packages. (The ones you might have heard about are pandas,
numpy, matplotlib, scikit-learn, etc.) I’ll get back to this topic in another article.
3. Your own modules:
Yes, you can write new modules by yourself, too! (I’ll cover this in my advanced Python
tutorials.)
Anyway, import is a really powerful concept in Python – because with that you’ll be able to
expand your toolset continuously and almost infinitely when you are dealing with different data
science challenges.
The most important Python Built-in Modules for Data Scientists
Okay, now that you get the concept, it’s time to see it in practice. As I have mentioned, there is a
Python Standard Library with dozens of built-in modules. From those, I hand-picked the five
most important modules for data analysts and scientists. These are:
random
statistics
math
datetime
csv
You can easily import any of them by using this syntax:
import [module_name]
eg. import random
Note: This will import the entire module with all items in it. You can import only a part of the module,
too: from [module_name] import [item_name]. But let’s not complicate things with that yet.
Let’s see the five built-in modules one by one!
Python Built-in Module #1: random
Randomization is very important in data science… just think about experimenting and A/B
testing! If you import the random module, you can generate random numbers by various rules.
Let’s type this to your Jupyter Notebook first:
import random
Then in a separate cell try out:
random.random()
This will generate a random float between 0 and 1.
Try this one, too:
random.randint(1,10)
This will generate a random integer between 1 and 10.
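Python Built-in Module #2: statistics
If you import the statistics module, you get access to the basic descriptive statistics functions – a quick, illustrative sketch:

import statistics

a = [3, 2, 1, 5, 4]
statistics.mean(a)     # 3
statistics.median(a)   # 3
statistics.stdev(a)    # the sample standard deviation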
Learn more about the statistics module here.
Python Built-in Module #3: math
There are a few functions that are under the umbrella of math rather than statistics. So there is a
separate module for that. This contains factorial, power, and logarithmic functions, but also some
trigonometry and constants.
Try this:
import math
And then:
math.factorial(5)
math.pi
math.sqrt(5)
math.log(256, 2)
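(The expected results, in order: 120, 3.141592653589793, 2.23606797749979 and 8.0.)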
Learn more about the math module here.
Python Built-in Module #4: datetime
Do you plan to work for an online startup? Then you will probably encounter a lot of data logs. And the heart of a data log is the datetime. Python 3, by default, does not handle dates and times, but if you import the datetime module, you get access to these functions, too.
import datetime
To be honest, I think the implementation of the datetime module of Python is a bit over-
complicated… at least, it’s not easy to use for beginners. I’ll write a separate article about it later.
But for now let’s try these two functions to get a bit more familiar with it:
datetime.datetime.now()
datetime.datetime.now().strftime("%F")
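(The first returns a datetime object holding the current date and time; the second formats it as a string – on most platforms %F is shorthand for %Y-%m-%d, so you get something like '2020-02-01'.)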
If you want to open a .csv file in your Jupyter Notebook, you first need this module:
import csv
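Then a sketch of reading a file – the file name here is hypothetical, and csv.reader with a delimiter argument is the same call the Syntax section below mentions:

with open('my_file.csv') as csvfile:
    my_reader = csv.reader(csvfile, delimiter=';')
    for row in my_reader:
        print(row)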
As you can see, it returned Python lists. So with the list selection features and with the list
methods that we have learned previously, you can also break down and restructure this data.
Learn more about the csv module: here.
More built-in modules
This is a good start but far from the whole list of the Python built-in modules. With other
modules you can zip and unzip files, scrape websites, send emails, encode and decode JSON files
and do a lot of other exciting things. If you want to take a look at the whole list, check out
the Python Standard Library which is part of the original Python documentation.
And, as I mentioned, there are other Python libraries and packages that are not part of the
standard library (like pandas, numpy, scipy, etc.) – I’ll write more about them soon!
Syntax
Now that you have seen how import works, let’s talk briefly about the syntax!
Three things:
1. Usually, in Python scripts, we put all the import statements at the beginning of our script.
Why is that? To see what modules our script relies on. Also, to make sure that the
modules will be imported before we need to apply them. So keep this advice in
mind: import statements come at the beginning of your Python scripts.
2. In this article, we applied the functions of the modules using this syntax: module_name.function_name(parameters)
E.g. statistics.median(a)
or
csv.reader(csvfile, delimiter=';')
This is logical: before you apply a given function, you have to tell Python in which module to find it. In some cases there are even more complicated relationships – like functions of classes in a module (e.g. datetime.datetime.now()) – but let's not confuse ourselves with that for now. My suggestion is to make a list of your favorite modules and functions and learn how they work; if you need a new one, check out the original Python documentation and add the new module plus its function to your list.
3. When you import a module (or a package) you can rename it using the as keyword. If you type:
import statistics as stat
then you have to refer to your module as stat – e.g. stat.median(a) and not statistics.median(a).
Conventionally, the two best-known data-science-related Python libraries are imported with shortened names: numpy (import numpy as np) and pandas (import pandas as pd). I'll get back to this in another article!
So what’s the name of it? Package? Module? Function? Library?
When I first encountered this import concept, I had a hard time understanding what exactly I was importing. In some cases these things were referred to as "modules", in some cases as "packages", in other cases as "functions", and sometimes as "libraries".
Note: even the documentation of numpy and pandas – the two most popular data-science-related Python projects – disagrees: one calls itself a library, the other a package. The truth is that the naming is used loosely. Strictly speaking, a module is a single Python file, a package is a collection of modules, and "library" is an informal umbrella term for both.
Conclusion
Whatever we call them, modules and packages open thousands of new doors. In my upcoming articles I'll introduce the most important Python libraries and packages you have to know as a data scientist!
If you want to learn more about how to become a data scientist, take my 50-minute
video course: How to Become a Data Scientist. (It’s free!)
Also check out my 6-week online course: The Junior Data Scientist’s First Month video
course.
Cheers,
Tomi Mester
Python libraries and packages for Data Scientists (the 5 most
important ones)
Written by Tomi Mester on June 26, 2018
Last updated on July 22, 2020
Did you know that Python wasn’t originally built for Data Science? And yet today it’s one of the
best languages for statistics, machine learning, and predictive analytics as well as simple data
analytics tasks. How come? It’s an open-source language, and data professionals started creating
tools for it to complete data tasks more efficiently. Here, I’ll introduce the most important
Python libraries and packages that you have to know as a Data Scientist.
In my previous article, I introduced the Python import statement and the most important
modules from the Python Standard Library. In this one, I’ll focus on the libraries and packages
that are not coming with Python 3 by default. At the end of the article, I’ll also show you how to
get (download, install and import) them.
Before we start
If you haven’t done so yet, I recommend going through these articles first:
1. How to install Python, R, SQL and bash to practice data science
2. Python for Data Science – Basics #1 – Variables and basic operations
3. Python for Data Science – Basics #2 – Python Data Structures
4. Python for Data Science – Basics #3 – Python Built-in Functions
5. Python Import Statement and the Most Important Built-in Modules
Top 5 most important Python libraries and packages for Data Science
Numpy
Pandas
Matplotlib
Scikit-Learn
Scipy
These are the five most essential Data Science libraries you have to know.
Let’s see them one by one!
Numpy
Numpy will help you to manage multi-dimensional arrays very efficiently. Maybe you won’t do
that directly, but since the concept is a crucial part of data science, many other libraries (well,
almost all of them) are built on Numpy. Simply put: without Numpy you won’t be able to use
Pandas, Matplotlib, Scipy or Scikit-Learn. That’s why you need it first.
3-dimensional numpy array
Beyond that, it also has a few well-implemented methods of its own. I quite often use Numpy’s random function, which I find slightly better than the random module of the standard library.
And when it comes to simple predictive analytics tasks like linear or polynomial regression,
Numpy’s polyfit function is my favorite. (More about that in another article.)
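To give you a (hedged) taste of both, here’s a tiny sketch – every number in it is made up for illustration:
import numpy as np

np.random.seed(42)
print(np.random.normal(175, 6, 5)) # 5 random draws from a normal distribution

x = np.array([0, 1, 2, 3, 4])
y = np.array([5, 7, 9, 11, 13])    # these points lie exactly on y = 2*x + 5
print(np.polyfit(x, y, 1))         # fits a 1st-degree polynomial; returns ~[2. 5.]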
prediction with numpy’s polyfit
Pandas
To analyze data, we like to use two-dimensional tables – like in SQL and in Excel. Originally,
Python didn’t have this feature. Weird, isn’t it? But that’s why Pandas is so important! I like to
say, Pandas is the “SQL of Python.” (Eh, I can’t wait to see what I will get for this sentence in the
comment section… ;-)) Okay, to be more precise: Pandas is the library that will help us to handle
two-dimensional data tables in Python. In many senses it’s really similar to SQL, though.
a pandas dataframe
With pandas, you can load your data into data frames, you can select columns, filter for specific
values, group by values, run functions (sum, mean, median, min, max, etc.), merge dataframes and
so on. You can also create multi-dimensional data-tables.
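Just to give you a taste of that workflow, here’s a minimal sketch – the animals and numbers are demo values only:
import pandas as pd

df = pd.DataFrame({'animal': ['zebra', 'lion', 'elephant'],
                   'water_need': [100, 350, 670]})
print(df[df.water_need > 200]) # filtering for specific values
print(df.water_need.mean())    # running a function (mean) on a column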
That’s a common misunderstanding, so let me clarify: Pandas is not a predictive analytics or
machine learning library. It was created for data analysis, data cleaning, data handling and data
discovery… By the way, these are the necessary steps before you run machine learning projects,
and that’s why you will need pandas for every scientific project, too.
If you are starting with Python for Data Science and have learned the basics of Python, I recommend focusing on Pandas next. This short article series of mine will help you: Pandas for Data Scientists.
Matplotlib
I hope I don’t have to detail why data visualization is important. Data visualization helps you to
better understand your data, discover things that you wouldn’t discover in raw format and
communicate your findings more efficiently to others.
The best and most well-known Python data visualization library is Matplotlib. I wouldn’t say it’s
easy to use… But usually if you save for yourself the 4 or 5 most commonly used code blocks for
basic line charts and scatter plots, you can create your charts pretty fast.
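For a flavor, here’s one of those reusable blocks – a minimal sketch with made-up data points:
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
plt.plot(x, y)    # a basic line chart
plt.scatter(x, y) # the same points as a scatter plot, on the same axes
plt.show()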
matplotlib dataviz example
Here’s another article that introduces Matplotlib more in-depth: How to use matplotlib.
Scikit-Learn
Without any doubt the fanciest things in Python are Machine Learning and Predictive Analytics.
And the best library for that is Scikit-Learn, which simply defines itself as “Machine Learning in
Python.” Scikit-Learn has several methods, basically covering everything you might need in the
first few years of your data career: regression methods, classification methods, and clustering, as
well as model validation and model selection. You can also use it for dimensionality reduction
and feature extraction.
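Just to show the flavor of its API, here’s a minimal regression sketch with made-up numbers:
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4]])   # one feature, four samples
y = np.array([2, 4, 6, 8])           # target values
model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_) # slope ~2.0, intercept ~0.0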
(Get started with my machine learning tutorials here: Linear Regression in Python using sklearn
and numpy!)
source: pythonprogramming.net
Scipy
This is kind of confusing, but there is a Scipy library and there is a Scipy stack. Most of the
libraries and packages I wrote about in this article are part of the Scipy stack (that is for scientific
computing in Python). And one of these components is the Scipy library itself, which provides
efficient solutions for numerical routines (the math stuff behind machine learning models). These
are: integration, interpolation, optimization, etc.
Just like Numpy, you most probably won’t use Scipy itself, but the above-mentioned Scikit-Learn
library highly relies on it. Scipy provides the core mathematical methods to do the complex
machine learning processes in Scikit-learn. That’s why you have to know it.
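If you are curious what those numerical routines look like in practice, here’s a one-off integration sketch:
from scipy import integrate

# numerically integrate f(x) = x**2 from 0 to 1 (the exact answer is 1/3)
result, error = integrate.quad(lambda x: x**2, 0, 1)
print(result) # ~0.3333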
More Python libraries and packages for data science…
What about image processing, natural language processing, deep learning, neural nets, etc.? Of
course, there are numerous very cool Python libraries and packages for these, too. In this article,
I won’t cover them because I think, for a start, it’s worth taking the time to get familiar with the above-mentioned five libraries. Once you get fluent with them – then and only then – you can go ahead and expand your horizons with more specific data science libraries.
How to get Pandas, Numpy, Matplotlib, Scikit-Learn and Scipy?
First of all, you have to set up a basic data server by following my original How to install Python,
R, SQL and bash to practice data science article. Once you have that, you can install these tools
additionally, one by one. Just follow these steps:
1. Login to your data server!
2. Install numpy using this command:
sudo -H pip3 install numpy
3. Install pandas using this command:
sudo apt-get install python3-pandas
4. Upgrade some additional tools of pandas using these two commands:
sudo -H pip3 install --upgrade beautifulsoup4
sudo -H pip3 install --upgrade html5lib
5. Upgrade Scipy:
sudo -H pip3 install --upgrade scipy
6. Install scikit-learn using this command:
sudo -H pip3 install scikit-learn
Once you have them installed, import them (or specific modules of them) into your Jupyter
notebook by using the right import statements. For instance:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.linear_model import LinearRegression
After this you can even test pandas and matplotlib together by running these few lines:
df = pd.DataFrame({'a':[1,2,3,4,5,6,7],
'b':[1,4,9,16,25,36,49]})
df.plot()
If you need detailed, step-by-step guidance with this setup process, check out my Install Python,
R, SQL and bash – to practice Data Science and Coding! video course.
Conclusion
The five most essential Data Science libraries and packages are:
Numpy
Pandas
Matplotlib
Scikit-Learn
Scipy
Get them, learn them, use them and they will open a lot of new doors in your data science
career!
If you want to learn more about how to become a data scientist, take my 50-minute
video course: How to Become a Data Scientist. (It’s free!)
Also check out my 6-week online course: The Junior Data Scientist’s First Month video
course.
Cheers,
Tomi Mester
Pandas Tutorial 1: Pandas Basics (Reading Data Files,
DataFrames, Data Selection)
Written by Tomi Mester on July 10, 2018
Last updated on August 03, 2019
Pandas is one of the most popular Python libraries for Data Science and Analytics. I like to say
it’s the “SQL of Python.” Why? Because pandas helps you to manage two-dimensional data
tables in Python. Of course, it has many more features. In this pandas tutorial series, I’ll show you
the most important (that is, the most often used) things that you have to know as an Analyst or a
Data Scientist. This is the first episode and we will start from the basics!
Note 1: this is a hands-on tutorial, so I recommend doing the coding part with me!
Before we start
If you haven’t done so yet, I recommend going through these articles first:
How to install Python, R, SQL and bash to practice data science
Python for Data Science – Basics #1 – Variables and basic operations
Python Import Statement and the Most Important Built-in Modules
Top 5 Python Libraries and Packages for Data Scientists
To follow this pandas tutorial…
1. You will need a fully functioning data server with Python3, numpy and pandas on it.
Note 1 : Again, with this tutorial you can set up your data server and Python3. And with this
article you can set up numpy and pandas, too.
Note 2: or take this step-by-step data server set up video course.
2. Next step: log in to your server and fire up Jupyter. Then open a new Jupyter Notebook
in your favorite browser. (If you don’t know how to do that, I really do recommend going
through the articles I linked in the “Before we start” section.)
Note: I’ll also rename my Jupyter Notebook to “pandas_tutorial_1”.
Firing up Jupyter Notebook
3. Import numpy and pandas to your Jupyter Notebook by running these two lines in a cell:
import numpy as np
import pandas as pd
Note: It’s conventional to refer to ‘pandas’ as ‘pd’. When you add the as pd at the end of your import
statement, your Jupyter Notebook understands that from this point on every time you type pd, you
are actually referring to the pandas library.
Okay, now we have everything! Let’s start with this pandas tutorial!
The first question is:
How to open data files in pandas
You might have your data in .csv files or SQL tables. Maybe Excel files. Or .tsv files. Or
something else. But the goal is the same in all cases. If you want to analyze that data using
pandas, the first step will be to read it into a data structure that’s compatible with pandas.
Pandas data structures
There are two types of data structures in pandas: Series and DataFrames.
Series: a pandas Series is a one-dimensional data structure (“a one-dimensional ndarray”) that can store values – and for every value it holds a unique index, too.
DataFrame: a pandas DataFrame is a two-dimensional data structure – basically a table with rows and columns, in which every column is, in fact, a Series.
Start with a simple demo data set, called zoo! This time – for the sake of practicing – you will
create a .csv file for yourself! Here’s the raw data:
animal,uniq_id,water_need
elephant,1001,500
elephant,1002,600
elephant,1003,550
tiger,1004,300
tiger,1005,320
tiger,1006,330
tiger,1007,290
tiger,1008,310
zebra,1009,200
zebra,1010,220
zebra,1011,240
zebra,1012,230
zebra,1013,220
zebra,1014,100
zebra,1015,80
lion,1016,420
lion,1017,600
lion,1018,500
lion,1019,390
kangaroo,1020,410
kangaroo,1021,430
kangaroo,1022,410
Go back to your Jupyter Home tab and create a new text file…
…then copy-paste the above zoo data into this text file…
… and then rename this text file to zoo.csv!
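To bring it into pandas, read it with the read_csv function – the same command we’ll use throughout this series:
pd.read_csv('zoo.csv', delimiter=',')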
And there you go! This is the zoo.csv data file, brought to pandas. This nice 2D table? Well, this is
a pandas dataframe. The numbers on the left are the indexes. And the column names on the top
are picked up from the first row of our zoo.csv file.
To be honest, though, you will probably never create a .csv data file for yourself, like we just
did… you will use pre-existing data files. So you have to learn how to download .csv files to your
server!
If you are here from the Junior Data Scientist’s First Month video course then you have already
dealt with downloading your .txt or .csv data files to your data server, so you must be pretty
proficient in it… But if you are not here from the course (or if you want to learn another way to
download a .csv file to your server and to get another exciting dataset), follow these steps:
I’ve uploaded a small sample dataset here: DATASET
(Link: 46.101.230.157/dilan/pandas_tutorial_read.csv)
If you click the link, the data file will be downloaded to your computer. But you don’t want to
download this data file to your computer, right? You want to download it to your server and then
load it to your Jupyter Notebook. It only takes two steps.
STEP 1) Go back to your Jupyter Notebook and type this command:
!wget 46.101.230.157/dilan/pandas_tutorial_read.csv
This downloaded the pandas_tutorial_read.csv file to your server. Just check it out:
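STEP 2) Load it with the read_csv function! As a first try, something like this should do it (note that in this file the delimiter is a semicolon):
pd.read_csv('pandas_tutorial_read.csv', delimiter=';')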
Does something feel off? Yes, this time we didn’t have a header in our csv file, so we have to set
it up manually! Add the names parameter to your function!
pd.read_csv('pandas_tutorial_read.csv', delimiter=';', names = ['my_datetime', 'event', 'country',
'user_id', 'source', 'topic'])
Better!
And with that, we finally loaded our .csv data into a pandas dataframe!
Note 1: Just so you know, there is an alternative method. (I don’t prefer it though.) You can load the
.csv data using the URL directly. In this case the data won’t be downloaded to your data server.
read the .csv directly from the server (using its URL)
Note 2: If you are wondering what’s in this data set – this is the data log of a travel blog. This is a log
of one day only (if you are a JDS course participant, you will get much more of this data set on the last
week of the course ;-)). I guess the names of the columns are fairly self-explanatory.
Selecting data from a dataframe in pandas
This is the first episode of this pandas tutorial series, so let’s start with a few very basic data
selection methods – and in the next episodes we will go deeper!
1) Print the whole dataframe
The most basic method is to print your whole data frame to your screen. Of course, you don’t
have to run the pd.read_csv() function again and again and again. Just store its output the first
time you run it!
article_read = pd.read_csv('pandas_tutorial_read.csv', delimiter=';', names = ['my_datetime',
'event', 'country', 'user_id', 'source', 'topic'])
After that, you can call this article_read value anytime to print your DataFrame!
Sometimes, it’s handy not to print the whole dataframe and flood your screen with data. When a
few lines is enough, you can print only the first 5 lines – by typing:
article_read.head()
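2) Print selected columns only
To select a few specific columns, list their names between double bracket frames – for instance:
article_read[['country', 'user_id']]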
Any guesses why we have to use double bracket frames? It seems a bit over-complicated, I
admit, but maybe this will help you remember: the outer bracket frames tell pandas that you
want to select columns, and the inner brackets are for the list (remember? Python lists go between
bracket frames) of the column names.
By the way, if you change the order of the column names, the order of the returned columns will
change, too:
article_read[['user_id', 'country']]
By the way, if you want to select one column only, there are two simpler syntaxes for that, too. Both of these return the user_id column:
article_read.user_id
article_read['user_id']
How about filtering for rows? The logic comes in two steps:
STEP 1) pandas evaluates a condition – eg. article_read.country == 'country_2' – and returns a True or False value for every row of the table.
STEP 2) Then from the article_read table, it prints every row where this value is True and doesn’t print any row where it’s False.
Does it look over-complicated? Maybe. But this is the way it is, so let’s just learn it, because you will use it all the time!
You can also chain these selection methods. For instance, you can filter for the country_2 rows and keep three columns only:
ar_filtered_cols = article_read[article_read.country == 'country_2'][['user_id', 'topic', 'country']]
ar_filtered_cols.head()
Either way, the logic is the same. First you take your original dataframe (article_read), then you
filter for the rows where the country value is country_2 ([article_read.country == 'country_2']),
then you take the three columns that were required ([['user_id','topic', 'country']]) and eventually
you take the first five rows only (.head()).
Conclusion
You are done with the first episode of my pandas tutorial series! Great job! In the next article,
you can learn more about the different aggregation methods (e.g. sum, mean, max, min) and
about grouping (so basically about segmentation). Stay with me: Pandas Tutorial, Episode 2!
If you want to learn more about how to become a data scientist, take my 50-minute
video course: How to Become a Data Scientist. (It’s free!)
Also check out my 6-week online course: The Junior Data Scientist’s First Month video
course.
Cheers,
Tomi Mester
Pandas Tutorial 2: Aggregation and Grouping
Written by Tomi Mester on July 23, 2018
Last updated on August 03, 2019
Let’s continue with the pandas tutorial series. This is the second episode, where I’ll introduce
aggregation (such as min, max, sum, count, etc.) and grouping. Both are very commonly used
methods in analytics and data science projects – so make sure you go through every detail in this
article!
Note 1: this is a hands-on tutorial, so I recommend doing the coding part with me!
Before we start
If you haven’t done so yet, I recommend going through these articles first:
1. How to install Python, R, SQL and bash to practice data science
2. Python for Data Science – Basics #1 – Variables and basic operations
3. Python Import Statement and the Most Important Built-in Modules
4. Top 5 Python Libraries and Packages for Data Scientists
5. Pandas Tutorial 1: Pandas Basics (Reading Data Files, DataFrames, Data Selection)
Data aggregation – in theory
Aggregation is the process of turning the values of a dataset (or a subset of it) into one single
value. Let me make this clear! If you have a DataFrame like…
animal water_need
zebra 100
lion 350
elephant 670
kangaroo 200
…then a simple aggregation method is to calculate the sum of the water_need values, which is
100 + 350 + 670 + 200 = 1320. Or a different aggregation method would be to count the
number of the animals, which is 4. So the theory is not too complicated. Let’s see the rest in
practice…
Data aggregation – in practice
Where did we leave off last time? We opened a Jupyter notebook, imported pandas and numpy
and loaded two datasets: zoo.csv and article_read. We will continue from here – so if you
haven’t done the “pandas tutorial – episode 1“, it’s time to go through it!
Okay!
Let’s start with our zoo dataset! (If you want to download it again, you can find it at this link.) We
have loaded it by using:
pd.read_csv('zoo.csv', delimiter = ',')
Pandas Data Aggregation #1: .count()
The first aggregation method is counting. Run:
zoo.count()
Oh, hey, what are all these lines? Actually, the .count() function counts the number of values in each column. In the case of the zoo dataset, there were 3 columns, and each of them had 22 values in it.
If you want to make your output clearer, you can select the animal column first by using one of
the selection operators from the previous article:
zoo[['animal']].count()
Or in this particular case, the result could be even nicer if you use this syntax:
zoo.animal.count()
This also selects only one column, but it turns our pandas dataframe object into a pandas series
object. (Which means that the output format is slightly different.)
Pandas Data Aggregation #2: .sum()
Following the same logic, you can easily sum the values in the water_need column:
zoo.water_need.sum()
Just out of curiosity, let’s run our sum function on all columns, as well:
zoo.sum()
Note: I love how .sum() turns the words of the animal column into one string of animal names. (By the
way, it’s very much in line with the logic of Python.)
Pandas Data Aggregation #3 and #4: .min() and .max()
What’s the smallest value in the water_need column? I bet you have figured it out already:
zoo.water_need.min()
And the max value is pretty similar:
zoo.water_need.max()
Pandas Data Aggregation #5 and #6: .mean() and .median()
And eventually, the statistical averages:
zoo.water_need.mean()
zoo.water_need.median()
Okay, this was easy. Much, much easier than the aggregation methods of SQL.
But let’s spice this up with a little bit of grouping!
Grouping in pandas
As a Data Analyst or Scientist you will probably do segmentations all the time. For instance, it’s
nice to know the mean water_need of all animals (we have just learned that it’s 347.72). But very
often it’s much more actionable to break this number down – let’s say – by animal types. With
that, we can compare the species to each other – or we can find outliers.
Here’s a simplified visual that shows how pandas performs “segmentation” (grouping and
aggregation) based on the column values!
Pandas .groupby in action
Let’s do the above presented grouping and aggregation for real, on our zoo DataFrame!
We have to fit in a groupby keyword between our zoo variable and our .mean() function:
zoo.groupby('animal').mean()
Just as before, pandas automatically runs the .mean() calculation for all remaining columns
(the animal column obviously disappeared, since that was the column we grouped by). You can
either ignore the uniq_id column, or you can remove it afterwards by using one of these
syntaxes:
zoo.groupby('animal').mean()[['water_need']] –» This returns a DataFrame object.
zoo.groupby('animal').mean().water_need –» This returns a Series object.
Obviously, you can change the aggregation method from .mean() to anything we learned above!
Okay! Now you know everything you have to know!
It’s time to…
Test yourself #1
Let’s get back to our article_read dataset.
(Note: Remember, this dataset holds the data of a travel blog. If you don’t have the data yet, you can
download it from here. Or you can go through the whole download, open, store process step by step
by reading the previous episode of this pandas tutorial.)
Your task: find out how many article reads came from each traffic source!
.
.
.
Here’s the solution:
article_read.groupby('source').count()
You can – optionally – remove the unnecessary columns and keep the user_id column only:
article_read.groupby('source').count()[['user_id']]
Test yourself #2
Here’s another, slightly more complex challenge:
For the users of country_2, what was the most frequent topic and source combination? Or in
other words: which topic, from which source, brought the most views from country_2?
.
.
.
The result is: the combination of Reddit (source) and Asia (topic), with 139 reads!
And the Python code to get this result is:
article_read[article_read.country == 'country_2'].groupby(['source', 'topic']).count()
The new element here is that you can group by multiple columns, too. Now you know that! (Syntax-wise, watch out for one thing: you have to put the names of the columns into a list. That’s why the bracket frames go between the parentheses – that was the groupby(['source', 'topic']) part.)
And as per usual: the count() function is the last piece of the puzzle.
Conclusion
This was the second episode of my pandas tutorial series. I hope now you see that aggregation and grouping are really easy and straightforward in pandas… and believe me, you will use them a
lot!
Note: If you have used SQL before, I encourage you to take a break and compare the pandas and the
SQL methods of aggregation. With that you will understand more about the key differences between
the two languages!
In the next article, I’ll show you the four most commonly used “data wrangling”
methods: merge, sort, reset_index and fillna. Stay with me: Pandas Tutorial, Episode 3!
If you want to learn more about how to become a data scientist, take my 50-minute
video course: How to Become a Data Scientist. (It’s free!)
Also check out my 6-week online course: The Junior Data Scientist’s First Month video
course.
Cheers,
Tomi Mester
Pandas Tutorial 3: Important Data Formatting Methods (merge, sort, reset_index, fillna)
Before we start
If you haven’t done so yet, I recommend going through these articles first:
1. How to install Python, R, SQL and bash to practice data science
2. Python for Data Science – Basics #1 – Variables and basic operations
3. Python Import Statement and the Most Important Built-in Modules
4. Top 5 Python Libraries and Packages for Data Scientists
5. Pandas Tutorial 1: Pandas Basics (Reading Data Files, DataFrames, Data Selection)
6. Pandas Tutorial 2: Aggregation and Grouping
Pandas Merge (a.k.a. “joining” dataframes)
In real life data projects, we usually don’t store all the data in one big data table. We store it in a
few smaller ones instead. There are many reasons behind this; by using multiple data tables, it’s
easier to manage your data, it’s easier to avoid redundancy, you can save some disk space, you
can query the smaller tables faster, etc.
The point is that it’s quite usual that during your analysis you have to pull your data from two or
more different tables. The solution for that is called merge.
Note: Although it’s called merge in pandas, it’s almost the same as SQL’s JOIN method.
Let me show you an example! Let’s take our zoo dataframe (from our previous tutorials) in which
we have all our animals… and let’s say that we have another dataframe, zoo_eats, that contains
information about the food requirements for each species.
We want to merge these two pandas dataframes into one big dataframe. Something like this:
In this table, it’s finally possible to analyze, for instance, how many animals in our zoo eat meat or
vegetables.
How did I do the merge?
First of all, you have the zoo dataframe already, but for this exercise you will have to create
a zoo_eats dataframe, too. For your convenience, here’s the raw data of the zoo_eats dataframe:
animal;food
elephant;vegetables
tiger;meat
kangaroo;vegetables
zebra;vegetables
giraffe;vegetables
If I were you, to put this into a proper pandas dataframe, I’d follow the process from the Pandas
Tutorial 1 article, but if you want to do this the lazy way, here’s a shortcut. Just copy-paste
this (really long) one line into the pandas_tutorial_1 Jupyter Notebook we made in the first
Pandas tutorial:
zoo_eats = pd.DataFrame([['elephant','vegetables'], ['tiger','meat'], ['kangaroo','vegetables'],
['zebra','vegetables'], ['giraffe','vegetables']], columns=['animal', 'food'])
And there is your zoo_eats dataframe!
Okay, now let’s see the pandas merge method:
zoo.merge(zoo_eats)
(Oh, hey, where are all the lions? We will get back to that soon, I promise!)
Bamm! Simple, right? Just in case, let’s see what’s happening here:
First, I specified the first dataframe (zoo), then I applied the .merge() pandas method on it and as
a parameter I specified the second dataframe (zoo_eats). I could have done this the other way
around:
zoo_eats.merge(zoo)
is symmetric to:
zoo.merge(zoo_eats)
The only difference between the two is the order of the columns in the output table. (Just try it!)
Pandas Merge… But how? Inner, outer, left or right?
As you can see, the basic merge method is pretty simple. Sometimes you have to add a few extra
parameters though.
One of the most important questions is how you want to merge these tables. In SQL, we learned
that there are different JOIN types.
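It’s similar in pandas: the how parameter of .merge() sets the merge type, and the default is inner. To keep every animal from both dataframes, you can run an outer merge:
zoo.merge(zoo_eats, how = 'outer')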
See? Lions came back, the giraffe came back… The only thing is that we have empty (NaN) values
in those columns where we didn’t get information from the other table.
In my opinion, in this specific case, it would make more sense to keep lions in the table but not
the giraffes… With that, we could see all the animals in our zoo and we would have three food
categories: vegetables, meat and NaN (which is basically “no information”). Keeping the giraffe
line would be misleading and irrelevant since we don’t have any giraffes in our zoo anyway.
That’s when merging with a how = 'left' parameter becomes handy!
Try this:
zoo.merge(zoo_eats, how = 'left')
Everything you do need, and nothing you don’t… The how = 'left' parameter brought all the
values from the left table (zoo) but brought only those values from the right table (zoo_eats) that
we have in the left one, too. Cool!
Let’s take a look at our merge types again:
Note: a common question I get is “What’s the “safest” way of merging? Should you go with inner,
outer, left or right, as a best practice?” My answer is: there is no categorical answer for this question.
While inner is the default merge type in pandas, whether you should go with that, or change to outer,
left or right, really depends on the task itself.
Pandas Merge. On which column?
For doing the merge, pandas needs the key-columns you want to base the merge on (in our case
it was the animal column in both tables). If you are not so lucky that pandas automatically
recognizes these key-columns, you have to help it by providing the column names. That’s what
the left_on and right_on parameters are for!
For example, our latest left merge could have looked like this, as well:
zoo.merge(zoo_eats, how = 'left', left_on = 'animal', right_on = 'animal')
Note: again, in the previous examples pandas automatically found the key-columns anyway… but
there are many cases when it doesn’t. So keep left_on and right_on in mind.
Okay, pandas merge was quite complex; the rest of the methods I’ll show you here will be much
easier.
Sorting in pandas
Sorting is essential. The basic sorting method is not too difficult in pandas. The function is
called sort_values() and it works like this:
zoo.sort_values('water_need')
Note: in the older version of pandas, there is a sort() function with a similar mechanism. But it has
been replaced with sort_values() in newer versions, so learn sort_values() and not sort().
The only parameter I used here was the name of the column I want to sort by, in this case
the water_need column. Quite often, you have to sort by multiple columns, so in general, I
recommend using the by keyword for the columns:
zoo.sort_values(by = ['animal', 'water_need'])
Note: you can use the by keyword with one column only, too, like zoo.sort_values(by =
['water_need']).
sort_values sorts in ascending order, but obviously, you can change this and do descending order
as well:
zoo.sort_values(by = ['water_need'], ascending = False)
Am I the only one who finds it funny that defining descending is possible only as ascending =
False? Whatever.
Reset_index
(This section is especially important for you if you participate in the Junior Data Scientist’s First Month
video course.)
What a mess with all the indexes after that last sorting, right?
It’s not just that it’s ugly… wrong indexing can mess up your visualizations (more about that in my
matplotlib tutorials) or even your machine learning models.
The point is: in certain cases, when you have done a transformation on your dataframe, you have
to re-index the rows. For that, you can use the reset_index() method. For instance:
zoo.sort_values(by = ['water_need'], ascending = False).reset_index()
Nicer? For sure!
As you can see, our new dataframe kept the old indexes, too. If you want to remove them, just
add the drop = True parameter:
zoo.sort_values(by = ['water_need'], ascending = False).reset_index(drop = True)
Fillna
(Note: fillna is basically fill + na in one word. If you ask me, it’s not the smartest name, but this is
what we have.)
Let’s rerun the left-merge method that we have used above:
zoo.merge(zoo_eats, how = 'left')
Remember? These are all our animals. The problem is that we have NaN values for
lions. NaN itself can be really distracting, so I usually like to replace it with something more
meaningful. In some cases, this can be a 0 value, or in other cases a specific string value, but this
time, I’ll go with unknown. Let’s use the fillna() function, which basically finds and replaces
all NaN values in our dataframe:
zoo.merge(zoo_eats, how = 'left').fillna('unknown')
Note: since we know that lions eat meat, we could have written zoo.merge(zoo_eats, how =
'left').fillna('meat'), as well.
Test yourself
Okay, you’ve gotten through the article! Great job!
Here’s your final test task!
Let’s get back to our article_read dataset.
(Note: Remember, this dataset holds the data of a travel blog. If you don’t have the data yet, you can
download it from here. Or you can go through the whole download, open, store process step by step
by reading the first episode of this pandas tutorial.)
Download another dataset, too: blog_buy. You can do that by running these two lines in your
Jupyter Notebook:
!wget 46.101.230.157/dilan/pandas_tutorial_buy.csv
blog_buy = pd.read_csv('pandas_tutorial_buy.csv', delimiter=';', names = ['my_date_time', 'event',
'user_id', 'amount'])
The article_read dataset shows all the users who read an article on the blog, and
the blog_buy dataset shows all the users who bought something on the very same blog
between 2018-01-01 and 2018-01-07.
I have two questions for you:
TASK #1: What’s the average (mean) revenue between 2018-01-01 and 2018-01-
07 from the users in the article_read dataframe?
TASK #2: Print the top 3 countries by total revenue between 2018-01-01 and 2018-01-
07! (Obviously, this concerns the users in the article_read dataframe again.)
SOLUTION for TASK #1
The average revenue is: 1.0852
Here’s the code:
step_1 = article_read.merge(blog_buy, how = 'left', left_on = 'user_id', right_on = 'user_id')
step_2 = step_1.amount
step_3 = step_2.fillna(0)
result = step_3.mean()
result
Note: for ease of understanding, I broke this down into “steps” – but you could also bring all these
functions into one line.
A short explanation:
(On the screenshot, at the beginning, I included the two extra cells where I import pandas and
numpy, and where I read the csv files into my Jupyter Notebook.)
In step_1, I merged the two tables (article_read and blog_buy) based on
the user_id columns. I kept all the readers from article_read, even if they didn’t buy
anything, because 0s should be counted into the average revenue value. And I removed
everyone who bought something but wasn’t in the article_read dataset (that was fixed in
the task). So all in all that led to a left join.
In step_2, I removed all the unnecessary columns, and kept only amount.
In step_3, I replaced NaN values with 0s.
And eventually I did the .mean() calculation.
SOLUTION for TASK #2
The code is:
step_1 = article_read.merge(blog_buy, how = 'left', left_on = 'user_id', right_on = 'user_id')
step_2 = step_1.fillna(0)
step_3 = step_2.groupby('country').sum()
step_4 = step_3.amount
step_5 = step_4.sort_values(ascending = False)
step_5.head(3)
Also check out my 6-week online course: The Junior Data Scientist’s First Month video
course.
Cheers,
Tomi Mester
How to Plot a Histogram in Python (Using Pandas)
Written by Tomi Mester on May 15, 2020
Last updated on October 27, 2020
Plotting a histogram in Python is easier than you’d think! And in this article, I’ll show you how.
I have a strong opinion about visualization in Python, which is: it should be useful and not pretty.
Why? Because the fancy data visualization for high-stakes presentations should happen in tools
that are the best for it: Tableau, Google Data Studio, PowerBI, etc… Creating charts and graphs
natively in Python should serve only one purpose: to make your data science tasks (e.g.
prototyping machine learning models) easier and more intuitive.
So in this tutorial, I’ll focus on how to plot a histogram in Python that’s:
fast
easy
useful
and yeah… probably not the most beautiful (but not ugly, either).
The tool we will use for that is a function in our favorite Python data analytics library
— pandas — and it’s called .hist()… But more about that in the article!
Download the code base!
Find the whole code base for this article (in Jupyter Notebook format) here:
Plot histograms in Python (GitHub link)
Download it from: here.
Before we get started…
In this article, I assume that you have some basic Python and pandas knowledge.
If you don’t, I recommend starting with these articles:
1. Python libraries and packages for Data Scientists
2. Learn Python from Scratch
3. Pandas Tutorial 1 (Basics)
4. Pandas Tutorial 2 (Aggregation and grouping)
5. Pandas Tutorial 3 (Data Formatting)
Also, this is a hands-on tutorial, so it’s the best if you do the coding part with me!
What is a histogram?
Start with the basics!
What is a histogram and how is it useful?
A histogram shows the number of occurrences of different values in a dataset. At first glance, it
is very similar to a bar chart.
It looks like this:
But a histogram is more than a simple bar chart.
Let me give you an example and you’ll see immediately why.
Let’s say that you run a gym and you have 250 clients. For some reason, you want to analyze
their heights. Good!
You have the individual data points – the height of each and every client in one big Python list:
height = [185, 172, 172, 169, 181, 162, 186, 171, 177, 174, 184, 163, 174, 173, 182, 169, 174,
170, 176, 179, 169, 182, 181, 179, 181, 171, 175, 170, 174, 179, 171, 173, 171, 170, 171, 175,
169, 177, 185, 180, 174, 170, 171, 186, 176, 172, 177, 188, 176, 179, 177, 173, 169, 173, 174,
179, 181, 181, 177, 181, 171, 183, 179, 174, 178, 175, 182, 185, 189, 167, 167, 172, 176, 181,
177, 163, 174, 180, 177, 180, 174, 174, 177, 178, 177, 176, 171, 178, 176, 182, 183, 177, 173,
172, 178, 176, 173, 176, 172, 180, 173, 183, 178, 179, 169, 177, 180, 170, 174, 176, 167, 177,
181, 170, 178, 168, 175, 166, 182, 178, 175, 171, 183, 187, 164, 183, 185, 178, 168, 181, 174,
172, 168, 179, 180, 172, 179, 169, 180, 176, 174, 175, 181, 180, 179, 176, 176, 179, 177, 180,
174, 161, 182, 189, 178, 175, 175, 175, 176, 169, 172, 170, 177, 174, 178, 174, 181, 177, 189,
164, 172, 181, 191, 174, 176, 174, 183, 174, 180, 174, 168, 177, 179, 183, 175, 172, 179, 177,
177, 175, 182, 178, 187, 182, 179, 166, 179, 178, 180, 182, 173, 180, 172, 187, 168, 165, 166,
170, 169, 187, 174, 167, 182, 172, 168, 181, 179, 173, 184, 176, 185, 179, 185, 176, 168, 190,
172, 174, 171, 174, 177, 177, 179, 186, 175, 168, 168, 172, 165, 180, 173, 174, 175, 167, 170,
180, 179, 173, 186, 168]
Note: it’s in centimeters, folks!
Looking at 250 data points is not very intuitive, is it?
As we’ve discussed in the statistical averages and statistical variability articles, you have to
“compress” these numbers into a few values that are easier to understand yet describe your
dataset well enough. These could be:
mean: 175.952
median: 176
mode: 174
standard deviation: 5.65
10% percentile: 168
90% percentile: 183
Based on these values, you can get a pretty good sense of your data…
But if you plot a histogram, too, you can also visualize the distribution of your data points.
For this dataset above, a histogram would look like this:
It’s very visual, very intuitive and tells you even more than the averages and variability measures
above. I love it!
Bins and ranges. A histogram is not the same as a bar chart!
You most probably realized that in the height dataset we have ~25-30 unique values. If you
simply counted the unique values in the dataset and put that on a bar chart, you would have
gotten this:
Bar chart that shows the frequency of unique values in the dataset
But when you plot a histogram, there’s one more initial step: these unique values will be
grouped into ranges. These ranges are called bins or buckets — and in Python, the default
number of bins is 10. So after the grouping, your histogram looks like this:
As I said: pretty similar to a bar chart — but not the same!
When is this grouping-into-ranges concept useful?
For instance when you have way too many unique values in your dataset. (In big data projects, it
won’t be ~25-30 as it was in our example… more like 25-30 *million* unique values.)
For instance, let’s imagine that you measure the heights of your clients with a laser meter and
you store first decimal values, too. Like this:
height = [185.7, 172.3, 172.8, 169.6, 181.2, 162.2, 186.5, 171.4, 177.9, 174.5, 184.8, 163.6,
174.1, 173.7, 182.8, 169.4, 175.0, 170.7, 176.3, 179.5, 169.4, 182.9, 181.4, 179.0, 181.4, 171.9,
175.3, 170.4, 174.4, 179.2, 171.9, 173.6, 171.9, 170.9, 172.0, 175.9, 169.3, 177.4, 186.0, 180.5,
174.8, 170.7, 171.5, 186.2, 176.3, 172.2, 177.1, 188.6, 176.7, 179.7, 177.8, 173.9, 169.1, 173.9,
174.7, 179.5, 181.0, 181.6, 177.7, 181.3, 171.5, 183.5, 179.1, 174.2, 178.9, 175.5, 182.8, 185.1,
189.1, 167.6, 167.3, 173.0, 177.0, 181.3, 177.9, 163.9, 174.2, 181.0, 177.4, 180.6, 174.7, 174.8,
177.1, 178.5, 177.2, 176.7, 172.0, 178.3, 176.7, 182.8, 183.2, 177.1, 173.7, 172.2, 178.5, 176.5,
173.9, 176.3, 172.3, 180.2, 173.3, 183.3, 178.4, 179.6, 169.4, 177.0, 180.4, 170.3, 174.4, 176.2,
167.8, 177.9, 181.1, 170.8, 178.1, 168.1, 175.8, 166.3, 182.7, 178.5, 175.9, 171.3, 183.6, 187.8,
164.9, 183.4, 185.8, 178.0, 168.8, 181.2, 174.9, 172.4, 168.6, 179.3, 180.8, 172.3, 179.1, 169.1,
180.8, 176.3, 174.9, 175.4, 181.2, 180.5, 179.2, 176.8, 176.5, 179.7, 177.4, 180.1, 174.1, 161.4,
182.2, 189.1, 178.6, 175.4, 175.2, 175.3, 176.1, 169.3, 172.9, 170.0, 177.5, 174.2, 179.0, 175.0,
181.9, 177.3, 189.1, 164.6, 172.1, 181.4, 191.2, 174.5, 176.3, 174.6, 184.0, 174.3, 180.1, 174.1,
168.4, 177.9, 179.0, 183.8, 175.3, 172.3, 179.4, 177.4, 177.7, 175.6, 183.0, 178.2, 187.4, 182.7,
180.0, 166.2, 179.6, 178.5, 180.9, 182.3, 173.6, 180.9, 172.6, 187.7, 168.0, 165.4, 166.1, 170.7,
169.3, 187.7, 174.0, 167.9, 182.7, 172.5, 168.6, 181.3, 179.7, 173.4, 184.4, 176.8, 185.7, 179.0,
185.4, 176.7, 168.7, 190.7, 172.7, 174.8, 171.8, 174.8, 177.5, 177.2, 180.0, 186.8, 175.3, 168.6,
168.9, 172.0, 166.0, 181.0, 173.0, 174.1, 176.0, 167.6, 170.8, 180.0, 179.7, 173.3, 186.9, 168.2]
This is the very same dataset as it was before… only one decimal more accurate.
But because of that tiny difference, now you have not ~25 but ~150 unique values. So if you
count the occurrences of each value and put it on a bar chart now, you would get this:
Ouch…
A histogram, though, even in this case, conveniently does the grouping for you. You get values
that are close to each other counted and plotted as values of given ranges/bins:
Beautiful… but more importantly: useful!
How to plot a histogram in Python (step by step)
Now that you know the theory, what a histogram is and why it is useful, it’s time to learn how to
plot one using Python. There are many Python libraries that can do so:
pandas
matplotlib
seaborn
…
But I’ll go with the simplest solution: I’ll use the .hist() function that’s built into pandas. As I said
in the introduction: you don’t have to do anything fancy here… You rather need a histogram
that’s useful and informative for you — and for your data science tasks.
Anyway, the .hist() pandas function is built on top of the original matplotlib solution. (See more
info in the documentation.) So the result and the visual you’ll get is more or less the same that
you’d get by using matplotlib… The syntax will also be similar, but a little bit closer to the logic
that you got used to in pandas. So in my opinion, it’s better for your learning curve to get familiar
with this solution.
Either way, let’s see how this works!
Note: if you are looking for something eye-catching, check out the seaborn Python dataviz library.
Step #1: Import pandas and numpy, and set matplotlib
One of the advantages of using the built-in pandas histogram function is that you don’t have
to import any other libraries than the usual: numpy and pandas.
At the very beginning of your project (and of your Jupyter Notebook), run these two lines:
import numpy as np
import pandas as pd
Great! numpy and pandas are imported and ready to use.
And don’t forget to add the:
%matplotlib inline
line, either — so you can plot your charts into your Jupyter Notebook.
Step #2: Get the data!
mu = 176 #mean
sigma = 6 #stddev
sample = 250
np.random.seed(1)
height_m = np.random.normal(mu, sigma, sample).astype(int)
# height_f (the female heights) is generated the same way; the mean and the seed
# below are assumptions, just so the rest of the tutorial runs:
np.random.seed(0)
height_f = np.random.normal(170, sigma, sample).astype(int)
Run them!
For now, you don’t have to know what exactly happened above. (I’ll write a separate article
about the np.random function.) Just know that this generated two datasets, with 250 data points
in each. And because I fixed the parameter of the random generator (with
the np.random.seed() line), you’ll get the very same numpy arrays with the very same data points
that I have.
In the height_f dataset you’ll get 250 height values of female clients of our hypothetical gym.
In the height_m dataset there are 250 height values of male clients.
Step #3: Prepare the data!
The more complex your data science project is, the more things you should do before you can
actually plot a histogram in Python.
Preparing your data is usually more than 80% of the job…
But in this simpler case, you don’t have to worry about data cleaning (removing duplicates, filling
empty values, etc.). You just need to turn your height_m and height_f data into a pandas
DataFrame.
Run this line:
gym = pd.DataFrame({'height_f': height_f, 'height_m': height_m})
Great:
We have the heights of female and male gym members in one big 250-row dataframe.
gym
[OPTIONAL] Basics: Plotting line charts and bar charts in Python using pandas
Before we plot the histogram itself, I wanted to show you how you would plot a line chart and a
bar chart that shows the frequency of the different values in the data set… so you’ll be able to
compare the different approaches.
And of course, if you have never plotted anything in pandas before, creating a simpler line chart
first can be handy.
To put your data on a chart, just type the .plot() function right after the pandas dataframe you
want to visualize. By default, .plot() returns a line chart.
If you plot() the gym dataframe as it is:
gym.plot()
you’ll get this:
Uhh. Messy.
On the y-axis, you can see the different values of the height_m and height_f datasets. And the x-
axis shows the indexes of the dataframe — which is not very useful in this case.
So let’s tweak this further!
To get what we wanted to get (plot the occurrence of each unique value in the dataset), we have
to work a bit more with the original dataset. Let’s add a .groupby() with a .count() aggregate
function. (I wrote more about these in this pandas tutorial.)
gym.groupby('height_m').count()
If you plot the output of this, you’ll get a much nicer line chart:
gym.groupby('height_m').count().plot()
frequency of values
This is closer to what we wanted… except that line charts are meant to show trends. If you want to compare different values, you should use bar charts instead.
To turn your line chart into a bar chart, just add the bar keyword:
gym.groupby('height_m').count().plot.bar()
or:
gym.groupby('height_m').count().plot(kind='bar')
And of course, you should run this for the height_f dataset, separately:
gym.groupby('height_f').count().plot.bar()
This is how you visualize the occurrence of each unique value on a bar chart in Python…
But this is still not a histogram, right!?
So…
Step #4: Plot a histogram in Python!
Once you have your pandas dataframe with the values in it, it’s extremely easy to put that on a
histogram.
Type this:
gym.hist()
plotting histograms in Python
Yepp, compared to the bar chart solution above, the .hist() function does a ton of cool things for
you, automatically:
1. It does the grouping.
When using .hist() there is no need for the initial .groupby() function! .hist() automatically
groups your data into bins. (By default, into 10 bins.)
Note: again, “grouping into bins” is not the same as “grouping by unique values” — as a bin
usually contains a range of values.
2. It does the counting. (No need for .count() function either.)
3. It plots a histogram for each column in your dataframe that has numerical values in it.
So plotting a histogram (in Python, at least) is definitely a very convenient way to visualize the
distribution of your data.
If you want a different amount of bins/buckets than the default 10, you can set that as a
parameter. E.g:
gym.hist(bins=20)
Bonus: Plot your histograms on the same chart!
Sometimes, you want to plot histograms in Python to compare two different columns of your
dataframe.
In that case, it’s handy if you don’t put these histograms next to each other — but on the very
same chart.
It can be done with a small modification of the code that we have used in the previous section.
gym.plot.hist(bins=20)
Note: in this version, you called the .hist() function from .plot.
Anyway, since these histograms are overlapping each other, I recommend setting their
transparency to 70% by using the alpha parameter:
gym.plot.hist(bins=20, alpha=0.7)
So you can see both charts.
Conclusion
This is it!
Just as I promised: plotting a histogram in Python is easy… as long as you want to keep it simple.
You can make this complicated by adding more parameters to display everything more nicely.
But you don’t have to…
Anyway, these were the basics. Just use the .hist() or the .plot.hist() functions on the dataframe
that contains your data points and you’ll get beautiful histograms that will show you the
distribution of your data.
And don’t stop here, continue with the pandas tutorial episode #5 where I’ll show you how to
plot a scatter plot in pandas.
If you want to learn more about how to become a data scientist, take my 50-minute
video course: How to Become a Data Scientist. (It’s free!)
Also check out my 6-week online course: The Junior Data Scientist’s First Month video
course.
Cheers,
Tomi Mester
Pandas tutorial 5: Scatter plot with pandas and matplotlib
Written by Tomi Mester on June 11, 2020
Scatter plots are frequently used in data science and machine learning projects. In this pandas
tutorial, I’ll show you two simple methods to plot one. Both solutions will be equally useful and
quick:
one will be using pandas (more precisely: pandas.plot.scatter())
the other one using matplotlib (matplotlib.pyplot.scatter())
Let’s see them — and as usual: I’ll guide you through step by step.
Note: If you don’t know anything about pandas (or Python), you might want to start here:
1. Python libraries and packages for Data Scientists
2. Learn Python from Scratch
3. Pandas Tutorial 1 (Basics)
4. Pandas Tutorial 2 (Aggregation and grouping)
5. Pandas Tutorial 3 (Data Formatting)
6. Pandas Tutorial 4 (Plotting in pandas: Bar Chart, Line Chart, Histogram)
Download the code base!
This is a hands-on tutorial, so it’s best if you do the coding part with me!
You can also find the whole code base for this article (in Jupyter Notebook format) here: Scatter
plot in Python.
You can download it from: here.
What is a scatter plot? And what is it good for?
Scatter plots are used to visualize the relationship between two (or sometimes three) variables
in a data set. The idea is simple:
you take a data point,
you take two of its variables,
the y-axis shows the value of the first variable,
the x-axis shows the value of the second variable
Following this concept, you display each and every datapoint in your dataset. You’ll get
something like this:
Boom! This is a scatter plot. At least, the easiest (and most common) example of it.
This particular scatter plot shows the relationship between the height and weight of people from
a random sample. Again:
y-axis shows the height
x-axis shows the weight
and each blue dot represents a person in this dataset
So, for instance, this person’s (highlighted with red) weight and height is 66.5 kg and 169 cm.
How to read a scatter plot?
Scatter plots play an important role in data science – especially in building/prototyping machine
learning models. Looking at the chart above, you can immediately tell that there’s a strong
correlation between weight and height, right? As we discussed in my linear regression article,
you can even fit a trend line (a.k.a. regression line) to this data set and try to describe this
relationship with a mathematical formula.
Something like this:
Note: this article is not about regression machine learning models, but if you want to get started with
that, go here: Linear Regression in Python using numpy + polyfit (with code base)
The above is called a positive correlation: the greater the height value, the greater the expected weight value. (Of course, this is a generalization of the data set. There are always
exceptions and outliers!)
But it’s also possible that you’ll get a negative correlation:
And in real-life data science projects, you’ll see no correlation often, too:
Anyway: if you see a sign of positive or negative correlation between two variables in a data
science project, that’s a good indicator that you found something interesting — something that’s
worth digging deeper into. Well, in 99% of cases it will turn out to be either a triviality, or a
coincidence. But in the remaining 1%, you might find gold!
Okay, I hope I set your expectations about scatter plots high enough.
It’s time to see how to create one in Python!
Scatter plot in pandas and matplotlib
As I mentioned before, I’ll show you two ways to create your scatter plot.
You’ll see here the Python code for:
a pandas scatter plot
and
a matplotlib scatter plot
The two solutions are fairly similar, the whole process is ~90% the same… The only difference is
in the last few lines of code.
Note: By the way, I prefer the matplotlib solution because I find it a bit more transparent.
I’ll guide you through these 4 steps:
1. Importing pandas, numpy and matplotlib
2. Getting the data
3. Preparing the data
4. Plotting a scatter plot
Step #1: Import pandas, numpy and matplotlib!
Just as we have done in the histogram article, as a first step, you’ll have to import the libraries
you’ll use. And you’ll also have to make a small tweak in your Jupyter environment.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
The first two lines will import pandas and numpy.
The third line will import the pyplot from matplotlib — also, we will refer to it as plt.
And %matplotlib inline sets your environment so you can directly plot charts into your Jupyter
Notebook!
Great!
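Step #2: Get the data!
For this tutorial you’ll need two numpy arrays – height and weight – with 250 data points each. Here’s a minimal sketch to generate them, assuming normally distributed heights and a noisy linear height-weight relationship (every number below is an assumption):
import numpy as np

np.random.seed(0)
height = np.random.normal(171, 6, 250)                      # 250 heights in cm
weight = (height - 100) * 0.9 + np.random.normal(0, 4, 250) # noisy linear relation to height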
Note: For now, you don’t have to know line by line what’s going on here. (I’ll write a separate article
about how numpy.random works.)
In the next step, we will push these data sets into pandas dataframes.
Step #3: Prepare the data!
Again, preparing, cleaning and formatting the data is a painful and time consuming process in
real-life data science projects. But in this tutorial, we are lucky, everything is prepared – the data
is clean – so you can push your height and weight data sets directly into a pandas dataframe
(called gym) by running this one line of code:
gym = pd.DataFrame({'height': height, 'weight': weight})
Note: If you want to experience the complexity of a true-to-life data science project, go and check out
my 6-week course: The Junior Data Scientist’s First Month!
Your gym dataframe should look like this.
Perfect: ready for putting it on a scatter plot!
Step #4a: Pandas scatter plot
Okay, all set, we have the gym dataframe. Let’s create a pandas scatter plot!
Now, this is only one line of code and it’s pretty similar to what we had for bar charts, line charts
and histograms in pandas…
It starts with: gym.plot …and then you simply have to define the chart type that you want to plot,
which is scatter(). But when plotting a scatter plot in pandas, you’ll always have to specify
the x and y values as parameters, too. (This could seem unusual because for bar and line charts, you
didn’t have to do anything similar to this.)
So the final line of code will be:
gym.plot.scatter(x = 'weight', y = 'height')
The x and y values – by definition – have to come from the gym dataframe, so you have to refer
to the column names: 'weight' and 'height'!
A quick comment: Watch out for all the apostrophes! I know from my live workshops that the
syntax might seem tricky at first. But you’ll get used to it after your 5th or 6th scatter plot, I
promise!
That’s it! You have plotted a scatter plot in pandas!
Step #4b: Matplotlib scatter plot
In my opinion, this matplotlib solution is a bit more elegant. But from a technical standpoint — and for
results — both solutions are equally great.
Anyway, type and run these three lines:
x = gym.weight
y = gym.height
plt.scatter(x,y)
x = gym.weight
This line defines what values will be displayed on the x-axis of the scatter plot. It’s
the weight column again from the gym dataset. (Note: This is in pandas Series format… But
in this specific case, I could have passed the original numpy array, too.)
y = gym.height
On the y-axis we want to display the gym.height values. (This is in pandas Series format,
too!)
plt.scatter(x,y)
And then this line does the plotting. Remember, you defined plt at the very beginning of your
Jupyter notebook (import matplotlib.pyplot as plt) — and so plt refers
to matplotlib.pyplot! And the x and y values are parameters that have been defined in the
previous two lines.
Again: this is slightly different (and in my opinion slightly nicer) syntax than with pandas.
But the result is exactly the same.
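(Strictly speaking, there is one small cosmetic difference: pandas labels the axes automatically from the column names, while plt.scatter leaves them empty. If you want the same labels here, you can add them yourself:)
plt.scatter(x, y)
plt.xlabel('weight')   # pandas added these labels automatically
plt.ylabel('height')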
Conclusion
This is how you make a scatter plot in pandas and/or in matplotlib. I think it’s fairly easy and I
hope you think the same. If you haven’t done so yet, check out my Python histogram tutorial,
too! If you have any questions, leave a comment below!
If you want to learn more about how to become a data scientist, take my 50-minute
video course: How to Become a Data Scientist. (It’s free!)
Also check out my 6-week online course: The Junior Data Scientist’s First Month video
course.
Cheers,
Tomi Mester
Linear Regression in Python using numpy + polyfit (with
code base)
Written by Tomi Mester on February 20, 2020
Last updated on March 09, 2020
I always say that learning linear regression in Python is the best first step towards machine
learning. Linear regression is simple and easy to understand even if you are relatively new to
data science. So spend time on understanding it 100%! If you get a grasp of its logic, it will serve
you as a great foundation for more complex machine learning concepts in the future.
In this tutorial, I’ll show you everything you’ll need to know about it: the mathematical
background, different use-cases and most importantly the implementation. We will do that in
Python — by using numpy (polyfit).
Note: This is a hands-on tutorial. I highly recommend doing the coding part with me! If you haven’t
done so yet, you might want to go through these articles first:
1. How to install Python, R, SQL and bash to practice data science!
2. Python libraries and packages for Data Scientists
3. Learn Python from Scratch
Download the code base!
Find the whole code base for this article (in Jupyter Notebook format) here:
Linear Regression in Python (using Numpy polyfit)
Download it from: here.
The mathematical background
Remember when you learned about linear functions in math classes?
I have good news: that knowledge will become useful after all!
Here’s a quick recap!
For linear functions, we have this formula:
y = a*x + b
In this equation, usually, a and b are given. E.g:
a=2
b=5
So:
y = 2*x + 5
Knowing this, you can easily calculate all y values for given x values.
E.g.
when x is… y is…
0 2*0 + 5 = 5
1 2*1 + 5 = 7
2 2*2 + 5 = 9
3 2*3 + 5 = 11
4 2*4 + 5 = 13
…
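By the way, if you’d rather let Python do this arithmetic, a tiny sketch like this one reproduces the table above:
a = 2
b = 5
for x in range(5):
    print(x, a * x + b)   # prints the x-y pairs: (0, 5), (1, 7), (2, 9), ...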
If you put all the x–y value pairs on a graph, you’ll get a straight line.
The a variable is often called slope because – indeed – it defines the slope of the line.
The b variable is called the intercept. b is the value where the plotted line intersects the y-
axis. (Or in other words, the value of y is b when x = 0.)
This is all you have to know about linear functions for now…
But why did I talk so much about them?
Because linear regression is nothing else but finding the exact linear function equation (that is:
finding the a and b values in the y = a*x + b formula) that fits your data points the best.
Note: Here’s some advice if you are not 100% sure about the math. The most intuitive way to
understand the linear function formula is to play around with its values. Change the a and b variables
above, calculate the new x-y value pairs and draw the new graph. Repeat this as many times as
necessary. (Tip: try out what happens when a = 0 or b = 0!) By seeing the changes in the value pairs
and on the graph, sooner or later, everything will fall into place.
A typical linear regression example
Machine learning – just like statistics – is all about abstractions. You want to simplify reality so
you can describe it with a mathematical formula. But to do so, you have to ignore natural
variance — and thus compromise on the accuracy of your model.
If this sounds too theoretical or philosophical, here’s a typical linear regression example!
We have 20 students in a class and we have data about a specific exam they have taken. Each
student is represented by a blue dot on this scatter plot:
the X axis shows how many hours a student studied for the exam
the Y axis shows the scores that she eventually got
E.g. one of them studied 24 hours and her test result was 58%.
We have 20 data points (20 students) here.
By looking at the whole data set, you can intuitively tell that there must be a correlation between
the two factors. If one studies more, she’ll get better results on her exam. But you can see the
natural variance, too. For instance, three students who studied for ~30 hours got very different
scores: 74%, 65% and 40%.
Anyway, let’s fit a line to our data set — using linear regression.
Nice, we got a line that we can describe with a mathematical equation – this time, with a linear
function. The general formula was:
y = a*x + b
And in this specific case, the a and b values of this line are:
a = 2.01
b = -3.9
So the exact equation for the line that fits this dataset is:
y = 2.01*x - 3.9
And how did I get these a and b values? By using machine learning.
If you know enough x–y value pairs in a dataset like this one, you can use linear regression
machine learning algorithms to figure out the exact mathematical equation (so
the a and b values) of your linear function.
Linear regression terminology
Before we go further, I want to talk about the terminology itself — because I see that it confuses
many aspiring data scientists. Let’s fix that here!
Okay, so one last time, this was our linear function formula:
y = a*x + b
The a and b variables:
The a and b variables in this equation define the position of your regression line and I’ve already
mentioned that the a variable is called slope (because it defines the slope of your line) and
the b variable is called intercept.
In the machine learning community the a variable (the slope) is also often called the regression
coefficient.
The x and y variables:
The x variable in the equation is the input variable — and y is the output variable.
This is also a very intuitive naming convention. For instance, in this equation:
y = 2.01*x - 3.9
If your input value is x = 1, your output value will be y = -1.89.
But in machine learning these x-y value pairs have many alternative names… which can cause
some headaches. So here are a few common synonyms that you should know:
input variable (x) – output variable (y)
independent variable (x) – dependent variable (y)
predictor variable (x) – predicted variable (y)
feature (x) – target (y)
See, the confusion is not an accident… But at least, now you have your linear regression
dictionary here.
How does linear regression become useful?
Having a mathematical formula – even if it doesn’t 100% perfectly fit your data set – is useful for
many reasons.
1. Predictions: Based on your linear regression model, if a student tells you how much she
studied for the exam, you can come up with a pretty good estimate: you can predict her
results even before she writes the test. Let’s say someone studied 20 hours; it means that
her predicted test result will be 2.01 * 20 - 3.9 = 36.3.
2. Outliers: If something unexpected shows up in your dataset – someone is way too far
from the expected range…
… let’s say, someone who studied only 18 hours but got almost 100% on the exam… Well, that
student is either a genius — or a cheater. But she’s definitely worth the teachers’ attention,
right? By the way, in machine learning, the official name of these data points is outliers.
And both of these examples can be translated very easily to real life business use-cases, too!
Predictions are used for: sales predictions, budget estimations, in manufacturing/production, in
the stock market and in many other places. (Although, usually these fields use more sophisticated
models than simple linear regression.)
Finding outliers is great for fraud detection. And it’s widely used in the fintech industry. (E.g.
preventing credit card fraud.)
The limitations of machine learning models
It’s good to know that even if you find a very well-fitting model for your data set, you still have
to reckon with some limitations.
Note: These are true for essentially all machine learning algorithms — not only for linear regression.
Limitation #1: a model is never a perfect fit
As I said, fitting a line to a dataset is always an abstraction of reality. Describing something with a
mathematical formula is sort of like reading a short summary of Romeo and Juliet. You’ll get
the essence… but you will miss out on all the interesting, exciting and charming details.
Similarly, in data science, “compressing” your data into one simple linear function comes with
losing the whole complexity of the dataset: you’ll ignore natural variance.
But in many business cases, that can be a good thing. Your mathematical model will be simple
enough that you can use it for your predictions and other calculations.
Note: One big challenge of being a data scientist is to find the right balance between a too-simple and
an overly complex model — so the model can be as accurate as possible. (This problem even has a
name: bias-variance tradeoff, and I’ll write more about this in a later article.)
But a machine learning model – by definition – will never be 100% accurate.
Limitation #2: you can’t go beyond the range of your historical data
Many data scientists try to extrapolate their models and go beyond the range of their data.
For instance, in our case study above, you had data about students studying for 0-50 hours. The
dataset didn’t feature any student who studied 60, 80 or 100 hours for the exam. These values
are out of the range of your data. If you wanted to use your model to predict test results for
these “extreme” x values… well, you would get nonsensical y values:
E.g. your model would say that someone who studied x = 80 hours would get:
y = 2.01*80 - 3.9 = 156.9% on the test.
…but 100% is the obvious maximum, right?
The point is that you can’t extrapolate your regression model beyond the scope of the data that
you have used to create it. Well, in theory, at least...
Because I have to admit that in real life data science projects, sometimes, there is no way around
it. If you have data about the last 2 years of sales — and you want to predict the next month, you
have to extrapolate. Even so, we always try to be very careful and not look too far into the
future. The further you get from your historical data, the worse your model’s accuracy will be.
Linear Regression in Python
Okay, now that you know the theory of linear regression, it’s time to learn how to get it done in
Python!
Let’s see how you can fit a simple linear regression model to a data set!
Well, in fact, there is more than one way of implementing linear regression in Python. Here, I’ll
present my favorite — and in my opinion the most elegant — solution. I’ll use numpy and
its polyfit method.
We will go through these 6 steps:
1. Importing the Python libraries we will use
2. Getting the data
3. Defining x values (the input variable) and y values (the output variable)
4. Machine Learning: fitting the model
5. Interpreting the results (coefficient, intercept) and calculating the accuracy of the model
6. Visualization (plotting a graph)
Note: You might ask: “Why isn’t Tomi using sklearn in this tutorial?” I know that (in online tutorials at
least) numpy and its polyfit method are less popular than the Scikit-learn alternative… true. But in my
opinion, numpy’s polyfit is more elegant, easier to learn — and easier to maintain in
production! sklearn’s linear regression function changes all the time, so if you implement it in
production and you update some of your packages, it can easily break. I don’t like that. Besides, the
way it’s built and the extra data-formatting steps it requires seem somewhat strange to me. In my
opinion, sklearn is highly confusing for people who are just getting started with Python machine
learning algorithms. (By the way, I had the sklearn LinearRegression solution in this tutorial… but I
removed it. That’s how much I don’t like it. So trust me, you’ll like numpy + polyfit better, too. :-))
Linear Regression in Python – using numpy + polyfit
Fire up a Jupyter Notebook and follow along with me!
Note: Find the code base here and download it from here.
STEP #1 – Importing the Python libraries
Before anything else, you want to import a few common data science libraries that you will use
in this little project:
numpy
pandas (you will store your data in pandas DataFrames)
matplotlib.pyplot (you will use matplotlib to plot the data)
Note: if you haven’t installed these libraries and packages to your remote server, find out how to do
that in this article.
Start with these few lines:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
(The %matplotlib inline is there so you can plot the charts right into your Jupyter Notebook.)
To be honest, I almost always import all these libraries and modules at the beginning of my
Python data science projects, by default. But apart from these, you won’t need any extra
libraries: polyfit — which we will use for the machine learning step — comes with numpy.
STEP #2 – Getting the data
The next step is to get the data that you’ll work with. In this case study, I prepared the data and
you just have to copy-paste these two lines to your Jupyter Notebook:
students = {'hours': [29, 9, 10, 38, 16, 26, 50, 10, 30, 33, 43, 2, 39, 15, 44, 29, 41, 15, 24, 50],
'test_results': [65, 7, 8, 76, 23, 56, 100, 3, 74, 48, 73, 0, 62, 37, 74, 40, 90, 42, 58, 100]}
student_data = pd.DataFrame(data=students)
This is the very same data set that I used for demonstrating a typical linear regression example at
the beginning of the article. You know, with the students, the hours they studied and the test
scores.
Just print the student_data DataFrame and you’ll see the two columns with the value-pairs we
used.
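(In a Jupyter Notebook, typing the dataframe’s name into a cell is enough to display it:)
student_data   # displays all 20 rows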
the hours column shows how many hours each student studied
and the test_results column shows what their test results were
(So one line is one student.)
Of course, in real life projects, we instead open .csv files (with the read_csv function) or SQL
tables (with read_sql)… Regardless, the final format of the cleaned and prepared data will be a
similar dataframe.
So this is your data. You will fine-tune it and make it ready for the machine learning step.
Note: And another thought about real life machine learning projects… In this tutorial, we are working
with a clean dataset. That’s quite uncommon in real life data science projects. A big part of the data
scientist’s job is data cleaning and data wrangling: like filling in missing values, removing duplicates,
fixing typos, fixing incorrect character coding, etc. Just so you know.
STEP #3 – Defining the feature and target values
Okay, so we have the data set.
But we have to tweak it a bit — so it can be processed by numpy’s linear regression function.
The next required step is to break the dataframe into:
input (x) values: this will be the hours column
and output (y) values: and this is the test_results column
polyfit requires you to define your input and output variables in 1-dimensional format. For that,
you can use pandas Series.
Let’s type this into the next cell of your Jupyter notebook:
x = student_data.hours
y = student_data.test_results
Okay, the input and output — or, using their fancy machine learning names,
the feature and target — values are defined.
At this step, we can even put them onto a scatter plot, to visually understand our dataset.
It’s only one extra line of code:
plt.scatter(x,y)
And I want you to realize one more thing here: so far, we have done zero machine learning… This
was only old-fashioned data preparation.
STEP #4 – Machine Learning: Linear Regression (line fitting)
We have the x and y values… So we can fit a line to them!
The process itself is pretty easy.
Type this one line:
model = np.polyfit(x, y, 1)
This executes the polyfit method from the numpy library that we have imported before. It needs
three parameters: the previously defined input and output variables (x, y) — and an integer,
too: 1. This latter number defines the degree of the polynomial you want to fit.
Using polyfit, you can fit second, third, etc… degree polynomials to your dataset, too. (That’s not
called linear regression anymore — but polynomial regression. Anyway, more about this in a later
article…)
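(Just as an illustration, and not part of this tutorial’s solution: a second degree fit would only change that last parameter.)
model_quadratic = np.polyfit(x, y, 2)   # fits y = a*x**2 + b*x + c instead of a line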
But for now, let’s stick with linear regression and linear models – which will be a first degree
polynomial. So you should just put: 1.
When you hit enter, Python calculates every parameter of your linear regression model and
stores it into the model variable.
This is it, you are done with the machine learning step!
But how does this line fitting actually work? numpy’s polyfit uses the so-called ordinary least
squares (OLS) method. To see its logic, let’s take a data point from our dataset.
x = 24
In the original dataset, the y value for this datapoint was y = 58. But when you fit a simple linear
regression model, the model itself estimates only y = 44.3. The difference between the two is
the error for this specific data point.
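Here’s a quick sketch of that error calculation, assuming the model variable we created in the previous step:
x_point = 24
y_actual = 58
y_estimated = model[0] * x_point + model[1]   # the model's estimate (a*x + b)
error = y_actual - y_estimated                # the error for this data point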
So the ordinary least squares method has these 4 steps:
1) Let’s calculate all the errors between all data points and the model.
2) Let’s square each of these error values!
3) Then sum all these squared values!
4) Find the line where this sum of the squared errors is the smallest possible value.
That’s OLS and that’s how line fitting works in numpy polyfit’s linear regression solution.
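Although the article doesn’t walk through it, you can reproduce steps 1) to 3) with a few lines of numpy; np.polyval evaluates the fitted polynomial for every x value:
errors = y - np.polyval(model, x)   # step 1: error for every data point
squared_errors = errors ** 2        # step 2: square each error value
sse = squared_errors.sum()          # step 3: sum of the squared errors
polyfit then finds the coefficients for which this sum is the smallest possible (step 4).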
STEP #5 – Interpreting the results
Okay, so you’re done with the machine learning part. Let’s see what you got!
First, you can query the regression coefficient and intercept values for your model. You just have
to type:
model
Note: Remember, model is a variable that we used at STEP #4 to store the output of np.polyfit(x, y, 1).
The output is:
array([ 2.01467487, -3.9057602 ])
These are the a and b values we were looking for in the linear function formula.
2.01467487 is the regression coefficient (the a value) and -3.9057602 is the intercept
(the b value).
So we finally got our equation that describes the fitted line. It is:
y = 2.01467487 * x - 3.9057602
If a student tells you how many hours she studied, you can predict the estimated results of her
exam. Quite awesome!
You can do the calculation “manually” using the equation.
But there is a simple helper function for it in numpy — it’s called poly1d():
predict = np.poly1d(model)
hours_studied = 20
predict(hours_studied)
The result is: 36.38773723
Note: This is the exact same result that you’d have gotten if you put the hours_studied value in the
place of the x in the y = 2.01467487 * x - 3.9057602 equation.
So from this point on, you can use these coefficient and intercept values – and
the poly1d() method – to estimate unknown values.
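And since a poly1d object accepts whole arrays, too, you can estimate several students at once:
predict(np.array([10, 30, 50]))   # estimated test results for 10, 30 and 50 hours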
And this is how you do predictions by using machine learning and simple linear regression in
Python.
Well, okay, one more thing…
There are a few methods to calculate the accuracy of your model. In this article, I’ll show you
only one: the R-squared (R2) value. I won’t go into the math here (this article has gotten pretty
long already)… it’s enough if you know that the R-squared value is a number between 0 and 1.
And the closer it is to 1, the more accurate your linear regression model is.
Unfortunately, R-squared calculation is not implemented in numpy… so that one should be
borrowed from sklearn (so we can’t completely ignore Scikit-learn after all :-)):
from sklearn.metrics import r2_score
r2_score(y, predict(x))
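STEP #6 – Visualization
Finally, let’s plot the original data points and the fitted regression line on one chart! These four lines do exactly that; put them into one cell and run them:
x_lin_reg = range(0, 51)
y_lin_reg = predict(x_lin_reg)
plt.scatter(x, y)
plt.plot(x_lin_reg, y_lin_reg, c = 'r')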
Here’s a quick explanation:
x_lin_reg = range(0, 51)
This sets the range you want to display the linear regression model over — in our case it’s
between 0 and 50 hours.
y_lin_reg = predict(x_lin_reg)
This calculates the y values for all the x values between 0 and 50.
plt.scatter(x, y)
This plots your original dataset on a scatter plot. (The blue dots.)
plt.plot(x_lin_reg, y_lin_reg, c = 'r')
And this line finally draws the linear regression line — based on
the x_lin_reg and y_lin_reg values that we set in the previous two lines. (c = 'r' means that
the color of the line will be red.)
Nice, you are done: this is how you create linear regression in Python using numpy and polyfit.
This was only your first step toward machine learning
You are done with building a linear regression model!
But this was only the first step. In fact, this was only simple linear regression. But there
is multiple linear regression (where you can have multiple input variables), there
is polynomial regression (where you can fit higher degree polynomials) and many many more
regression models that you should learn. Not to mention the different classification models,
clustering methods and so on…
Here, I haven’t covered the validation of a machine learning model (e.g. when you break your
dataset into a training set and a test set), either. But I’m planning to write a separate tutorial
about that, too.
Anyway, I’ll get back to all these, here, on the blog!
So stay tuned!
Conclusion
Linear regression is the most basic machine learning model that you should learn.
If you understand every small bit of it, it’ll help you to build the rest of your machine learning
knowledge on a solid foundation.
Knowing how to use linear regression in Python is especially important — since that’s the
language that you’ll probably have to use in a real life data science project, too.
This article was only your first step! So stay with me and join the Data36 Inner Circle (it’s free).
If you want to learn more about how to become a data scientist, take my 50-minute
video course: How to Become a Data Scientist. (It’s free!)
Also check out my 6-week online course: The Junior Data Scientist’s First Month video
course.
Cheers,
Tomi Mester