Python Extra Tutorial
1. Type your Python command! It can be a multi-line command too – if you hit Return/Enter, it won't run; it will just start a new line in the same cell!
2. To actually run your command, hit SHIFT + ENTER!
3. Start typing and hit TAB! If it's possible, Jupyter will auto-complete your expression (e.g. for Python commands or for variables that you have already defined). If there is more than one possibility, you can choose from a drop-down menu.
Python Basics
Great! You have everything from the technical side to start coding in Python! Now this tutorial will start
off with the base concepts that you must learn before we go into how to use Python for Data Science.
The six base concepts will be:
1. Variables and data types
2. Data Structures in Python
3. Functions and methods
4. If statements
5. Loops
6. Python syntax essentials
Python Basics 1: Variables and Data types
In Python we like to assign values to variables. Why? Because it makes our code better — more flexible,
reusable and understandable. At the same time one of the trickiest things in coding is exactly this
“assignment concept.” When we refer to something, that refers to something, that refers to something…
well, understanding that needs some brain capacity. But don’t you worry, you will get used to it – and you
will love it!
Let’s see how it works!
Say we have a dog ('Freddie'), and we would like to store some of his attributes (name, age, is_vaccinated, birth_year, etc.) in Python variables. We will type this into a Jupyter Notebook cell:
dog_name = 'Freddie'
age = 9
is_vaccinated = True
height = 1.1
birth_year = 2001
Note: we could have done this one variable per cell. But this all-in-one solution was easier and more elegant.
From now on, if we type these variable names, the assigned values will be returned:

Variable        Value       Data type
dog_name        'Freddie'   str (short for string)
age             9           int (short for integer)
is_vaccinated   True        bool (short for Boolean)
height          1.1         float (floating-point number)

There are many more data types, but as a start, knowing these four will be good enough; the rest will come along the way.
It's important to know that in Python every variable is overwritable. E.g. if we now run:
dog_name = 'Eddie'
in our Jupyter Notebook, our dog won’t be Freddie any more…
What can we do with a and b? Well, first of all, a bunch of basic arithmetic operations! It's nothing special – you could have figured these out by common sense – but just in case, here's the list (remember: a is 3 and b is 4):

Operator   What does it do?       Result in our example
a + b      Adds a to b            7
a - b      Subtracts b from a     -1
a * b      Multiplies a by b      12
a / b      Divides a by b         0.75
Note: try it for yourself with your values in your Jupyter Notebook! It’s fun!
We can also use these variables with comparison operators. The results will always be Boolean values! (Remember? Booleans can only be True or False.) a and b are still 3 and 4.
Operator   What does it do?                   Result in our example
a == b     Is a equal to b?                   False
a != b     Is a not equal to b?               True
a < b      Is a less than b?                  True
a <= b     Is a less than or equal to b?      True
a > b      Is a greater than b?               False
a >= b     Is a greater than or equal to b?   False
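You can verify all of these in your notebook – a quick sketch, assuming a = 3 and b = 4 as before:

a = 3
b = 4
a == b   # False
a != b   # True
a < b    # True
a > b    # False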
This is easy and maybe less exciting, but again: just start to type this into your notebook, run your
commands and start to combine things – and it’s gonna be much more fun!
Speaking of which! Spice things up with some exercises!
Test yourself #1
Here are some new variables:
a=1
b=2
c=3
d = True
e = 'cool'
What will be the returned data type and the exact result of this operation?
a == e or d and c > b
Note: First try to find it out without typing it into Python – then check if you have guessed right!
.
.
.
The answer is: it’s gonna be a Boolean and it will be True.
Why? Because:
a == e is False – as 1 is not equal to ‘cool’
d is True by definition
c > b is True, because 3 is greater than 2
So a == e or d and c > b translates to: False or True and True, which is True.
Test yourself #2
Use the variables from the previous assignment:
a=1
b=2
c=3
d = True
e = 'cool'
But this time try to figure out the result of this slightly modified expression:
not a == e or d and not c > b
Uh-oh, wait a minute! There is a trick here! To give a proper answer you have to know one more rule!
The evaluation order of the logical operators is: 1. not 2. and 3. or
.
.
.
Here’s the solution: True.
Why?
Let’s see! Using the previous exercise’s logic, this is what we have:
not False or True and not True
As we have discussed, the first logical operator evaluated is the not. After firing all the nots, this is what we have:
True or True and False
Then comes the and: True and False is False, which leaves us with:
True or False
And finally the or makes the result True.
Conclusion
Done with episode 1!
Did you realize that you have just started to code in Python 3? Wasn’t it easy and fun?
Well, good news: the rest of Python is just as easy as this was. The difficulty will come from the
combination of these simple things… But that’s why learning the basics very well is so important!
So stay with me – in the next chapter of “Python for Data Science” I’ll introduce the most important Data
Structures in Python!
If you want to learn more about how to become a data scientist, take my 50-minute video
course: How to Become a Data Scientist. (It’s free!)
Also check out my 6-week online course: The Junior Data Scientist’s First Month video course.
Cheers,
Tomi Mester
Python Data Structures (Python for Data Science Basics #2)
Why care about Python Data Structures?
Imagine that you have a book on your desk. I have one on mine: P. & A. Bruce: Practical Statistics
for Data Scientists. If I want to store this info in Python, I can put it into a variable.
my_book = "Practical Statistics for Data Scientists"
Done!
But hey, I just spotted two more books on the other side of my desk: Dan Brown's Digital Fortress and George R. R. Martin's A Game of Thrones. How do I store these two new pieces of information? Maybe I can set up two new variables:
my_book2 = "Digital Fortress"
my_book3 = "A Game of Thrones"
Wait a minute! I’ve just realized I have a whole bookshelf behind me…
Do you see the problem? Sometimes in Python we need to store relevant information together in one object – instead of several small variables.
This is why we have Data Structures!
Python Data Structures #1: List
It's important to know that in Python, a list is an object – and generally speaking it's treated like any other data type (e.g. integers, strings, Booleans, etc.). This means that you can assign your list to a variable, so you can store it and access it more easily:
my_first_list = [3, 4, 1, 4, 5, 2, 7]
my_first_list
A list can hold any type of data, not just integers – strings, Booleans, even other lists.
Interesting, huh? Do you remember Freddie, the dog from the previous article?
You can store his attributes in one list instead of 5 different variables:
dog = ['Freddie', 9, True, 1.1, 2001]
Now let’s say that Freddie has two belongings: a bone and a little ball. We can store those
belongings as a list inside our first list.
dog = ['Freddie', 9, True, 1.1, 2001, ['bone', 'little ball']]
Actually we can do this list-in-a-list thingy infinite times – and believe it or not, this simple
concept (the official name is “nested lists,” by the way) will be essential when it comes to the
actual Data Science part of Python – e.g. when we create some multidimensional numpy arrays to
run correlation analyses… but let’s not get into it yet! The only thing you should remember is that
you can store lists in lists.
Or try this:
sample_matrix = [[1, 4, 9], [1, 8, 27], [1, 16, 81]]
Do you feel scientific? You should, because you have just created a 3-by-3 2D matrix.
How to access a specific element of a Python list?
Now that we have stored these values, it’s really essential to know how to access them in the
future. As you have already seen, you can get the whole Python list returned if you type the
right variable name.
E.g.
dog
But how do you call one particular item from your list? Firstly, think a bit about how you can
refer to a value in theory… The only thing that comes into play is the position of the value. E.g. if
you want to call the first element on the dog list, you have to type the name of the list and the
number of the element between brackets, like this: [1]. Try this:
dog[1]
What??? 9 was the second element on the list, not the first. Well, not in Python… Python uses
so-called “zero-based indexing”, which means that the first element’s number is [0], the second
is [1], the third is [2] and so on. This is something you have to keep in mind, when working with
Python Data Structures.
Note: “But why is that?” Hmm, tough topic! I don’t dare to say, “because of nerds…” So instead, I’ll just
link this nice open letter by Prof Dijkstra from
1982: http://www.cs.utexas.edu/users/EWD/transcriptions/EWD08xx/EWD831.html
Anyway, don’t think too much about this… Just accept and apply this strange rule!
But here’s a detailed example!
Freddie the dog:
dog = ['Freddie', 9, True, 1.1, 2001, ['bone', 'little ball']]
Try to print all the list elements one by one:
dog[0]
dog[1]
dog[2]
dog[3]
dog[4]
dog[5]
If this is not 100% clear yet, I suggest playing around a bit with the sample_matrix = [[1, 4, 9], [1,
8, 27], [1, 16, 81]] data set and you will learn the trick!
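To reference an element of a nested list, you simply chain the brackets – a quick sketch:

sample_matrix = [[1, 4, 9], [1, 8, 27], [1, 16, 81]]
sample_matrix[2]       # the third sub-list: [1, 16, 81]
sample_matrix[2][0]    # the first element of the third sub-list: 1

dog[5][1]              # 'little ball' – the second item of Freddie's belongings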
Python Data Structures #2: Tuples
What is a Python tuple? First of all: as a junior/aspiring Data Scientist, you don't have to care too much about tuples. If you want, you can even skip this section.
If you have stayed:
A Python tuple is almost the same as a Python list, with a few small differences.
1. Syntax-wise: when you set up a tuple, you won’t use brackets, but parentheses.
List:
book_list = ['A Game of Thrones', 'Digital Fortress', 'Practical Statistics for Data
Scientists']
Tuple:
book_tuple = ('A Game of Thrones', 'Digital Fortress', 'Practical Statistics for Data
Scientists')
2. A Python list is mutable – so you can add, remove and change items in it. On the other
hand, a Python tuple is immutable, so once it’s set up, it’s sort of “set in stone.” This
strictness can be handy in some cases to make your code safer.
3. Python tuples are slightly faster than Python lists with the same calculations.
Other than that, you can use a tuple pretty much the same way as a list. Even returning an item happens via the same bracket syntax. (Try book_tuple[1] for your freshly created tuple.)
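Here's a minimal sketch of that second difference, the immutability:

book_tuple = ('A Game of Thrones', 'Digital Fortress', 'Practical Statistics for Data Scientists')
book_tuple[1]                # 'Digital Fortress' – reading works just like with a list

# book_tuple[1] = 'Inferno'  # this would raise a TypeError:
                             # 'tuple' object does not support item assignment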
Again: none of the above will be your concern when you just start off with Python coding, but
it’s good to know at least this bit about tuples.
Python Data Structures #3: Dictionaries
Dictionaries are a whole different story. They are actually very different from lists – and very
commonly applied and useful in data science projects.
The main concept of dictionaries is that for every value you have a unique key. Take a look at
Freddie the dog again:
dog = ['Freddie', 9, True, 1.1, 2001, ['bone', 'little ball']]
These are the values that we want to store about a dog. In a dictionary you can assign a key to each of these values, so you can understand better what value stands for what.
dog_dict = {'name': 'Freddie', 'age': 9, 'is_vaccinated': True, 'height': 1.1, 'birth_year': 2001,
'belongings': ['bone', 'little ball']}
As you can see, Python already formats the output of a dictionary nicely. For better readability you can do the same yourself right from the start – let's put each key-value pair on a new line:
dog_dict = {'name': 'Freddie',
'age': 9,
'is_vaccinated': True,
'height': 1.1,
'birth_year': 2001,
'belongings': ['bone', 'little ball']}
As you can see, a nested list (belongings in this example) as a value in a dictionary is not a
problem.
And this is how you access a specific value in a Python dictionary – by referring to its key between brackets:
dog_dict['name']
Note 1: maybe you are wondering whether you can still use a number to call a dictionary value. You can't: a Python dictionary has no positional index, so none of the key-value pairs can be reached by a number – only by their key. (In older Python versions dictionaries didn't even keep their order; since Python 3.7 they preserve the insertion order, but they still can't be indexed by position.)
Note 2: maybe you are also wondering if you can return a key by inputting a value, and not just a value by inputting a key. Bad news: there is no direct way to do that – dictionaries are built for key-to-value lookups, not the other way around.
Test yourself!
Tadaaa! End of the article! It’s time to test yourself! You have learned a lot of important new
things about Python Data Structures today. If you haven’t been doing the “Test yourself”
sections in my articles, please make an exception for this one. Python Data Structures are
something that you will use all the time when you work as a Data Scientist, so do yourself a favor
and practice them now!
EXERCISE #4
test[0]['Arizona'] –» This is basically the next step of exercise #2 – we are calling the 'Phoenix' value with its key: 'Arizona'.
EXERCISE #5
test[4][2] –» And this one is related to exercise #3 – referring to 'jeans' by its number – don't forget the zero-based indexing.
EXERCISE #6
test[4][3]['socks2'] –» And one more step – calling the item of a dictionary of a nested list
within a list – by its key: 'socks2'.
Conclusion
Nice job! You are done with another Python tutorial! This is almost everything you have to know about Python Data Structures. Well, in fact, there will be a lot of small but important details (e.g. how to add, remove and change elements in a list or in a dictionary)… but to get there we need to talk a little bit about Python functions and methods, and some other exciting things first! Continue here:
Python Functions and Methods
If you want to learn more about how to become a data scientist, take my 50-minute
video course: How to Become a Data Scientist. (It’s free!)
Also check out my 6-week online course: The Junior Data Scientist’s First Month video
course.
Cheers,
Tomi Mester
Python Built-in Functions and Methods (Python for Data
Science Basics #3)
What are Python functions and methods?
Let’s start with the basics. Say we have a variable:
a = 'Hello!'
Here’s a simple example of a Python function:
len(a)
Result: 6
And an example for a Python method:
a.upper()
Result: 'HELLO!'
So what are Python functions and methods? In essence, they transform something into something else. In this case the input was 'Hello!' and the output was the length of this string (6), and then the capitalized version: 'HELLO!'. Of course, this function and this method are not the only ones you can use: there are plenty of them. Combining them will help you in every part of your data project – from data cleaning to machine learning. Everything.
print()
We have already used print(). It prints your stuff to the screen.
Example: print("Hello, World!")
abs()
returns the absolute value of a numeric value (e.g. integer or float). Obviously it can’t be a string.
It has to be a numeric value.
Example: abs(-4/3)
round()
returns the rounded value of a numeric value.
Example: round(-4/3)
min()
returns the smallest item of a list or of the typed-in arguments. It can even be a string.
Example 1: min(3,2,5)
Example 2: min('c','a','b')
max()
returns the largest item of a list or of the typed-in arguments. Just like min(), it works with strings too.
Example 1: max(3,2,5)
Example 2: max('c','a','b')
sum()
It sums a list. The list can have all types of numeric values, although it handles floats… well, not
smartly.
Example 1:
a = [3, 2, 1]
sum(a)
Example 2:
b = [4/3, 2/3, 1/3, 1/3, 1/3]
sum(b)
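Note: on most machines the second sum won't return exactly 3.0 but something like 2.9999999999999996 – that's standard floating-point imprecision, not a bug in your code.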
len()
returns the number of elements in a list or the number of characters in a string.
Example: len('Hello!')
type()
returns the type of the variable.
Example 1:
a = True
type(a)
Example 2:
b=2
type(b)
These are the built-in Python functions that you will use quite regularly. If you want to see all of
them, here’s the full list: https://docs.python.org/3/library/functions.html
But I’ll also show you more in my upcoming tutorials.
Methods for Python Strings
a.lower()
returns the lowercase version of a string.
Example:
a = 'MuG'
a.lower()
a.upper()
the opposite of lower()
a.strip()
if the string has whitespaces at the beginning or at the end, it removes them.
Example:
a = ' Mug '
a.strip()
a.replace('old', 'new')
replaces occurrences of a given substring with another string. Note that it's case sensitive.
Example:
a = 'muh'
a.replace('h','g')
a.split('delimiter')
splits your string into a list. Your argument specifies the delimiter.
Example:
a = 'Hello World'
a.split(' ')
Note: in this case the space is the delimiter.
'delimiter'.join(a)
It joins elements of a list into one string. You can specify the delimiter again.
Example:
a = ['Hello', 'World']
' '.join(a)
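These string methods can even be chained one after another – a quick sketch:

a = '  Hello World  '
a.strip().lower().replace('world', 'python').split(' ')
# result: ['hello', 'python']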
Methods for Python Lists
Do you remember the last article, where we went through the Python data structures? Let's talk a little bit about them again. Last time we discussed how to create a list and how to access its elements. But I haven't told you how to modify a list. Any guesses? Yes: you will need the Python list methods!
Let’s bring back our favorite Python Dog, Freddie:
dog = ['Freddie', 9, True, 1.1, 2001, ['bone', 'little ball']]
Let’s see how we can modify this list!
a.append(arg)
The .append() method adds an element to the end of our list. In this case, let’s say we want to
add the number of legs Freddie has (which is 4).
Example:
dog.append(4)
dog
a.remove(arg)
If we want to remove the birth year, we can do it using the .remove() method. We have to
specify the element that we want to remove and Python will remove the first item with that
value from the list.
dog.remove(2001)
dog
a.count(arg)
returns the number of occurrences of the specified value in a list.
Example:
dog.count('Freddie')
a.clear()
removes all elements of the list. It will basically delete Freddie. No worries, we will get him back.
Example:
dog.clear()
dog
By the way, here you can find the full list of list methods in
Python: https://docs.python.org/3/tutorial/datastructures.html
Methods for Python Dictionaries
As with lists, there are some important dictionary methods to learn about.
Here’s Freddie again (see, I told you he’d be back):
dog_dict = {'name': 'Freddie',
'age': 9,
'is_vaccinated': True,
'height': 1.1,
'birth_year': 2001,
'belongings': ['bone', 'little ball']}
dog_dict.keys()
will return all the keys from your dictionary.
dog_dict.values()
will return all the values from your dictionary.
dog_dict.clear()
will delete everything from your dictionary.
Note:
Adding an element to a dictionary doesn't require a method; you do it by simply defining a new key-value pair, like this:
dog_dict['key'] = 'value'
Eg.
dog_dict['name'] = 'Freddie'
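For instance, to add a brand-new key (a hypothetical 'favorite_toy' key, just for illustration):

dog_dict['favorite_toy'] = 'little ball'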
Okay, these are all the methods you should know for now! We went through string, list and
dictionary Python methods!
It’s time to test yourself!
Test yourself!
For this exercise you will have to use not only what you have learned today, but what you have
learned about Python Data Structures and variable types too! Okay, let’s see:
1. Take this list:
test_yourself = [1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 5]
2. Calculate the mean of the list elements – by using only those things that you have read in
this and the previous articles!
3. Calculate the median of the list elements – by using only those things that you have read
in this and the previous articles!
.
.
.
And the solutions are:
2) sum(test_yourself) / len(test_yourself)
Here sum() adds up the numbers and len() counts the elements; dividing the first by the second returns the mean. The result is: 2.909090909090909
3) test_yourself[round(len(test_yourself) / 2) - 1]
We are lucky to have a list with an odd number of elements.
Note: this formula won’t work for a list with an even number of elements.
len(test_yourself) / 2 basically tells us where in the list to look for our middle number – which will be the median. The result is 5.5; in fact, for a list with an odd number of elements, len() / 2 is always 0.5 less than the (1-based) position of the middle element. So let's round this 5.5 up to 6 by using round(len(test_yourself) / 2). That's right: we can put a function inside a function. (One caveat: Python 3's round() rounds halves to the nearest even number, so round(5.5) is 6 but round(2.5) is 2 – this trick therefore doesn't work for every odd list length, e.g. a 5-element list.) Then subtract one because of the zero-based indexing: round(len(test_yourself) / 2) - 1
And eventually use this result as the index of the list: test_yourself[round(len(test_yourself) / 2) -
1] or replace it with the exact number: test_yourself[5]. The result is: 3.
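For the even-length case the note above mentions, you'd average the two middle elements instead – just as a sketch:

test_even = [1, 2, 3, 4]
middle = int(len(test_even) / 2)                  # 2
(test_even[middle - 1] + test_even[middle]) / 2   # (2 + 3) / 2 = 2.5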
What’s the difference between Python functions and methods?
After reading this far in the article, I bet you have this question: “Why on Earth do we have both
functions and methods, when they practically do the same thing?”
I remember that when I started learning Python, I had a hard time answering this question. This
is still the most confusing topic for newcomers in the Python-world… The full answer is very
technical and you are not there yet. But here’s a little help for you to avoid confusion.
Firstly, start with the obvious. There is a clear difference in the syntax:
A function looks like this: function(something)
And a method looks like this: something.method()
(Look at the examples above!)
So why do we have both methods and functions in Python? The official answer is that there is a
small difference between them. Namely: a method always belongs to an object (e.g. in
the dog.append(4) method .append() needed the dog object to be applicable), while a function
doesn’t necessarily. To make this answer even more twisted: a method is in fact nothing else but
a specific function. Got it? All methods are functions, but not all functions are methods!
If this makes no sense to you (yet), don’t you worry. I promise, the idea will grow on you as you
use Python more and more – especially when you start to define your own functions and
methods.
But just in case, here’s a little extra advice from me:
In the beginning, learning Python functions and methods will be like learning the articles (der,
die, das) of the German language. You have to learn the syntax, use it the way you have learned
and that’s it.
Just like in German, there are some general rules of thumb that can help you recall things. The main one is that functions are usually applicable to multiple types of objects, while methods are not. E.g. sorted() is a function and it works with strings, lists, tuples, etc. – while .upper() is a method and it only works with strings.
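A quick sketch of this rule of thumb:

sorted('bca')         # ['a', 'b', 'c'] – the function accepts many types
sorted([3, 1, 2])     # [1, 2, 3]
'bca'.upper()         # 'BCA' – the method belongs to string objects
# [3, 1, 2].upper()   # this would raise an AttributeError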
But again: my general advice here is that you should not put too much effort into understanding
the difference between methods and functions at this point; just learn the ones I mentioned in
this article and you’ll be a happy Python user.
Conclusion
Great, you have learned 20+ Python methods and functions. This is a good start, but remember:
these are only the basics. In the next episodes, we will rapidly extend this list by importing new
data science Python libraries with new functions and new methods!
As a next step, let’s learn a bit about loops and if statements! Here is the link to continue: Python
If Statements (Explained).
If you want to learn more about how to become a data scientist, take my 50-minute
video course: How to Become a Data Scientist. (It’s free!)
Also check out my 6-week online course: The Junior Data Scientist’s First Month video
course.
Cheers,
Tomi Mester
Python If Statements Explained (Python for Data Science
Basics #4)
Written by Tomi Mester on January 8, 2018
Last updated on August 03, 2019
We use if statements in our everyday life all the time – even if our everyday life is not written in
Python. If the light is green then I’ll cross the road; otherwise I’ll wait. If the sun is up then I’ll get
out of bed; otherwise I’ll go back to sleep. Okay, maybe it’s not this direct, but when we take
actions based on conditions, our brain does what a computer would do: evaluate the conditions
and act upon the results. Well, a computer script doesn’t have a subconscious mind, so for
practicing data science we have to understand how an if statement works and how we can apply
it in Python!
Let’s say we have two values: a = 10 and b = 20. We compare these two values: a == b. This
comparison has either a True or a False output. (Test it in your Jupyter Notebook!)
Let's feed this comparison into an if statement:
a = 10
b = 20
if a == b:
    print('yes')
else:
    print('no')
Run this mini script in your Jupyter Notebook! The result will be (obviously): no.
Now, try the same – but set b to 10!
a = 10
b = 10
if a == b:
    print('yes')
else:
    print('no')
The returned message is yes.
Python if statement syntax
Let’s take a look at the syntax, because it has pretty strict rules.
The basics are simple:
You have:
1. an if keyword, then
2. a condition, then
3. a statement, then
4. an else keyword, then
5. another statement.
Two more details are easy to miss here: the colon (:) at the end of the if and else lines, and the indentation of the statements under them. If you miss either of these, an error message will be returned saying "invalid syntax" and your Python script will fail.
Of course, you can make this even more complex if you want, but the point is: having multiple
operators in an if statement is absolutely possible – in fact, it’s pretty common in real life
scenarios!
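For instance, a quick sketch with two conditions joined by and:

a = 10
b = 20
if a == 10 and b == 20:
    print('both conditions are true')
else:
    print('at least one condition is false')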
Another example:
a = 10
b = 11
c = 10
if a == b:
    print('first condition is true')
elif a == c:
    print('second condition is true')
else:
    print('nothing is true. existence is pain.')
Sure enough the result will be "second condition is true".
You can add as many elif branches as you like, building up a huge if-elif-elif-…-elif-else sequence!
Aaand… This was more or less everything you have to know about Python if statements. It’s time
to:
Test yourself!
Here’s a random integer: 918652728452151.
First, I’d like to know 2 things about this number:
Is it divisible by 17?
Does it have more than 12 digits?
If both of these conditions are true, then I want to print “super17“.
And if either of the conditions are false, then I’d like to run a second test on it:
Is it divisible by 13?
Does it have more than 10 digits?
If both of these two new conditions are true, then I want to print “awesome13“.
And if the original number is classified as neither "super17" nor "awesome13", then I'll just print:
"meh, this is just an average random number".
So: is 918652728452151 a super17, an awesome13 or just an average random number?
Okay! Ready. Set. Go!
The solution
918652728452151 is a super17 number!
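Here's a sketch of one possible solution, built directly from the task description:

x = 918652728452151
if x % 17 == 0 and len(str(x)) > 12:
    print("super17")
elif x % 13 == 0 and len(str(x)) > 10:
    print("awesome13")
else:
    print("meh, this is just an average random number")

The if line checks both conditions at once: x % 17 == 0 tests the divisibility by 17, and len(str(x)) > 12 turns the number into a string and counts its digits. Our number passes both tests, so super17 is printed.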
Had the first condition been False, the elif line would have run the same kind of check – only it would have tested the divisibility by 13 (and not 17), and the number of digits should have been greater than 10 (and not 12). If that weren't true either, the else statement would have run to print("meh, this is just an average random number").
That’s the solution! Wasn’t too difficult, was it?
Summary
If statements are widely used in every programming language. Now you know how to use them too! The logic is super clear, and on top of that, in Python the syntax almost reads like plain English…
Anyway. This was my introduction into Python If Statements. Next time we will continue
with Python for loops!
If you want to learn more about how to become a data scientist, take my 50-minute
video course: How to Become a Data Scientist. (It’s free!)
Also check out my 6-week online course: The Junior Data Scientist’s First Month video
course.
Cheers,
Tomi Mester
Python For Loops Explained (Python for Data Science Basics
#5)
Written by Tomi Mester on January 17, 2018
Last updated on August 03, 2019
Remember that I told you last time that Python if statements are similar to how our brain
processes conditions in our everyday life? That’s true for for loops too. You go through your
shopping list until you’ve collected every item from it. The dealer gives a card for each player
until everyone has five. The athlete does push-ups until reaching one-hundred… Loops
everywhere! As for for loops in Python: they are perfect for processing repetitive programming
tasks. In this article, I’ll show you everything you need to know about them: the syntax, the logic
and best practices too!
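For example, take Freddie's list from the Data Structures article and loop through it – a quick sketch:

dog = ['Freddie', 9, True, 1.1, 2001, ['bone', 'little ball']]
for i in dog:
    print(i)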
The result is the elements of the list one by one, in separate lines:
Freddie
9
True
1.1
2001
['bone', 'little ball']
Wonderful!
But how does this actually become useful? Take another example, a list of numbers:
numbers = [1, 5, 12, 91, 102]
Say that we want to square each of the numbers in this list!
Note: Unfortunately the numbers * numbers formula won't work… I know, it might sound logical at first, but as you get deeper into Python you will see that in fact it is not logical at all.
We have to do this:
for i in numbers:
    print(i * i)
The result will be:
1
25
144
8281
10404
Let’s break down this flowchart and study all the little details… I’ll guide you through it step by
step. As an example I’ll use the previous script with the numbers and their squared values:
numbers = [1, 5, 12, 91, 102]
for i in numbers:
    print(i * i)
1.) Define an iterable! (E.g. the list we defined earlier: numbers = [1, 5, 12, 91, 102]).
2.) When you set up a for loop, the first line will look pretty similar to this: for i in numbers:
Here, for and in are Python keywords and numbers is the name of our list… But what I want to talk
more about in this section is the i variable. It’s a “temporary” variable and its only role is to store
the given element of the list that we will work with in the given iteration of the loop. Even if this
variable is called i most of the time (in online tutorials or books for instance), it’s good to know
that the naming is totally arbitrary. It could not only be i (for i in numbers), but anything else,
like x (for x in numbers) or hello (for hello in numbers) or whatever you prefer… The point is, set
a variable and don’t forget that you have to refer to it when you want to use it inside the loop.
3.) Let's stick with our numbers example! We take the first element of our iterable (well, again – because of zero-based indexing – technically it's the 0th element of the list), and the first iteration of the loop runs. The 0th element of our list is 1, so the i variable is set to 1.
Note: More info about zero-based indexing: here.
4.) Python checks whether there is an element to process in the given iteration – there is, so the loop's body runs.
5.) The function itself, inside the loop, was print(i * i). As i = 1, the result of i * i will be 1, and 1 will be printed to our screen.
6.) The loop starts over.
7.) We take the next element and since there is an actual next element of the list, the second
iteration of the loop will run! The 1st element of the numbers list is 5.
8.) So i is 5. print(i * i) runs again and the result is printed to our screen: 25.
9.) The loop starts over. We take the next element.
10.) There is a next element. So here comes the third iteration. The 2nd element of the numbers
list is 12.
11.) print(i * i) is 144.
12.) The loop starts over. The next element exists. The iteration runs again.
13.) The 3rd element is 91. The squared value of it is 8281.
14.) The loop starts over. Next element exists. The iteration runs again.
15.) i is 102. The squared value of it is 10404.
16.) The loop starts over. But there is no more “next element.” So the loop ends.
This is a very, very detailed explanation for a 3-line script, right? Don't worry: it's enough if you crunch through this once. In the future, you can just go ahead and use those 3 simple lines, because the underlying logic will be in the back of your mind! I find it very important to write this down, though, because many junior data professionals do not have this logic in the back of their minds… and that reduces the quality of their Python scripts.
Iterating through strings
Okay, going forward!
As I mentioned earlier, you can use other sequences than lists too. Let’s try a string:
my_list = "Hello World!"
for i in my_list:
    print(i)
The result: the characters of the string, printed one by one, each in a new line.
Iterating through range() objects
range() is a built-in function in Python and we use it almost exclusively within for loops. What
does it do? In a nutshell: it generates a list of numbers. Let’s see how it works:
my_list = range(0,10)
for i in my_list:
    print(i)
Note: the first element and the step attributes are optional. If you don’t specify them, then the first
element will be 0 and the step will be 1 by default. Try this in your Jupyter Notebook and check the
result:
my_list = range(10)
for i in my_list:
    print(i)
When can range() be useful? Mostly in these two cases:
1.) You want to go through numbers. For instance, you want to cube the integers between 0 and 9? Not a problem:
my_list = range(0, 10)
for i in my_list:
    print(i * i * i)
2.) You want to go through a list but want to keep the indexes of the elements too.
my_list = [1, 5, 12, 91, 102]
my_list_length = len(my_list)
for i in range(0, my_list_length):
    print(i, my_list[i] * my_list[i])
In this case i will be the index and you can get the actual elements of the list with
the my_list[i] syntax – just as we have learned in the Python Data Structures article.
Anyway: use range() – it will make your job with Python for loops easier!
One more common pitfall: you can't concatenate strings and integers in one print() function by simply using the + sign. This is more a print-function thing than a for-loop thing, but most of the time you will meet this issue in for loops. If you run into the resulting TypeError, one of the good solutions is turning your integers into strings by using the str() function.
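A quick sketch with the numbers list from earlier – the commented-out version raises a TypeError, the second one works:

numbers = [1, 5, 12, 91, 102]

# for i in numbers:
#     print("The square of " + i + " is " + i * i)   # TypeError!

for i in numbers:
    print("The square of " + str(i) + " is " + str(i * i))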
While we're at it, here's the solution to one of this tutorial's exercises: printing a string character by character, first building it up, then tearing it back down.

my_string = "Hello World!"   # any sample string works here
x = 0

for i in my_string:
    x = x + 1
    print(my_string[0:x])

for i in my_string:
    x = x - 1
    print(my_string[0:x])

I think the solution is quite self-explanatory. The only trick is that I set a "counter variable" called x that always shows the number of characters I want to print to the screen in the given iteration. In the first for loop this goes up until I reach the maximum number of characters. After that, in the second for loop, it goes down until I have zero characters on the screen.
Note: If the my_string[0:x] syntax does not look familiar, check the Python Data Structures article –
and the “How to access multiple elements of a Python list?” section.
Conclusion
Python for loops are important, and they are used widely in data scripts. The syntax is simple, but as you have seen, fully understanding the logic behind it requires a little bit of brainwork. By reading this article you got through it, and now you have a solid foundation to build on. So all that's left to do is practice!
Python For Loops and If Statements Combined (Python for
Data Science Basics #6)
Written by Tomi Mester on April 11, 2018
Last updated on August 03, 2019
Last time I wrote about Python For Loops and If Statements. Today we will talk about how to
combine them. In this article, I’ll show you – through a few practical examples – how to combine
a for loop with another for loop and/or with an if statement!
Note: This is a hands-on tutorial. I highly recommend doing the coding part with me – and if you have
time, solving the exercises at the end of the article! If you haven’t done so yet, please work through
these articles first:
How to install Python, R, SQL and bash to practice data science!
Python for Data Science #1 – Tutorial for Beginners – Python Basics
Python for Data Science #2 – Data Structures
Python for Data Science #3 – Functions and methods
Python for Data Science #4 – If statements
Python for Data Science #5 – For loops
Note 2: On mobile the line breaks of the code snippets might look tricky. But if you copy-paste them
into your Jupyter Notebook, you will see the actual line breaks much clearer!
For loop within a for loop – aka the nested for loop
The more complicated the data project you are working on, the higher the chance that you will
bump into a situation where you have to use a nested for loop. This means that you will run an
iteration, then another iteration inside that iteration.
Let’s say you have nine TV show titles put into three categories: comedies, cartoons, dramas.
These are presented in a nested Python list (“lists in a list”):
my_movies = [['How I Met Your Mother', 'Friends', 'Silicon Valley'],
['Family Guy', 'South Park', 'Rick and Morty'],
['Breaking Bad', 'Game of Thrones', 'The Wire']]
You want to count the characters in all these titles and print the results one by one to your
screen, in this format:
"The title [movie_title] is [X] characters long."
How would you do that? Since you have three lists in your main list, to get the movie titles, you
have to iterate through your my_movies list — and inside that list, through every sublist, too:
for sublist in my_movies:
    for movie_name in sublist:
        char_num = len(movie_name)
        print("The title " + movie_name + " is " + str(char_num) + " characters long.")
Note: remember len() is a Python function that results in an integer. To put this integer into a
“printable” sentence, we have to turn it into a string first. I wrote about this in the previous Python For
Loops tutorial.
I know, Python for loops can be difficult to understand for the first time… Nested for loops are
even more difficult. If you have trouble understanding what exactly is happening above, get a
pen and a paper and try to simulate the whole script as if you were the computer — go through
your loop step by step and write down the results.
One more thing:
Syntax! The rules are the same ones you learned when we discussed simple for loops – the only thing that I'd like to emphasize, and that you should definitely watch out for, is the indentation. Using proper indentation is the only way to let Python know in which for loop (the inner or the outer) you want your block of code to run. Just test it out and try to find the differences between the three variants below:
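As a sketch, using the my_movies list from above – the only thing that changes is the indentation of the print() line:

# Example 1 – print() inside the inner loop: runs once per title
for sublist in my_movies:
    for movie_name in sublist:
        char_num = len(movie_name)
        print("The title " + movie_name + " is " + str(char_num) + " characters long.")

# Example 2 – print() in the outer loop: runs once per sublist,
# using only the last title of that sublist
for sublist in my_movies:
    for movie_name in sublist:
        char_num = len(movie_name)
    print("The title " + movie_name + " is " + str(char_num) + " characters long.")

# Example 3 – print() outside both loops: runs only once, after everything,
# using the very last title
for sublist in my_movies:
    for movie_name in sublist:
        char_num = len(movie_name)
print("The title " + movie_name + " is " + str(char_num) + " characters long.")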
If statement within a for loop
Inside a for loop, you can use if statements as well.
Let me use one of the most well-known examples of the exercises that you might be given as the
opening question in a junior data scientist job interview.
The task is:
Go through all the numbers up until 99. Print ‘fizz’ for every number that’s divisible by 3, print
‘buzz’ for every number divisible by 5, and print ‘fizzbuzz’ for every number divisible by 3 and by
5! If the number is not divisible either by 3 or 5, print a dash (‘-‘)!
Here’s the solution!
for i in range(100):
    if i % 3 == 0 and i % 5 == 0:
        print('fizzbuzz')
    elif i % 3 == 0:
        print('fizz')
    elif i % 5 == 0:
        print('buzz')
    else:
        print('-')
As you can see, an if statement within a for loop is perfect to evaluate a list of numbers in a
range (or elements in a list) and put them into different buckets, tag them, or apply functions on
them – or just simply print them.
Again: when you use an if statement within a for loop, be extremely careful with the indentation, because if you misplace it, you can get errors or incorrect results!
Break
There is a special control flow tool in Python that comes in handy pretty often when using if
statements within for loops. And this is the break statement.
Can you find the first 7-digit number that’s divisible by 137? (The first one and only the first one.)
Here’s one solution:
for i in range(0, 10000000, 137):
    if len(str(i)) == 7:
        print(i)
        break
This loop takes every 137th number (for i in range(0, 10000000, 137)) and checks during each iteration whether the number has 7 digits or not (if len(str(i)) == 7). Once it gets to the first 7-digit number, the if statement becomes True and two things happen:
1. print(i) –» The number is printed to the screen.
2. break breaks out of the for loop, so we can make sure that the first 7-digit number was
also the last 7-digit number that was printed on the screen.
Learn more about the break statement (and its twin brother: the continue statement) in the
original Python3 documentation: here.
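A quick sketch of the twin brother, too: continue only skips the rest of the current iteration instead of ending the whole loop.

for i in range(5):
    if i == 2:
        continue
    print(i)   # prints 0, 1, 3, 4 – the 2 is skipped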
Note: you can solve this task more elegantly with a while loop. However, I haven’t written a while loop
tutorial yet, which is why I went with the for loop + break solution!
Test Yourself!
It’s time to test whether you have managed to master the if statement, the for loops and the
combination of these two! Let’s try to solve this small test assignment!
Create a Python script that finds out your age in a maximum of 8 tries! The script can ask you
only one type of question: guessing your age! (e.g. “Are you 67 years old?”) And you can answer
only one of these three options:
less
more
correct
Based on your answer the computer can come up with another guess until it finds out your
exact age.
Note: to solve this task, you will have to learn a new function, too. That’s the input() function! More
info: here.
Ready? 3. 2. 1. Go!
Solution
Here’s my code.
Note 1: One can solve the task with a while loop, too. Again: since I haven’t written about while loops
yet, I’ll show you the for loop solution.
Note 2: If you have an alternative solution, please do not hesitate to share it with me and the other
readers in the comment section below!
down = 0
up = 100
for i in range(1, 10):
    guessed_age = int((up + down) / 2)
    answer = input('Are you ' + str(guessed_age) + ' years old?')
    if answer == 'correct':
        print('Nice')
        break
    elif answer == 'less':
        up = guessed_age
    elif answer == 'more':
        down = guessed_age
    else:
        print('wrong answer')
My logic goes:
STEP 1) I set a range between 0 and 100 and I assume that the age of the “player” will be
between these two values.
down = 0
up = 100
STEP 2) The script always asks the middle value of this range (for the first try it’s 50):
guessed_age = int((up + down) / 2)
answer = input('Are you ' + str(guessed_age) + " years old?")
STEP 3) Once we have the "player's" answer, there are four possible scenarios:

If the guessed age is correct, then the script prints a short answer and ends:
if answer == 'correct':
    print('Nice')
    break

If the answer is "less", then we start the iteration over – but before that, we set the maximum value of the age range to the guessed age. (So in the second iteration the script will guess the middle value of 0 and 50.)
elif answer == 'less':
    up = guessed_age

We do the same for the "more" answer – except that in this case we change the minimum (and not the maximum) value:
elif answer == 'more':
    down = guessed_age

And eventually we handle the wrong answers and the typos:
else:
    print('wrong answer')
Did you find a better solution?
Share it with me in the comment section below!
Conclusion
Now you’ve got the idea of:
Python nested for loops and
for loops and if statements combined.
They are not necessarily considered to be Python basics; this is more like a transition to the
intermediate level. Using them requires a solid understanding of Python3’s logic – and a lot of
practicing, too.
There are only two episodes left from the Python for Data Science Basics tutorial series! Keep it
going and continue with the Python syntax essentials!
If you want to learn more about how to become a data scientist, take my 50-minute
video course: How to Become a Data Scientist. (It’s free!)
Also check out my 6-week online course: The Junior Data Scientist’s First Month video
course.
Cheers,
Tomi
Python Syntax Essentials and Best Practices
Written by Tomi Mester on April 24, 2018
Last updated on August 03, 2019
In my Python workshops and online courses I see that one of the trickiest things for newcomers
is the syntax itself. It’s very strict and many things might seem inconsistent at first. In this article
I’ve collected the Python syntax essentials you should keep in mind as a data professional — and
I added some formatting best practices as well, to help you keep your code nice and clean.
These are the basics. If you want to go deep down the rabbit hole, I’ll link to some advanced
Python syntax and formatting tutorials at the end of this article!
This article is the part of my Python for Data Science article series. If you haven’t done so yet, please
start with these articles first:
How to install Python, R, SQL and bash to practice data science!
Python for Data Science #1 – Tutorial for Beginners – Python Basics
The 3 major things to keep in mind about Python syntax
#1 Line Breaks Matter
Unlike in SQL, in Python line breaks matter. This means that in 99% of cases, if you put a line break where you shouldn't, you will get an error message. Is it weird? Hey, at least you don't have to add semicolons at the end of every line.
So here's Python syntax rule #1: one statement per line.
There are some exceptions, though. Expressions
in parentheses (e.g. functions and methods),
in brackets (e.g. lists),
and in curly braces (e.g. dictionaries)
can actually be split into more lines. This is called implicit line joining, and it is a great help when working with bigger data structures.
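For example, this list definition is valid even though it spans three lines:

my_books = ['A Game of Thrones',
            'Digital Fortress',
            'Practical Statistics for Data Scientists']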
Explicit line joining – putting a backslash (\) at the end of a line – works too, but it's less common, and I'd recommend using it only when necessary (e.g. with really long, 80+ character statements).
#2 Indentation Matters
As you have seen with the if statements and for loops in this series, in Python blocks of code are defined by indentation. So here's Python syntax rule #2: indentation matters.
One more thing: if you watch the Silicon Valley TV show, you might have heard about the "tabs vs spaces" debate. Here's the hilarious scene:
So tabs or spaces? Here's what the original Style Guide for Python Code (PEP 8) says: "Spaces are the preferred indentation method. Tabs should be used solely to remain consistent with code that is already indented with tabs."
Pretty straightforward!
P.S. To be honest, in Jupyter Notebook, I use tabs.
#3 Case Sensitivity
Python is case sensitive. It makes a difference whether you type and (correct) or AND (won’t
work). As a rule of thumb, learn that most of the Python keywords have to be written with
lowercase letters. The most commonly used exceptions I have to mention here (because I see
many beginners have trouble with it) are the Boolean values. These are correctly spelled
as: True and False. (Not TRUE, nor true.)
There’s Python syntax rule #3: Python is case sensitive.
Other Python Best Practices for Nicer Formatting
Let me just list a few (non-mandatory but highly recommended) Python best practices that will
make your code much nicer, more readable and more reusable.
Python Best Practice #1: Use Comments
You can add comments to your Python code. Simply use the # character. Everything that comes
after the # won’t be executed.
# This is a comment before my for loop.
for i in range(0, 100, 2):
    print(i)
# use comments!
Python Best Practice #2: Variable Names
Conventionally, variable names should be written with lowercase letters, with the words separated by _ characters. Also, I generally do not recommend one-letter variable names in your code. Using meaningful and easy-to-distinguish variable names helps other programmers a lot when they want to understand your code.
my_meaningful_variable = 100
Python Best Practice #3: Use blank lines
If you want to separate code blocks visually (e.g. when you have a 100 line Python script in
which you have 10-12 blocks that belong together) you can use blank lines. Even multiple blank
lines. It won’t affect the result of your script.
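A quick sketch of the idea:

# block 1: set up the data
numbers = [1, 5, 12, 91, 102]


# block 2: process it
for i in numbers:
    print(i * i)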
Python Best Practice #4: Use white spaces around operators and assignments
For cleaner code it’s worth using spaces around your = signs and your mathematical and
comparison operators (>, <, +, -, etc.). If you don’t use white spaces, your code will run anyway,
but again: the cleaner the code, the easier to read it, the easier to reuse it.
number_x = 10
number_y = 100
number_mult = number_x * number_y
Python Best Practice #5: Max line length should be 79 characters
If you reach 79 characters in a line, it's recommended to break your code into more lines. One option is the above-mentioned \ character: if you put a \ at the end of a line, Python ignores the line break and reads your code as if it were one line. (Or, in some cases, you can take advantage of implicit line joining.)
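A sketch of both options:

# explicit line joining with a backslash
total = 1 + 2 + 3 + \
        4 + 5 + 6

# implicit line joining inside parentheses
total = (1 + 2 + 3 +
         4 + 5 + 6)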
Bonus: the Zen of Python. If you type import this into your Python environment, you get back a little poem that sums up many of these principles:
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
Python Import Statement and the Most Important Built-in
Modules for Data Scientists
Written by Tomi Mester on May 13, 2018
Last updated on February 01, 2020
So far we have worked with the most essential concepts of Python: variables, data structures,
built-in functions and methods, for loops, and if statements. These are all parts of the core
semantics of the language. But this is far from everything that Python knows… actually this is
just the very beginning and the exciting stuff is yet to come. Because Python also has tons of
modules and packages that we can import into our projects… What does this mean? In this
article I’ll give you an intro: I’ll show you the Python import statement and the most important
built-in modules that you have to know for data science!
Before we start
If you haven’t done so yet, I recommend going through these articles first:
1. How to install Python, R, SQL and bash to practice data science!
2. Python for Data Science – Basics #1 – Variables and basic operations
3. Python for Data Science – Basics #2 – Python Data Structures
4. Python for Data Science – Basics #3 – Python Built-in Functions
The Python Import Statement
Okay, so what is the import statement and why is it so important?
Think of it like LEGO:
Python core semantics
So far we have played around with the base elements of our LEGO playset. But if you want to
build something complex, you have to use more advanced tools.
If you use import, you can get access to the advanced Python “tools” (they are called modules).
New tools to access via the Python import statement
These are divided into three groups:
1. The modules of the Python Standard Library:
You can get these really easily because they come with Python3 by default. You simply
have to type import and the name of the module – and from that point on you can use
the given module in your code. In this article, I’ll show you exactly how to do that in
detail.
2. Other, even more advanced and more specialized modules:
There are modules that are not part of the standard library. For these, you have to install
new packages to your data server first. You will see that for data science we are using
many of these “external” packages. (The ones you might have heard about are pandas,
numpy, matplotlib, scikit-learn, etc.) I’ll get back to this topic in another article.
3. Your own modules:
Yes, you can write new modules by yourself, too! (I’ll cover this in my advanced Python
tutorials.)
Anyway, import is a really powerful concept in Python – because with that you’ll be able to
expand your toolset continuously and almost infinitely when you are dealing with different data
science challenges.
The most important Python Built-in Modules for Data Scientists
Okay, now that you get the concept, it’s time to see it in practice. As I have mentioned, there is a
Python Standard Library with dozens of built-in modules. From those, I hand-picked the five
most important modules for data analysts and scientists. These are:
random
statistics
math
datetime
csv
You can easily import any of them by using this syntax:
import [module_name]
eg. import random
Note: This will import the entire module with all items in it. You can import only a part of the module,
too: from [module_name] import [item_name]. But let’s not complicate things with that yet.
Let’s see the five built-in modules one by one!
Python Built-in Module #1: random
Randomization is very important in data science… just think about experimenting and A/B
testing! If you import the random module, you can generate random numbers by various rules.
Let’s type this to your Jupyter Notebook first:
import random
Then in a separate cell try out:
random.random()
This will generate a random float between 0 and 1.
Try this one, too:
random.randint(1,10)
This will generate a random integer between 1 and 10.
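Python Built-in Module #2: statistics
If you import the statistics module, you get access to the basic descriptive statistics functions – a quick, illustrative sketch:

import statistics

a = [3, 2, 1, 5, 4]
statistics.mean(a)     # 3
statistics.median(a)   # 3
statistics.stdev(a)    # the sample standard deviation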
Learn more about the statistics module here.
Python Built-in Module #3: math
There are a few functions that are under the umbrella of math rather than statistics. So there is a
separate module for that. This contains factorial, power, and logarithmic functions, but also some
trigonometry and constants.
Try this:
import math
And then:
math.factorial(5)
math.pi
math.sqrt(5)
math.log(256, 2)
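(The expected results, in order: 120, 3.141592653589793, 2.23606797749979 and 8.0.)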
Learn more about the math module here.
Python Built-in Module #4: datetime
Do you plan to work for an online startup? Then you will probably encounter a lot of data logs. And the heart of a data log is the datetime. Python 3, by default, does not handle dates and times, but if you import the datetime module, you get access to these functions, too.
import datetime
To be honest, I think the implementation of the datetime module of Python is a bit over-
complicated… at least, it’s not easy to use for beginners. I’ll write a separate article about it later.
But for now let’s try these two functions to get a bit more familiar with it:
datetime.datetime.now()
datetime.datetime.now().strftime("%F")
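(The first returns a datetime object holding the current date and time; the second formats it as a string – on most platforms %F is shorthand for %Y-%m-%d, so you get something like '2020-02-01'.)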
If you want to open a .csv file in your Jupyter Notebook, you first need this module:
import csv
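Then a sketch of reading a file – the file name here is hypothetical, and csv.reader with a delimiter argument is the same call the Syntax section below mentions:

with open('my_file.csv') as csvfile:
    my_reader = csv.reader(csvfile, delimiter=';')
    for row in my_reader:
        print(row)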
As you can see, it returned Python lists. So with the list selection features and with the list
methods that we have learned previously, you can also break down and restructure this data.
Learn more about the csv module: here.
More built-in modules
This is a good start but far from the whole list of the Python built-in modules. With other
modules you can zip and unzip files, scrape websites, send emails, encode and decode JSON files
and do a lot of other exciting things. If you want to take a look at the whole list, check out
the Python Standard Library which is part of the original Python documentation.
And, as I mentioned, there are other Python libraries and packages that are not part of the
standard library (like pandas, numpy, scipy, etc.) – I’ll write more about them soon!
Syntax
Now that you have seen how import works, let’s talk briefly about the syntax!
Three things:
1. Usually, in Python scripts, we put all the import statements at the beginning of our script.
Why is that? To see what modules our script relies on. Also, to make sure that the
modules will be imported before we need to apply them. So keep this advice in
mind: import statements come at the beginning of your Python scripts.
2. In this article, we applied the functions of the modules using this syntax: module_name.function_name(parameters)
E.g. statistics.median(a)
or
csv.reader(csvfile, delimiter=';')
This is logical: before you apply a given function, you have to tell Python in which module to find it. In some cases there are even more complicated relationships – like functions of classes in a module (e.g. datetime.datetime.now()) – but let's not confuse ourselves with that for now. My suggestion is to make a list of your favorite modules and functions and learn how they work; if you need a new one, check out the original Python documentation and add the new module plus its function to your list.
3. When you import a module (or a package) you can rename it using the as keyword. If you type:
import statistics as stat
then you have to refer to your module as stat – e.g. stat.median(a) and not statistics.median(a).
Conventionally, the two best-known data-science-related Python libraries are imported with shortened names: numpy (import numpy as np) and pandas (import pandas as pd). I'll get back to this in another article!
So what’s the name of it? Package? Module? Function? Library?
When I first encountered this import concept, I had a hard time understanding what exactly I was importing. In some cases these things were referred to as "modules", in some cases as "packages", in other cases as "functions", and sometimes as "libraries".
Note: even the documentation of numpy and pandas – the two most popular data-science-related Python projects – disagrees: one calls itself a library, the other a package. The truth is that the naming is used loosely. Strictly speaking, a module is a single Python file, a package is a collection of modules, and "library" is an informal umbrella term for both.
Conclusion
Whatever we call them, modules and packages open thousands of new doors. In my upcoming articles I'll introduce the most important Python libraries and packages you have to know as a data scientist!
If you want to learn more about how to become a data scientist, take my 50-minute
video course: How to Become a Data Scientist. (It’s free!)
Also check out my 6-week online course: The Junior Data Scientist’s First Month video
course.
Cheers,
Tomi Mester
Python libraries and packages for Data Scientists (the 5 most
important ones)
Written by Tomi Mester on June 26, 2018
Last updated on July 22, 2020
Did you know that Python wasn’t originally built for Data Science? And yet today it’s one of the
best languages for statistics, machine learning, and predictive analytics as well as simple data
analytics tasks. How come? It’s an open-source language, and data professionals started creating
tools for it to complete data tasks more efficiently. Here, I’ll introduce the most important
Python libraries and packages that you have to know as a Data Scientist.
In my previous article, I introduced the Python import statement and the most important
modules from the Python Standard Library. In this one, I’ll focus on the libraries and packages
that are not coming with Python 3 by default. At the end of the article, I’ll also show you how to
get (download, install and import) them.
Before we start
If you haven’t done so yet, I recommend going through these articles first:
1. How to install Python, R, SQL and bash to practice data science
2. Python for Data Science – Basics #1 – Variables and basic operations
3. Python for Data Science – Basics #2 – Python Data Structures
4. Python for Data Science – Basics #3 – Python Built-in Functions
5. Python Import Statement and the Most Important Built-in Modules
Top 5 most important Python libraries and packages for Data Science
Numpy
Pandas
Matplotlib
Scikit-Learn
Scipy
These are the five most essential Data Science libraries you have to know.
Let’s see them one by one!
Numpy
Numpy will help you to manage multi-dimensional arrays very efficiently. Maybe you won’t do
that directly, but since the concept is a crucial part of data science, many other libraries (well,
almost all of them) are built on Numpy. Simply put: without Numpy you won’t be able to use
Pandas, Matplotlib, Scipy or Scikit-Learn. That’s why you need it first.
3-dimensional numpy array
Beyond that, it also has a few well-implemented methods of its own. I quite often use Numpy’s random function, which I find slightly better than the random module of the standard library.
And when it comes to simple predictive analytics tasks like linear or polynomial regression,
Numpy’s polyfit function is my favorite. (More about that in another article.)
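To give you a (hedged) taste of both, here’s a tiny sketch – every number in it is made up for illustration:
import numpy as np

np.random.seed(42)
print(np.random.normal(175, 6, 5)) # 5 random draws from a normal distribution

x = np.array([0, 1, 2, 3, 4])
y = np.array([5, 7, 9, 11, 13])    # these points lie exactly on y = 2*x + 5
print(np.polyfit(x, y, 1))         # fits a 1st-degree polynomial; returns ~[2. 5.]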
prediction with numpy’s polyfit
Pandas
To analyze data, we like to use two-dimensional tables – like in SQL and in Excel. Originally,
Python didn’t have this feature. Weird, isn’t it? But that’s why Pandas is so important! I like to
say, Pandas is the “SQL of Python.” (Eh, I can’t wait to see what I will get for this sentence in the
comment section… ;-)) Okay, to be more precise: Pandas is the library that will help us to handle
two-dimensional data tables in Python. In many senses it’s really similar to SQL, though.
a pandas dataframe
With pandas, you can load your data into data frames, you can select columns, filter for specific
values, group by values, run functions (sum, mean, median, min, max, etc.), merge dataframes and
so on. You can also create multi-dimensional data-tables.
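Just to give you a taste of that workflow, here’s a minimal sketch – the animals and numbers are demo values only:
import pandas as pd

df = pd.DataFrame({'animal': ['zebra', 'lion', 'elephant'],
                   'water_need': [100, 350, 670]})
print(df[df.water_need > 200]) # filtering for specific values
print(df.water_need.mean())    # running a function (mean) on a column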
That’s a common misunderstanding, so let me clarify: Pandas is not a predictive analytics or
machine learning library. It was created for data analysis, data cleaning, data handling and data
discovery… By the way, these are the necessary steps before you run machine learning projects,
and that’s why you will need pandas for every scientific project, too.
If you are starting with Python for Data Science and have learned the basics of Python, I recommend focusing on Pandas next. This short article series of mine will help you: Pandas for Data Scientists.
Matplotlib
I hope I don’t have to detail why data visualization is important. Data visualization helps you to
better understand your data, discover things that you wouldn’t discover in raw format and
communicate your findings more efficiently to others.
The best and most well-known Python data visualization library is Matplotlib. I wouldn’t say it’s
easy to use… But usually if you save for yourself the 4 or 5 most commonly used code blocks for
basic line charts and scatter plots, you can create your charts pretty fast.
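For a flavor, here’s one of those reusable blocks – a minimal sketch with made-up data points:
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
plt.plot(x, y)    # a basic line chart
plt.scatter(x, y) # the same points as a scatter plot, on the same axes
plt.show()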
matplotlib dataviz example
Here’s another article that introduces Matplotlib more in-depth: How to use matplotlib.
Scikit-Learn
Without any doubt the fanciest things in Python are Machine Learning and Predictive Analytics.
And the best library for that is Scikit-Learn, which simply defines itself as “Machine Learning in
Python.” Scikit-Learn has several methods, basically covering everything you might need in the
first few years of your data career: regression methods, classification methods, and clustering, as
well as model validation and model selection. You can also use it for dimensionality reduction
and feature extraction.
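Just to show the flavor of its API, here’s a minimal regression sketch with made-up numbers:
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4]])   # one feature, four samples
y = np.array([2, 4, 6, 8])           # target values
model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_) # slope ~2.0, intercept ~0.0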
(Get started with my machine learning tutorials here: Linear Regression in Python using sklearn
and numpy!)
source: pythonprogramming.net
Scipy
This is kind of confusing, but there is a Scipy library and there is a Scipy stack. Most of the
libraries and packages I wrote about in this article are part of the Scipy stack (that is for scientific
computing in Python). And one of these components is the Scipy library itself, which provides
efficient solutions for numerical routines (the math stuff behind machine learning models). These
are: integration, interpolation, optimization, etc.
Just like Numpy, you most probably won’t use Scipy itself, but the above-mentioned Scikit-Learn
library highly relies on it. Scipy provides the core mathematical methods to do the complex
machine learning processes in Scikit-learn. That’s why you have to know it.
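If you are curious what those numerical routines look like in practice, here’s a one-off integration sketch:
from scipy import integrate

# numerically integrate f(x) = x**2 from 0 to 1 (the exact answer is 1/3)
result, error = integrate.quad(lambda x: x**2, 0, 1)
print(result) # ~0.3333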
More Python libraries and packages for data science…
What about image processing, natural language processing, deep learning, neural nets, etc.? Of
course, there are numerous very cool Python libraries and packages for these, too. In this article,
I won’t cover them because I think, for a start, it’s worth taking the time to get familiar with the above-mentioned five libraries. Once you get fluent with them – then and only then – you can go ahead and expand your horizons with more specific data science libraries.
How to get Pandas, Numpy, Matplotlib, Scikit-Learn and Scipy?
First of all, you have to set up a basic data server by following my original How to install Python,
R, SQL and bash to practice data science article. Once you have that, you can install these tools
additionally, one by one. Just follow these steps:
1. Login to your data server!
2. Install numpy using this command:
sudo -H pip3 install numpy
3. Install pandas using this command:
sudo apt-get install python3-pandas
4. Upgrade some additional tools of pandas using these two commands:
sudo -H pip3 install --upgrade beautifulsoup4
sudo -H pip3 install --upgrade html5lib
5. Upgrade Scipy:
sudo -H pip3 install --upgrade scipy
6. Install scikit-learn using this command:
sudo -H pip3 install scikit-learn
Once you have them installed, import them (or specific modules of them) into your Jupyter
notebook by using the right import statements. For instance:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.linear_model import LinearRegression
After this you can even test pandas and matplotlib together by running these few lines:
df = pd.DataFrame({'a':[1,2,3,4,5,6,7],
'b':[1,4,9,16,25,36,49]})
df.plot()
If you need detailed, step-by-step guidance with this setup process, check out my Install Python,
R, SQL and bash – to practice Data Science and Coding! video course.
Conclusion
The five most essential Data Science libraries and packages are:
Numpy
Pandas
Matplotlib
Scikit-Learn
Scipy
Get them, learn them, use them and they will open a lot of new doors in your data science
career!
If you want to learn more about how to become a data scientist, take my 50-minute
video course: How to Become a Data Scientist. (It’s free!)
Also check out my 6-week online course: The Junior Data Scientist’s First Month video
course.
Cheers,
Tomi Mester
Pandas Tutorial 1: Pandas Basics (Reading Data Files,
DataFrames, Data Selection)
Written by Tomi Mester on July 10, 2018
Last updated on August 03, 2019
Pandas is one of the most popular Python libraries for Data Science and Analytics. I like to say
it’s the “SQL of Python.” Why? Because pandas helps you to manage two-dimensional data
tables in Python. Of course, it has many more features. In this pandas tutorial series, I’ll show you
the most important (that is, the most often used) things that you have to know as an Analyst or a
Data Scientist. This is the first episode and we will start from the basics!
Note 1: this is a hands-on tutorial, so I recommend doing the coding part with me!
Before we start
If you haven’t done so yet, I recommend going through these articles first:
How to install Python, R, SQL and bash to practice data science
Python for Data Science – Basics #1 – Variables and basic operations
Python Import Statement and the Most Important Built-in Modules
Top 5 Python Libraries and Packages for Data Scientists
To follow this pandas tutorial…
1. You will need a fully functioning data server with Python3, numpy and pandas on it.
Note 1 : Again, with this tutorial you can set up your data server and Python3. And with this
article you can set up numpy and pandas, too.
Note 2: or take this step-by-step data server set up video course.
2. Next step: log in to your server and fire up Jupyter. Then open a new Jupyter Notebook
in your favorite browser. (If you don’t know how to do that, I really do recommend going
through the articles I linked in the “Before we start” section.)
Note: I’ll also rename my Jupyter Notebook to “pandas_tutorial_1”.
Firing up Jupyter Notebook
3. Import numpy and pandas to your Jupyter Notebook by running these two lines in a cell:
import numpy as np
import pandas as pd
Note: It’s conventional to refer to ‘pandas’ as ‘pd’. When you add the as pd at the end of your import
statement, your Jupyter Notebook understands that from this point on every time you type pd, you
are actually referring to the pandas library.
Okay, now we have everything! Let’s start with this pandas tutorial!
The first question is:
How to open data files in pandas
You might have your data in .csv files or SQL tables. Maybe Excel files. Or .tsv files. Or
something else. But the goal is the same in all cases. If you want to analyze that data using
pandas, the first step will be to read it into a data structure that’s compatible with pandas.
Pandas data structures
There are two types of data structures in pandas: Series and DataFrames.
Series: a pandas Series is a one-dimensional data structure (“a one-dimensional ndarray”) that can store values – and for every value it holds a unique index, too.
DataFrame: a pandas DataFrame is a two-dimensional data structure – basically a table with rows and columns, in which every column is, in fact, a Series.
Start with a simple demo data set, called zoo! This time – for the sake of practicing – you will
create a .csv file for yourself! Here’s the raw data:
animal,uniq_id,water_need
elephant,1001,500
elephant,1002,600
elephant,1003,550
tiger,1004,300
tiger,1005,320
tiger,1006,330
tiger,1007,290
tiger,1008,310
zebra,1009,200
zebra,1010,220
zebra,1011,240
zebra,1012,230
zebra,1013,220
zebra,1014,100
zebra,1015,80
lion,1016,420
lion,1017,600
lion,1018,500
lion,1019,390
kangaroo,1020,410
kangaroo,1021,430
kangaroo,1022,410
Go back to your Jupyter Home tab and create a new text file…
…then copy-paste the above zoo data into this text file…
… and then rename this text file to zoo.csv!
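To bring it into pandas, read it with the read_csv function – the same command we’ll use throughout this series:
pd.read_csv('zoo.csv', delimiter=',')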
And there you go! This is the zoo.csv data file, brought to pandas. This nice 2D table? Well, this is
a pandas dataframe. The numbers on the left are the indexes. And the column names on the top
are picked up from the first row of our zoo.csv file.
To be honest, though, you will probably never create a .csv data file for yourself, like we just
did… you will use pre-existing data files. So you have to learn how to download .csv files to your
server!
If you are here from the Junior Data Scientist’s First Month video course then you have already
dealt with downloading your .txt or .csv data files to your data server, so you must be pretty
proficient in it… But if you are not here from the course (or if you want to learn another way to
download a .csv file to your server and to get another exciting dataset), follow these steps:
I’ve uploaded a small sample dataset here: DATASET
(Link: 46.101.230.157/dilan/pandas_tutorial_read.csv)
If you click the link, the data file will be downloaded to your computer. But you don’t want to
download this data file to your computer, right? You want to download it to your server and then
load it to your Jupyter Notebook. It only takes two steps.
STEP 1) Go back to your Jupyter Notebook and type this command:
!wget 46.101.230.157/dilan/pandas_tutorial_read.csv
This downloaded the pandas_tutorial_read.csv file to your server. Just check it out:
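STEP 2) Load it with the read_csv function! As a first try, something like this should do it (note that in this file the delimiter is a semicolon):
pd.read_csv('pandas_tutorial_read.csv', delimiter=';')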
Does something feel off? Yes, this time we didn’t have a header in our csv file, so we have to set
it up manually! Add the names parameter to your function!
pd.read_csv('pandas_tutorial_read.csv', delimiter=';', names = ['my_datetime', 'event', 'country',
'user_id', 'source', 'topic'])
Better!
And with that, we finally loaded our .csv data into a pandas dataframe!
Note 1: Just so you know, there is an alternative method. (I don’t prefer it though.) You can load the
.csv data using the URL directly. In this case the data won’t be downloaded to your data server.
read the .csv directly from the server (using its URL)
Note 2: If you are wondering what’s in this data set – this is the data log of a travel blog. This is a log
of one day only (if you are a JDS course participant, you will get much more of this data set on the last
week of the course ;-)). I guess the names of the columns are fairly self-explanatory.
Selecting data from a dataframe in pandas
This is the first episode of this pandas tutorial series, so let’s start with a few very basic data
selection methods – and in the next episodes we will go deeper!
1) Print the whole dataframe
The most basic method is to print your whole data frame to your screen. Of course, you don’t
have to run the pd.read_csv() function again and again and again. Just store its output the first
time you run it!
article_read = pd.read_csv('pandas_tutorial_read.csv', delimiter=';', names = ['my_datetime',
'event', 'country', 'user_id', 'source', 'topic'])
After that, you can call this article_read value anytime to print your DataFrame!
Sometimes, it’s handy not to print the whole dataframe and flood your screen with data. When a
few lines is enough, you can print only the first 5 lines – by typing:
article_read.head()
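2) Print selected columns only
To select a few specific columns, list their names between double bracket frames – for instance:
article_read[['country', 'user_id']]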
Any guesses why we have to use double bracket frames? It seems a bit over-complicated, I
admit, but maybe this will help you remember: the outer bracket frames tell pandas that you
want to select columns, and the inner brackets are for the list (remember? Python lists go between
bracket frames) of the column names.
By the way, if you change the order of the column names, the order of the returned columns will
change, too:
article_read[['user_id', 'country']]
By the way, if you want to select one column only, there are two simpler syntaxes for that, too. Both of these return the user_id column:
article_read.user_id
article_read['user_id']
How about filtering for rows? The logic comes in two steps:
STEP 1) pandas evaluates a condition – eg. article_read.country == 'country_2' – and returns a True or False value for every row of the table.
STEP 2) Then from the article_read table, it prints every row where this value is True and doesn’t print any row where it’s False.
Does it look over-complicated? Maybe. But this is the way it is, so let’s just learn it, because you will use it all the time!
You can also chain these selection methods. For instance, you can filter for the country_2 rows and keep three columns only:
ar_filtered_cols = article_read[article_read.country == 'country_2'][['user_id', 'topic', 'country']]
ar_filtered_cols.head()
Either way, the logic is the same. First you take your original dataframe (article_read), then you
filter for the rows where the country value is country_2 ([article_read.country == 'country_2']),
then you take the three columns that were required ([['user_id','topic', 'country']]) and eventually
you take the first five rows only (.head()).
Conclusion
You are done with the first episode of my pandas tutorial series! Great job! In the next article,
you can learn more about the different aggregation methods (e.g. sum, mean, max, min) and
about grouping (so basically about segmentation). Stay with me: Pandas Tutorial, Episode 2!
If you want to learn more about how to become a data scientist, take my 50-minute
video course: How to Become a Data Scientist. (It’s free!)
Also check out my 6-week online course: The Junior Data Scientist’s First Month video
course.
Cheers,
Tomi Mester
Pandas Tutorial 2: Aggregation and Grouping
Written by Tomi Mester on July 23, 2018
Last updated on August 03, 2019
Let’s continue with the pandas tutorial series. This is the second episode, where I’ll introduce
aggregation (such as min, max, sum, count, etc.) and grouping. Both are very commonly used
methods in analytics and data science projects – so make sure you go through every detail in this
article!
Note 1: this is a hands-on tutorial, so I recommend doing the coding part with me!
Before we start
If you haven’t done so yet, I recommend going through these articles first:
1. How to install Python, R, SQL and bash to practice data science
2. Python for Data Science – Basics #1 – Variables and basic operations
3. Python Import Statement and the Most Important Built-in Modules
4. Top 5 Python Libraries and Packages for Data Scientists
5. Pandas Tutorial 1: Pandas Basics (Reading Data Files, DataFrames, Data Selection)
Data aggregation – in theory
Aggregation is the process of turning the values of a dataset (or a subset of it) into one single
value. Let me make this clear! If you have a DataFrame like…
animal water_need
zebra 100
lion 350
elephant 670
kangaroo 200
…then a simple aggregation method is to calculate the sum of the water_need values, which is
100 + 350 + 670 + 200 = 1320. Or a different aggregation method would be to count the
number of the animals, which is 4. So the theory is not too complicated. Let’s see the rest in
practice…
Data aggregation – in practice
Where did we leave off last time? We opened a Jupyter notebook, imported pandas and numpy
and loaded two datasets: zoo.csv and article_read. We will continue from here – so if you
haven’t done the “pandas tutorial – episode 1“, it’s time to go through it!
Okay!
Let’s start with our zoo dataset! (If you want to download it again, you can find it at this link.) We
have loaded it by using:
pd.read_csv('zoo.csv', delimiter = ',')
Pandas Data Aggregation #1: .count()
The first aggregation method is counting. Run:
zoo.count()
Oh, hey, what are all these lines? Actually, the .count() function counts the number of values in each column. In the case of the zoo dataset, there were 3 columns, and each of them had 22 values in it.
If you want to make your output clearer, you can select the animal column first by using one of
the selection operators from the previous article:
zoo[['animal']].count()
Or in this particular case, the result could be even nicer if you use this syntax:
zoo.animal.count()
This also selects only one column, but it turns our pandas dataframe object into a pandas series
object. (Which means that the output format is slightly different.)
Pandas Data Aggregation #2: .sum()
Following the same logic, you can easily sum the values in the water_need column:
zoo.water_need.sum()
Just out of curiosity, let’s run our sum function on all columns, as well:
zoo.sum()
Note: I love how .sum() turns the words of the animal column into one string of animal names. (By the
way, it’s very much in line with the logic of Python.)
Pandas Data Aggregation #3 and #4: .min() and .max()
What’s the smallest value in the water_need column? I bet you have figured it out already:
zoo.water_need.min()
And the max value is pretty similar:
zoo.water_need.max()
Pandas Data Aggregation #5 and #6: .mean() and .median()
And eventually, the statistical averages:
zoo.water_need.mean()
zoo.water_need.median()
Okay, this was easy. Much, much easier than the aggregation methods of SQL.
But let’s spice this up with a little bit of grouping!
Grouping in pandas
As a Data Analyst or Scientist you will probably do segmentations all the time. For instance, it’s
nice to know the mean water_need of all animals (we have just learned that it’s 347.72). But very
often it’s much more actionable to break this number down – let’s say – by animal types. With
that, we can compare the species to each other – or we can find outliers.
Here’s a simplified visual that shows how pandas performs “segmentation” (grouping and
aggregation) based on the column values!
Pandas .groupby in action
Let’s do the above presented grouping and aggregation for real, on our zoo DataFrame!
We have to fit in a groupby keyword between our zoo variable and our .mean() function:
zoo.groupby('animal').mean()
Just as before, pandas automatically runs the .mean() calculation for all remaining columns
(the animal column obviously disappeared, since that was the column we grouped by). You can
either ignore the uniq_id column, or you can remove it afterwards by using one of these
syntaxes:
zoo.groupby('animal').mean()[['water_need']] –» This returns a DataFrame object.
zoo.groupby('animal').mean().water_need –» This returns a Series object.
Obviously, you can change the aggregation method from .mean() to anything we learned above!
Okay! Now you know everything you have to know!
It’s time to…
Test yourself #1
Let’s get back to our article_read dataset.
(Note: Remember, this dataset holds the data of a travel blog. If you don’t have the data yet, you can
download it from here. Or you can go through the whole download, open, store process step by step
by reading the previous episode of this pandas tutorial.)
Your task: find out how many article reads came from each traffic source!
.
.
.
Here’s the solution:
article_read.groupby('source').count()
You can – optionally – remove the unnecessary columns and keep the user_id column only:
article_read.groupby('source').count()[['user_id']]
Test yourself #2
Here’s another, slightly more complex challenge:
For the users of country_2, what was the most frequent topic and source combination? Or in
other words: which topic, from which source, brought the most views from country_2?
.
.
.
The result is: the combination of Reddit (source) and Asia (topic), with 139 reads!
And the Python code to get this result is:
article_read[article_read.country == 'country_2'].groupby(['source', 'topic']).count()
The new element here is that you can group by multiple columns, too. Now you know that! (Syntax-wise, watch out for one thing: you have to put the names of the columns into a list. That’s why the bracket frames go between the parentheses – that was the groupby(['source', 'topic']) part.)
And as per usual: the count() function is the last piece of the puzzle.
Conclusion
This was the second episode of my pandas tutorial series. I hope now you see that aggregation and grouping are really easy and straightforward in pandas… and believe me, you will use them a
lot!
Note: If you have used SQL before, I encourage you to take a break and compare the pandas and the
SQL methods of aggregation. With that you will understand more about the key differences between
the two languages!
In the next article, I’ll show you the four most commonly used “data wrangling”
methods: merge, sort, reset_index and fillna. Stay with me: Pandas Tutorial, Episode 3!
If you want to learn more about how to become a data scientist, take my 50-minute
video course: How to Become a Data Scientist. (It’s free!)
Also check out my 6-week online course: The Junior Data Scientist’s First Month video
course.
Cheers,
Tomi Mester
Pandas Tutorial 3: Important Data Formatting Methods (merge, sort, reset_index, fillna)
Before we start
If you haven’t done so yet, I recommend going through these articles first:
1. How to install Python, R, SQL and bash to practice data science
2. Python for Data Science – Basics #1 – Variables and basic operations
3. Python Import Statement and the Most Important Built-in Modules
4. Top 5 Python Libraries and Packages for Data Scientists
5. Pandas Tutorial 1: Pandas Basics (Reading Data Files, DataFrames, Data Selection)
6. Pandas Tutorial 2: Aggregation and Grouping
Pandas Merge (a.k.a. “joining” dataframes)
In real life data projects, we usually don’t store all the data in one big data table. We store it in a
few smaller ones instead. There are many reasons behind this; by using multiple data tables, it’s
easier to manage your data, it’s easier to avoid redundancy, you can save some disk space, you
can query the smaller tables faster, etc.
The point is that it’s quite usual that during your analysis you have to pull your data from two or
more different tables. The solution for that is called merge.
Note: Although it’s called merge in pandas, it’s almost the same as SQL’s JOIN method.
Let me show you an example! Let’s take our zoo dataframe (from our previous tutorials) in which
we have all our animals… and let’s say that we have another dataframe, zoo_eats, that contains
information about the food requirements for each species.
We want to merge these two pandas dataframes into one big dataframe. Something like this:
In this table, it’s finally possible to analyze, for instance, how many animals in our zoo eat meat or
vegetables.
How did I do the merge?
First of all, you have the zoo dataframe already, but for this exercise you will have to create
a zoo_eats dataframe, too. For your convenience, here’s the raw data of the zoo_eats dataframe:
animal;food
elephant;vegetables
tiger;meat
kangaroo;vegetables
zebra;vegetables
giraffe;vegetables
If I were you, to put this into a proper pandas dataframe, I’d follow the process from the Pandas
Tutorial 1 article, but if you want to do this the lazy way, here’s a shortcut. Just copy-paste
this (really long) one line into the pandas_tutorial_1 Jupyter Notebook we made in the first
Pandas tutorial:
zoo_eats = pd.DataFrame([['elephant','vegetables'], ['tiger','meat'], ['kangaroo','vegetables'],
['zebra','vegetables'], ['giraffe','vegetables']], columns=['animal', 'food'])
And there is your zoo_eats dataframe!
Okay, now let’s see the pandas merge method:
zoo.merge(zoo_eats)
(Oh, hey, where are all the lions? We will get back to that soon, I promise!)
Bamm! Simple, right? Just in case, let’s see what’s happening here:
First, I specified the first dataframe (zoo), then I applied the .merge() pandas method on it and as
a parameter I specified the second dataframe (zoo_eats). I could have done this the other way
around:
zoo_eats.merge(zoo)
is symmetric to:
zoo.merge(zoo_eats)
The only difference between the two is the order of the columns in the output table. (Just try it!)
Pandas Merge… But how? Inner, outer, left or right?
As you can see, the basic merge method is pretty simple. Sometimes you have to add a few extra
parameters though.
One of the most important questions is how you want to merge these tables. In SQL, we learned
that there are different JOIN types.
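It’s similar in pandas: the how parameter of .merge() sets the merge type, and the default is inner. To keep every animal from both dataframes, you can run an outer merge:
zoo.merge(zoo_eats, how = 'outer')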
See? Lions came back, the giraffe came back… The only thing is that we have empty (NaN) values
in those columns where we didn’t get information from the other table.
In my opinion, in this specific case, it would make more sense to keep lions in the table but not
the giraffes… With that, we could see all the animals in our zoo and we would have three food
categories: vegetables, meat and NaN (which is basically “no information”). Keeping the giraffe
line would be misleading and irrelevant since we don’t have any giraffes in our zoo anyway.
That’s when merging with a how = 'left' parameter becomes handy!
Try this:
zoo.merge(zoo_eats, how = 'left')
Everything you do need, and nothing you don’t… The how = 'left' parameter brought all the
values from the left table (zoo) but brought only those values from the right table (zoo_eats) that
we have in the left one, too. Cool!
Let’s take a look at our merge types again:
Note: a common question I get is “What’s the “safest” way of merging? Should you go with inner,
outer, left or right, as a best practice?” My answer is: there is no categorical answer for this question.
While inner is the default merge type in pandas, whether you should go with that, or change to outer,
left or right, really depends on the task itself.
Pandas Merge. On which column?
For doing the merge, pandas needs the key-columns you want to base the merge on (in our case
it was the animal column in both tables). If you are not so lucky that pandas automatically
recognizes these key-columns, you have to help it by providing the column names. That’s what
the left_on and right_on parameters are for!
For example, our latest left merge could have looked like this, as well:
zoo.merge(zoo_eats, how = 'left', left_on = 'animal', right_on = 'animal')
Note: again, in the previous examples pandas automatically found the key-columns anyway… but
there are many cases when it doesn’t. So keep left_on and right_on in mind.
Okay, pandas merge was quite complex; the rest of the methods I’ll show you here will be much
easier.
Sorting in pandas
Sorting is essential. The basic sorting method is not too difficult in pandas. The function is
called sort_values() and it works like this:
zoo.sort_values('water_need')
Note: in the older version of pandas, there is a sort() function with a similar mechanism. But it has
been replaced with sort_values() in newer versions, so learn sort_values() and not sort().
The only parameter I used here was the name of the column I want to sort by, in this case
the water_need column. Quite often, you have to sort by multiple columns, so in general, I
recommend using the by keyword for the columns:
zoo.sort_values(by = ['animal', 'water_need'])
Note: you can use the by keyword with one column only, too, like zoo.sort_values(by =
['water_need']).
sort_values sorts in ascending order, but obviously, you can change this and do descending order
as well:
zoo.sort_values(by = ['water_need'], ascending = False)
Am I the only one who finds it funny that defining descending is possible only as ascending =
False? Whatever.
Reset_index
(This section is especially important for you if you participate in the Junior Data Scientist’s First Month
video course.)
What a mess with all the indexes after that last sorting, right?
It’s not just that it’s ugly… wrong indexing can mess up your visualizations (more about that in my
matplotlib tutorials) or even your machine learning models.
The point is: in certain cases, when you have done a transformation on your dataframe, you have
to re-index the rows. For that, you can use the reset_index() method. For instance:
zoo.sort_values(by = ['water_need'], ascending = False).reset_index()
Nicer? For sure!
As you can see, our new dataframe kept the old indexes, too. If you want to remove them, just
add the drop = True parameter:
zoo.sort_values(by = ['water_need'], ascending = False).reset_index(drop = True)
Fillna
(Note: fillna is basically fill + na in one word. If you ask me, it’s not the smartest name, but this is
what we have.)
Let’s rerun the left-merge method that we have used above:
zoo.merge(zoo_eats, how = 'left')
Remember? These are all our animals. The problem is that we have NaN values for
lions. NaN itself can be really distracting, so I usually like to replace it with something more
meaningful. In some cases, this can be a 0 value, or in other cases a specific string value, but this
time, I’ll go with unknown. Let’s use the fillna() function, which basically finds and replaces
all NaN values in our dataframe:
zoo.merge(zoo_eats, how = 'left').fillna('unknown')
Note: since we know that lions eat meat, we could have written zoo.merge(zoo_eats, how =
'left').fillna('meat'), as well.
Test yourself
Okay, you’ve gotten through the article! Great job!
Here’s your final test task!
Let’s get back to our article_read dataset.
(Note: Remember, this dataset holds the data of a travel blog. If you don’t have the data yet, you can
download it from here. Or you can go through the whole download, open, store process step by step
by reading the first episode of this pandas tutorial.)
Download another dataset, too: blog_buy. You can do that by running these two lines in your
Jupyter Notebook:
!wget 46.101.230.157/dilan/pandas_tutorial_buy.csv
blog_buy = pd.read_csv('pandas_tutorial_buy.csv', delimiter=';', names = ['my_date_time', 'event',
'user_id', 'amount'])
The article_read dataset shows all the users who read an article on the blog, and
the blog_buy dataset shows all the users who bought something on the very same blog
between 2018-01-01 and 2018-01-07.
I have two questions for you:
TASK #1: What’s the average (mean) revenue between 2018-01-01 and 2018-01-
07 from the users in the article_read dataframe?
TASK #2: Print the top 3 countries by total revenue between 2018-01-01 and 2018-01-
07! (Obviously, this concerns the users in the article_read dataframe again.)
SOLUTION for TASK #1
The average revenue is: 1.0852
Here’s the code:
step_1 = article_read.merge(blog_buy, how = 'left', left_on = 'user_id', right_on = 'user_id')
step_2 = step_1.amount
step_3 = step_2.fillna(0)
result = step_3.mean()
result
Note: for ease of understanding, I broke this down into “steps” – but you could also bring all these
functions into one line.
A short explanation:
(On the screenshot, at the beginning, I included the two extra cells where I import pandas and
numpy, and where I read the csv files into my Jupyter Notebook.)
In step_1, I merged the two tables (article_read and blog_buy) based on
the user_id columns. I kept all the readers from article_read, even if they didn’t buy
anything, because 0s should be counted into the average revenue value. And I removed
everyone who bought something but wasn’t in the article_read dataset (that was fixed in
the task). So all in all that led to a left join.
In step_2, I removed all the unnecessary columns, and kept only amount.
In step_3, I replaced NaN values with 0s.
And eventually I did the .mean() calculation.
SOLUTION for TASK #2
The code is:
step_1 = article_read.merge(blog_buy, how = 'left', left_on = 'user_id', right_on = 'user_id')
step_2 = step_1.fillna(0)
step_3 = step_2.groupby('country').sum()
step_4 = step_3.amount
step_5 = step_4.sort_values(ascending = False)
step_5.head(3)
Also check out my 6-week online course: The Junior Data Scientist’s First Month video
course.
Cheers,
Tomi Mester
How to Plot a Histogram in Python (Using Pandas)
Written by Tomi Mester on May 15, 2020
Last updated on October 27, 2020
Plotting a histogram in Python is easier than you’d think! And in this article, I’ll show you how.
I have a strong opinion about visualization in Python, which is: it should be useful and not pretty.
Why? Because the fancy data visualization for high-stakes presentations should happen in tools
that are the best for it: Tableau, Google Data Studio, PowerBI, etc… Creating charts and graphs
natively in Python should serve only one purpose: to make your data science tasks (e.g.
prototyping machine learning models) easier and more intuitive.
So in this tutorial, I’ll focus on how to plot a histogram in Python that’s:
fast
easy
useful
and yeah… probably not the most beautiful (but not ugly, either).
The tool we will use for that is a function in our favorite Python data analytics library
— pandas — and it’s called .hist()… But more about that in the article!
Download the code base!
Find the whole code base for this article (in Jupyter Notebook format) here:
Plot histograms in Python (GitHub link)
Download it from: here.
Before we get started…
In this article, I assume that you have some basic Python and pandas knowledge.
If you don’t, I recommend starting with these articles:
1. Python libraries and packages for Data Scientists
2. Learn Python from Scratch
3. Pandas Tutorial 1 (Basics)
4. Pandas Tutorial 2 (Aggregation and grouping)
5. Pandas Tutorial 3 (Data Formatting)
Also, this is a hands-on tutorial, so it’s the best if you do the coding part with me!
What is a histogram?
Start with the basics!
What is a histogram and how is it useful?
A histogram shows the number of occurrences of different values in a dataset. At first glance, it
is very similar to a bar chart.
It looks like this:
But a histogram is more than a simple bar chart.
Let me give you an example and you’ll see immediately why.
Let’s say that you run a gym and you have 250 clients. For some reason, you want to analyze
their heights. Good!
You have the individual data points – the height of each and every client in one big Python list:
height = [185, 172, 172, 169, 181, 162, 186, 171, 177, 174, 184, 163, 174, 173, 182, 169, 174,
170, 176, 179, 169, 182, 181, 179, 181, 171, 175, 170, 174, 179, 171, 173, 171, 170, 171, 175,
169, 177, 185, 180, 174, 170, 171, 186, 176, 172, 177, 188, 176, 179, 177, 173, 169, 173, 174,
179, 181, 181, 177, 181, 171, 183, 179, 174, 178, 175, 182, 185, 189, 167, 167, 172, 176, 181,
177, 163, 174, 180, 177, 180, 174, 174, 177, 178, 177, 176, 171, 178, 176, 182, 183, 177, 173,
172, 178, 176, 173, 176, 172, 180, 173, 183, 178, 179, 169, 177, 180, 170, 174, 176, 167, 177,
181, 170, 178, 168, 175, 166, 182, 178, 175, 171, 183, 187, 164, 183, 185, 178, 168, 181, 174,
172, 168, 179, 180, 172, 179, 169, 180, 176, 174, 175, 181, 180, 179, 176, 176, 179, 177, 180,
174, 161, 182, 189, 178, 175, 175, 175, 176, 169, 172, 170, 177, 174, 178, 174, 181, 177, 189,
164, 172, 181, 191, 174, 176, 174, 183, 174, 180, 174, 168, 177, 179, 183, 175, 172, 179, 177,
177, 175, 182, 178, 187, 182, 179, 166, 179, 178, 180, 182, 173, 180, 172, 187, 168, 165, 166,
170, 169, 187, 174, 167, 182, 172, 168, 181, 179, 173, 184, 176, 185, 179, 185, 176, 168, 190,
172, 174, 171, 174, 177, 177, 179, 186, 175, 168, 168, 172, 165, 180, 173, 174, 175, 167, 170,
180, 179, 173, 186, 168]
Note: it’s in centimeters, folks!
Looking at 250 data points is not very intuitive, is it?
As we’ve discussed in the statistical averages and statistical variability articles, you have to
“compress” these numbers into a few values that are easier to understand yet describe your
dataset well enough. These could be:
mean: 175.952
median: 176
mode: 174
standard deviation: 5.65
10% percentile: 168
90% percentile: 183
Based on these values, you can get a pretty good sense of your data…
But if you plot a histogram, too, you can also visualize the distribution of your data points.
For this dataset above, a histogram would look like this:
It’s very visual, very intuitive and tells you even more than the averages and variability measures
above. I love it!
Bins and ranges. A histogram is not the same as a bar chart!
You most probably realized that in the height dataset we have ~25-30 unique values. If you
simply counted the unique values in the dataset and put that on a bar chart, you would have
gotten this:
Bar chart that shows the frequency of unique values in the dataset
But when you plot a histogram, there’s one more initial step: these unique values will be
grouped into ranges. These ranges are called bins or buckets — and in Python, the default
number of bins is 10. So after the grouping, your histogram looks like this:
As I said: pretty similar to a bar chart — but not the same!
When is this grouping-into-ranges concept useful?
For instance when you have way too many unique values in your dataset. (In big data projects, it
won’t be ~25-30 as it was in our example… more like 25-30 *million* unique values.)
For instance, let’s imagine that you measure the heights of your clients with a laser meter and
you store first decimal values, too. Like this:
height = [185.7, 172.3, 172.8, 169.6, 181.2, 162.2, 186.5, 171.4, 177.9, 174.5, 184.8, 163.6,
174.1, 173.7, 182.8, 169.4, 175.0, 170.7, 176.3, 179.5, 169.4, 182.9, 181.4, 179.0, 181.4, 171.9,
175.3, 170.4, 174.4, 179.2, 171.9, 173.6, 171.9, 170.9, 172.0, 175.9, 169.3, 177.4, 186.0, 180.5,
174.8, 170.7, 171.5, 186.2, 176.3, 172.2, 177.1, 188.6, 176.7, 179.7, 177.8, 173.9, 169.1, 173.9,
174.7, 179.5, 181.0, 181.6, 177.7, 181.3, 171.5, 183.5, 179.1, 174.2, 178.9, 175.5, 182.8, 185.1,
189.1, 167.6, 167.3, 173.0, 177.0, 181.3, 177.9, 163.9, 174.2, 181.0, 177.4, 180.6, 174.7, 174.8,
177.1, 178.5, 177.2, 176.7, 172.0, 178.3, 176.7, 182.8, 183.2, 177.1, 173.7, 172.2, 178.5, 176.5,
173.9, 176.3, 172.3, 180.2, 173.3, 183.3, 178.4, 179.6, 169.4, 177.0, 180.4, 170.3, 174.4, 176.2,
167.8, 177.9, 181.1, 170.8, 178.1, 168.1, 175.8, 166.3, 182.7, 178.5, 175.9, 171.3, 183.6, 187.8,
164.9, 183.4, 185.8, 178.0, 168.8, 181.2, 174.9, 172.4, 168.6, 179.3, 180.8, 172.3, 179.1, 169.1,
180.8, 176.3, 174.9, 175.4, 181.2, 180.5, 179.2, 176.8, 176.5, 179.7, 177.4, 180.1, 174.1, 161.4,
182.2, 189.1, 178.6, 175.4, 175.2, 175.3, 176.1, 169.3, 172.9, 170.0, 177.5, 174.2, 179.0, 175.0,
181.9, 177.3, 189.1, 164.6, 172.1, 181.4, 191.2, 174.5, 176.3, 174.6, 184.0, 174.3, 180.1, 174.1,
168.4, 177.9, 179.0, 183.8, 175.3, 172.3, 179.4, 177.4, 177.7, 175.6, 183.0, 178.2, 187.4, 182.7,
180.0, 166.2, 179.6, 178.5, 180.9, 182.3, 173.6, 180.9, 172.6, 187.7, 168.0, 165.4, 166.1, 170.7,
169.3, 187.7, 174.0, 167.9, 182.7, 172.5, 168.6, 181.3, 179.7, 173.4, 184.4, 176.8, 185.7, 179.0,
185.4, 176.7, 168.7, 190.7, 172.7, 174.8, 171.8, 174.8, 177.5, 177.2, 180.0, 186.8, 175.3, 168.6,
168.9, 172.0, 166.0, 181.0, 173.0, 174.1, 176.0, 167.6, 170.8, 180.0, 179.7, 173.3, 186.9, 168.2]
This is the very same dataset as it was before… only one decimal more accurate.
But because of that tiny difference, now you have not ~25 but ~150 unique values. So if you
count the occurrences of each value and put it on a bar chart now, you would get this:
Ouch…
A histogram, though, even in this case, conveniently does the grouping for you. You get values
that are close to each other counted and plotted as values of given ranges/bins:
Beautiful… but more importantly: useful!
How to plot a histogram in Python (step by step)
Now that you know the theory, what a histogram is and why it is useful, it’s time to learn how to
plot one using Python. There are many Python libraries that can do so:
pandas
matplotlib
seaborn
…
But I’ll go with the simplest solution: I’ll use the .hist() function that’s built into pandas. As I said
in the introduction: you don’t have to do anything fancy here… You rather need a histogram
that’s useful and informative for you — and for your data science tasks.
Anyway, the .hist() pandas function is built on top of the original matplotlib solution. (See more
info in the documentation.) So the result and the visual you’ll get is more or less the same that
you’d get by using matplotlib… The syntax will also be similar, but a little bit closer to the logic
that you got used to in pandas. So in my opinion, it’s better for your learning curve to get familiar
with this solution.
Either way, let’s see how this works!
Note: if you are looking for something eye-catching, check out the seaborn Python dataviz library.
Step #1: Import pandas and numpy, and set matplotlib
One of the advantages of using the built-in pandas histogram function is that you don’t have
to import any other libraries than the usual: numpy and pandas.
At the very beginning of your project (and of your Jupyter Notebook), run these two lines:
import numpy as np
import pandas as pd
Great! numpy and pandas are imported and ready to use.
And don’t forget to add the:
%matplotlib inline
line, either — so you can plot your charts into your Jupyter Notebook.
Step #2: Get the data!
mu = 176 #mean
sigma = 6 #stddev
sample = 250
np.random.seed(1)
height_m = np.random.normal(mu, sigma, sample).astype(int)
# height_f (the female heights) is generated the same way; the mean and the seed
# below are assumptions, just so the rest of the tutorial runs:
np.random.seed(0)
height_f = np.random.normal(170, sigma, sample).astype(int)
Run them!
For now, you don’t have to know what exactly happened above. (I’ll write a separate article
about the np.random function.) Just know that this generated two datasets, with 250 data points
in each. And because I fixed the parameter of the random generator (with
the np.random.seed() line), you’ll get the very same numpy arrays with the very same data points
that I have.
In the height_f dataset you’ll get 250 height values of female clients of our hypothetical gym.
In the height_m dataset there are 250 height values of male clients.
Step #3: Prepare the data!
The more complex your data science project is, the more things you should do before you can
actually plot a histogram in Python.
Preparing your data is usually more than 80% of the job…
But in this simpler case, you don’t have to worry about data cleaning (removing duplicates, filling
empty values, etc.). You just need to turn your height_m and height_f data into a pandas
DataFrame.
Run this line:
gym = pd.DataFrame({'height_f': height_f, 'height_m': height_m})
Great:
We have the heights of female and male gym members in one big 250-row dataframe.
gym
[OPTIONAL] Basics: Plotting line charts and bar charts in Python using pandas
Before we plot the histogram itself, I wanted to show you how you would plot a line chart and a
bar chart that shows the frequency of the different values in the data set… so you’ll be able to
compare the different approaches.
And of course, if you have never plotted anything in pandas before, creating a simpler line chart
first can be handy.
To put your data on a chart, just type the .plot() function right after the pandas dataframe you
want to visualize. By default, .plot() returns a line chart.
If you plot() the gym dataframe as it is:
gym.plot()
you’ll get this:
Uhh. Messy.
On the y-axis, you can see the different values of the height_m and height_f datasets. And the x-
axis shows the indexes of the dataframe — which is not very useful in this case.
So let’s tweak this further!
To get what we wanted to get (plot the occurrence of each unique value in the dataset), we have
to work a bit more with the original dataset. Let’s add a .groupby() with a .count() aggregate
function. (I wrote more about these in this pandas tutorial.)
gym.groupby('height_m').count()
If you plot the output of this, you’ll get a much nicer line chart:
gym.groupby('height_m').count().plot()
frequency of values
This is closer to what we wanted… except that line charts are meant to show trends. If you want to compare different values, you should use bar charts instead.
To turn your line chart into a bar chart, just add the bar keyword:
gym.groupby('height_m').count().plot.bar()
or:
gym.groupby('height_m').count().plot(kind='bar')
And of course, you should run this for the height_f dataset, separately:
gym.groupby('height_f').count().plot.bar()
This is how you visualize the occurrence of each unique value on a bar chart in Python…
But this is still not a histogram, right!?
So…
Step #4: Plot a histogram in Python!
Once you have your pandas dataframe with the values in it, it’s extremely easy to put that on a
histogram.
Type this:
gym.hist()
plotting histograms in Python
Yepp, compared to the bar chart solution above, the .hist() function does a ton of cool things for
you, automatically:
1. It does the grouping.
When using .hist() there is no need for the initial .groupby() function! .hist() automatically
groups your data into bins. (By default, into 10 bins.)
Note: again, “grouping into bins” is not the same as “grouping by unique values” — as a bin
usually contains a range of values.
2. It does the counting. (No need for .count() function either.)
3. It plots a histogram for each column in your dataframe that has numerical values in it.
So plotting a histogram (in Python, at least) is definitely a very convenient way to visualize the
distribution of your data.
If you want a different amount of bins/buckets than the default 10, you can set that as a
parameter. E.g:
gym.hist(bins=20)
Bonus: Plot your histograms on the same chart!
Sometimes, you want to plot histograms in Python to compare two different columns of your
dataframe.
In that case, it’s handy if you don’t put these histograms next to each other — but on the very
same chart.
It can be done with a small modification of the code that we have used in the previous section.
gym.plot.hist(bins=20)
Note: in this version, you called the .hist() function from .plot.
Anyway, since these histograms are overlapping each other, I recommend setting their
transparency to 70% by using the alpha parameter:
gym.plot.hist(bins=20, alpha=0.7)
So you can see both charts.
Conclusion
This is it!
Just as I promised: plotting a histogram in Python is easy… as long as you want to keep it simple.
You can make this complicated by adding more parameters to display everything more nicely.
But you don’t have to…
Anyway, these were the basics. Just use the .hist() or the .plot.hist() functions on the dataframe
that contains your data points and you’ll get beautiful histograms that will show you the
distribution of your data.
And don’t stop here, continue with the pandas tutorial episode #5 where I’ll show you how to
plot a scatter plot in pandas.
If you want to learn more about how to become a data scientist, take my 50-minute
video course: How to Become a Data Scientist. (It’s free!)
Also check out my 6-week online course: The Junior Data Scientist’s First Month video
course.
Cheers,
Tomi Mester
Pandas tutorial 5: Scatter plot with pandas and matplotlib
Written by Tomi Mester on June 11, 2020
Scatter plots are frequently used in data science and machine learning projects. In this pandas
tutorial, I’ll show you two simple methods to plot one. Both solutions will be equally useful and
quick:
one will be using pandas (more precisely: pandas.plot.scatter())
the other one using matplotlib (matplotlib.pyplot.scatter())
Let’s see them — and as usual: I’ll guide you through step by step.
Note: If you don’t know anything about pandas (or Python), you might want to start here:
1. Python libraries and packages for Data Scientists
2. Learn Python from Scratch
3. Pandas Tutorial 1 (Basics)
4. Pandas Tutorial 2 (Aggregation and grouping)
5. Pandas Tutorial 3 (Data Formatting)
6. Pandas Tutorial 4 (Plotting in pandas: Bar Chart, Line Chart, Histogram)
Download the code base!
This is a hands-on tutorial, so it’s best if you do the coding part with me!
You can also find the whole code base for this article (in Jupyter Notebook format) here: Scatter
plot in Python.
You can download it from: here.
What is a scatter plot? And what is it good for?
Scatter plots are used to visualize the relationship between two (or sometimes three) variables
in a data set. The idea is simple:
you take a data point,
you take two of its variables,
the y-axis shows the value of the first variable,
the x-axis shows the value of the second variable
Following this concept, you display each and every datapoint in your dataset. You’ll get
something like this:
Boom! This is a scatter plot. At least, the easiest (and most common) example of it.
This particular scatter plot shows the relationship between the height and weight of people from
a random sample. Again:
y-axis shows the height
x-axis shows the weight
and each blue dot represents a person in this dataset
So, for instance, this person’s (highlighted with red) weight and height is 66.5 kg and 169 cm.
How to read a scatter plot?
Scatter plots play an important role in data science – especially in building/prototyping machine
learning models. Looking at the chart above, you can immediately tell that there’s a strong
correlation between weight and height, right? As we discussed in my linear regression article,
you can even fit a trend line (a.k.a. regression line) to this data set and try to describe this
relationship with a mathematical formula.
Something like this:
Note: this article is not about regression machine learning models, but if you want to get started with
that, go here: Linear Regression in Python using numpy + polyfit (with code base)
The above is called a positive correlation: the greater the height value, the greater the expected weight value. (Of course, this is a generalization of the data set. There are always
exceptions and outliers!)
But it’s also possible that you’ll get a negative correlation:
And in real-life data science projects, you’ll see no correlation often, too:
Anyway: if you see a sign of positive or negative correlation between two variables in a data
science project, that’s a good indicator that you found something interesting — something that’s
worth digging deeper into. Well, in 99% of cases it will turn out to be either a triviality, or a
coincidence. But in the remaining 1%, you might find gold!
Okay, I hope I set your expectations about scatter plots high enough.
It’s time to see how to create one in Python!
Scatter plot in pandas and matplotlib
As I mentioned before, I’ll show you two ways to create your scatter plot.
You’ll see here the Python code for:
a pandas scatter plot
and
a matplotlib scatter plot
The two solutions are fairly similar, the whole process is ~90% the same… The only difference is
in the last few lines of code.
Note: By the way, I prefer the matplotlib solution because I find it a bit more transparent.
I’ll guide you through these 4 steps:
1. Importing pandas, numpy and matplotlib
2. Getting the data
3. Preparing the data
4. Plotting a scatter plot
Step #1: Import pandas, numpy and matplotlib!
Just as we have done in the histogram article, as a first step, you’ll have to import the libraries
you’ll use. And you’ll also have to make a small tweak in your Jupyter environment.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
The first two lines will import pandas and numpy.
The third line will import the pyplot from matplotlib — also, we will refer to it as plt.
And %matplotlib inline sets your environment so you can directly plot charts into your Jupyter
Notebook!
Great!
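Step #2: Get the data!
For this tutorial you’ll need two numpy arrays – height and weight – with 250 data points each. Here’s a minimal sketch to generate them, assuming normally distributed heights and a noisy linear height-weight relationship (every number below is an assumption):
import numpy as np

np.random.seed(0)
height = np.random.normal(171, 6, 250)                      # 250 heights in cm
weight = (height - 100) * 0.9 + np.random.normal(0, 4, 250) # noisy linear relation to height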
Note: For now, you don’t have to know line by line what’s going on here. (I’ll write a separate article
about how numpy.random works.)
In the next step, we will push these data sets into pandas dataframes.
Step #3: Prepare the data!
Again, preparing, cleaning and formatting the data is a painful and time consuming process in
real-life data science projects. But in this tutorial, we are lucky, everything is prepared – the data
is clean – so you can push your height and weight data sets directly into a pandas dataframe
(called gym) by running this one line of code:
gym = pd.DataFrame({'height': height, 'weight': weight})
Note: If you want to experience the complexity of a true-to-life data science project, go and check out
my 6-week course: The Junior Data Scientist’s First Month!
Your gym dataframe should look like this.
Perfect: ready for putting it on a scatter plot!
Step #4a: Pandas scatter plot
Okay, all set, we have the gym dataframe. Let’s create a pandas scatter plot!
Now, this is only one line of code and it’s pretty similar to what we had for bar charts, line charts
and histograms in pandas…
It starts with: gym.plot …and then you simply have to define the chart type that you want to plot,
which is scatter(). But when plotting a scatter plot in pandas, you’ll always have to specify
the x and y values as parameters, too. (This could seem unusual because for bar and line charts, you
didn’t have to do anything similar to this.)
So the final line of code will be:
gym.plot.scatter(x = 'weight', y = 'height')
The x and y values – by definition – have to come from the gym dataframe, so you have to refer
to the column names: 'weight' and 'height'!
A quick comment: Watch out for all the apostrophes! I know from my live workshops that the
syntax might seem tricky at first. But you’ll get used to it after your 5th or 6th scatter plot, I
promise!
That’s it! You have plotted a scatter plot in pandas!
Step #4b: Matplotlib scatter plot
In my opinion, this matplotlib solution is a bit more elegant. But from a technical standpoint — and for
results — both solutions are equally great.
Anyway, type and run these three lines:
x = gym.weight
y = gym.height
plt.scatter(x,y)
x = gym.weight
This line defines what values will be displayed on the x-axis of the scatter plot. It’s
the weight column again from the gym dataset. (Note: This is in pandas Series format… But
in this specific case, I could have passed the original numpy array, too.)
y = gym.height
On the y-axis we want to display the gym.height values. (This is in pandas Series format,
too!)
plt.scatter(x,y)
And then this line does the plotting. Remember, you defined plt at the very beginning of your
Jupyter notebook (import matplotlib.pyplot as plt) — and so plt refers
to matplotlib.pyplot! And the x and y values are parameters that have been defined in the
previous two lines.
Again: this is slightly different (and in my opinion slightly nicer) syntax than with pandas.
But the result is exactly the same.
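(Strictly speaking, there is one small cosmetic difference: pandas labels the axes automatically from the column names, while plt.scatter leaves them empty. If you want the same labels here, you can add them yourself:)
plt.scatter(x, y)
plt.xlabel('weight')   # pandas added these labels automatically
plt.ylabel('height')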
Conclusion
This is how you make a scatter plot in pandas and/or in matplotlib. I think it’s fairly easy and I
hope you think the same. If you haven’t done so yet, check out my Python histogram tutorial,
too! If you have any questions, leave a comment below!
If you want to learn more about how to become a data scientist, take my 50-minute
video course: How to Become a Data Scientist. (It’s free!)
Also check out my 6-week online course: The Junior Data Scientist’s First Month video
course.
Cheers,
Tomi Mester
Linear Regression in Python using numpy + polyfit (with
code base)
Written by Tomi Mester on February 20, 2020
Last updated on March 09, 2020
I always say that learning linear regression in Python is the best first step towards machine
learning. Linear regression is simple and easy to understand even if you are relatively new to
data science. So spend time on understanding it 100%! If you get a grasp of its logic, it will serve
you as a great foundation for more complex machine learning concepts in the future.
In this tutorial, I’ll show you everything you’ll need to know about it: the mathematical
background, different use-cases and most importantly the implementation. We will do that in
Python — by using numpy (polyfit).
Note: This is a hands-on tutorial. I highly recommend doing the coding part with me! If you haven’t
done so yet, you might want to go through these articles first:
1. How to install Python, R, SQL and bash to practice data science!
2. Python libraries and packages for Data Scientists
3. Learn Python from Scratch
Download the code base!
Find the whole code base for this article (in Jupyter Notebook format) here:
Linear Regression in Python (using Numpy polyfit)
Download it from: here.
The mathematical background
Remember when you learned about linear functions in math classes?
I have good news: that knowledge will become useful after all!
Here’s a quick recap!
For linear functions, we have this formula:
y = a*x + b
In this equation, usually, a and b are given. E.g:
a=2
b=5
So:
y = 2*x + 5
Knowing this, you can easily calculate all y values for given x values.
E.g.
when x is… y is…
0 2*0 + 5 = 5
1 2*1 + 5 = 7
2 2*2 + 5 = 9
3 2*3 + 5 = 11
4 2*4 + 5 = 13
…
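By the way, if you’d rather let Python do this arithmetic, a tiny sketch like this one reproduces the table above:
a = 2
b = 5
for x in range(5):
    print(x, a * x + b)   # prints the x-y pairs: (0, 5), (1, 7), (2, 9), ...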
If you put all the x–y value pairs on a graph, you’ll get a straight line.
The a variable is often called slope because – indeed – it defines the slope of the line.
The b variable is called the intercept. b is the value where the plotted line intersects the y-
axis. (Or in other words, the value of y is b when x = 0.)
This is all you have to know about linear functions for now…
But why did I talk so much about them?
Because linear regression is nothing else but finding the exact linear function equation (that is:
finding the a and b values in the y = a*x + b formula) that fits your data points the best.
Note: Here’s some advice if you are not 100% sure about the math. The most intuitive way to
understand the linear function formula is to play around with its values. Change the a and b variables
above, calculate the new x-y value pairs and draw the new graph. Repeat this as many times as
necessary. (Tip: try out what happens when a = 0 or b = 0!) By seeing the changes in the value pairs
and on the graph, sooner or later, everything will fall into place.
A typical linear regression example
Machine learning – just like statistics – is all about abstractions. You want to simplify reality so
you can describe it with a mathematical formula. But to do so, you have to ignore natural
variance — and thus compromise on the accuracy of your model.
If this sounds too theoretical or philosophical, here’s a typical linear regression example!
We have 20 students in a class and we have data about a specific exam they have taken. Each
student is represented by a blue dot on this scatter plot:
the X axis shows how many hours a student studied for the exam
the Y axis shows the scores that she eventually got
E.g. one of them studied 24 hours and her test result was 58%.
We have 20 data points (20 students) here.
By looking at the whole data set, you can intuitively tell that there must be a correlation between
the two factors. If one studies more, she’ll get better results on her exam. But you can see the
natural variance, too. For instance, three students who studied for ~30 hours got very different
scores: 74%, 65% and 40%.
Anyway, let’s fit a line to our data set — using linear regression.
Nice, we got a line that we can describe with a mathematical equation – this time, with a linear
function. The general formula was:
y = a*x + b
And in this specific case, the a and b values of this line are:
a = 2.01
b = -3.9
So the exact equation for the line that fits this dataset is:
y = 2.01*x - 3.9
And how did I get these a and b values? By using machine learning.
If you know enough x–y value pairs in a dataset like this one, you can use linear regression
machine learning algorithms to figure out the exact mathematical equation (so
the a and b values) of your linear function.
Linear regression terminology
Before we go further, I want to talk about the terminology itself — because I see that it confuses
many aspiring data scientists. Let’s fix that here!
Okay, so one last time, this was our linear function formula:
y = a*x + b
The a and b variables:
The a and b variables in this equation define the position of your regression line and I’ve already
mentioned that the a variable is called slope (because it defines the slope of your line) and
the b variable is called intercept.
In the machine learning community the a variable (the slope) is also often called the regression
coefficient.
The x and y variables:
The x variable in the equation is the input variable — and y is the output variable.
This is also a very intuitive naming convention. For instance, in this equation:
y = 2.01*x - 3.9
If your input value is x = 1, your output value will be y = -1.89.
But in machine learning these x-y value pairs have many alternative names… which can cause
some headaches. So here are a few common synonyms that you should know:
input variable (x) – output variable (y)
independent variable (x) – dependent variable (y)
predictor variable (x) – predicted variable (y)
feature (x) – target (y)
See, the confusion is not an accident… But at least, now you have your linear regression
dictionary here.
How does linear regression become useful?
Having a mathematical formula – even if it doesn’t 100% perfectly fit your data set – is useful for
many reasons.
1. Predictions: Based on your linear regression model, if a student tells you how much she
studied for the exam, you can come up with a pretty good estimate: you can predict her
results even before she writes the test. Let’s say someone studied 20 hours; it means that
her predicted test result will be 2.01 * 20 - 3.9 = 36.3.
2. Outliers: If something unexpected shows up in your dataset – someone is way too far
from the expected range…
… let’s say, someone who studied only 18 hours but got almost 100% on the exam… Well, that
student is either a genius — or a cheater. But she’s definitely worth the teachers’ attention,
right? By the way, in machine learning, the official name of these data points is outliers.
And both of these examples can be translated very easily to real life business use-cases, too!
Predictions are used for: sales predictions, budget estimations, in manufacturing/production, in
the stock market and in many other places. (Although, usually these fields use more sophisticated
models than simple linear regression.)
Finding outliers is great for fraud detection. And it’s widely used in the fintech industry. (E.g.
preventing credit card fraud.)
The limitations of machine learning models
It’s good to know that even if you find a very well-fitting model for your data set, you still have
to reckon with some limitations.
Note: These are true for essentially all machine learning algorithms — not only for linear regression.
Limitation #1: a model is never a perfect fit
As I said, fitting a line to a dataset is always an abstraction of reality. Describing something with a
mathematical formula is sort of like reading a short summary of Romeo and Juliet. You’ll get
the essence… but you will miss out on all the interesting, exciting and charming details.
Similarly, in data science, “compressing” your data into one simple linear function comes with
losing the whole complexity of the dataset: you’ll ignore natural variance.
But in many business cases, that can be a good thing. Your mathematical model will be simple
enough that you can use it for your predictions and other calculations.
Note: One big challenge of being a data scientist is to find the right balance between a too-simple and
an overly complex model — so the model can be as accurate as possible. (This problem even has a
name: bias-variance tradeoff, and I’ll write more about this in a later article.)
But a machine learning model – by definition – will never be 100% accurate.
Limitation #2: you can’t go beyond the range of your historical data
Many data scientists try to extrapolate their models and go beyond the range of their data.
For instance, in our case study above, you had data about students studying for 0-50 hours. The
dataset didn’t feature any student who studied 60, 80 or 100 hours for the exam. These values
are out of the range of your data. If you wanted to use your model to predict test results for
these “extreme” x values… well, you would get nonsensical y values:
E.g. your model would say that someone who studied x = 80 hours would get:
y = 2.01*80 - 3.9 = 156.9% on the test.
…but 100% is the obvious maximum, right?
The point is that you can’t extrapolate your regression model beyond the scope of the data that
you have used to create it. Well, in theory, at least...
Because I have to admit that in real life data science projects, sometimes, there is no way around
it. If you have data about the last 2 years of sales — and you want to predict the next month, you
have to extrapolate. Even so, we always try to be very careful and not look too far into the
future. The further you get from your historical data, the worse your model’s accuracy will be.
Linear Regression in Python
Okay, now that you know the theory of linear regression, it’s time to learn how to get it done in
Python!
Let’s see how you can fit a simple linear regression model to a data set!
Well, in fact, there is more than one way of implementing linear regression in Python. Here, I’ll
present my favorite — and in my opinion the most elegant — solution. I’ll use numpy and
its polyfit method.
We will go through these 6 steps:
1. Importing the Python libraries we will use
2. Getting the data
3. Defining x values (the input variable) and y values (the output variable)
4. Machine Learning: fitting the model
5. Interpreting the results (coefficient, intercept) and calculating the accuracy of the model
6. Visualization (plotting a graph)
Note: You might ask: “Why isn’t Tomi using sklearn in this tutorial?” I know that (in online tutorials at
least) numpy and its polyfit method are less popular than the Scikit-learn alternative… true. But in my
opinion, numpy’s polyfit is more elegant, easier to learn — and easier to maintain in
production! sklearn’s linear regression function changes all the time, so if you implement it in
production and you update some of your packages, it can easily break. I don’t like that. Besides, the
way it’s built and the extra data-formatting steps it requires seem somewhat strange to me. In my
opinion, sklearn is highly confusing for people who are just getting started with Python machine
learning algorithms. (By the way, I had the sklearn LinearRegression solution in this tutorial… but I
removed it. That’s how much I don’t like it. So trust me, you’ll like numpy + polyfit better, too. :-))
Linear Regression in Python – using numpy + polyfit
Fire up a Jupyter Notebook and follow along with me!
Note: Find the code base here and download it from here.
STEP #1 – Importing the Python libraries
Before anything else, you want to import a few common data science libraries that you will use
in this little project:
numpy
pandas (you will store your data in pandas DataFrames)
matplotlib.pyplot (you will use matplotlib to plot the data)
Note: if you haven’t installed these libraries and packages to your remote server, find out how to do
that in this article.
Start with these few lines:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
(The %matplotlib inline is there so you can plot the charts right into your Jupyter Notebook.)
To be honest, I almost always import all these libraries and modules at the beginning of my
Python data science projects, by default. But apart from these, you won’t need any extra
libraries: polyfit — which we will use for the machine learning step — comes with numpy.
STEP #2 – Getting the data
The next step is to get the data that you’ll work with. In this case study, I prepared the data and
you just have to copy-paste these two lines to your Jupyter Notebook:
students = {'hours': [29, 9, 10, 38, 16, 26, 50, 10, 30, 33, 43, 2, 39, 15, 44, 29, 41, 15, 24, 50],
'test_results': [65, 7, 8, 76, 23, 56, 100, 3, 74, 48, 73, 0, 62, 37, 74, 40, 90, 42, 58, 100]}
student_data = pd.DataFrame(data=students)
This is the very same data set that I used for demonstrating a typical linear regression example at
the beginning of the article. You know, with the students, the hours they studied and the test
scores.
Just print the student_data DataFrame and you’ll see the two columns with the value-pairs we
used.
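(In a Jupyter Notebook, typing the dataframe’s name into a cell is enough to display it:)
student_data   # displays all 20 rows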
the hours column shows how many hours each student studied
and the test_results column shows what their test results were
(So one line is one student.)
Of course, in real life projects, we instead open .csv files (with the read_csv function) or SQL
tables (with read_sql)… Regardless, the final format of the cleaned and prepared data will be a
similar dataframe.
So this is your data. You will fine-tune it and make it ready for the machine learning step.
Note: And another thought about real life machine learning projects… In this tutorial, we are working
with a clean dataset. That’s quite uncommon in real life data science projects. A big part of the data
scientist’s job is data cleaning and data wrangling: like filling in missing values, removing duplicates,
fixing typos, fixing incorrect character coding, etc. Just so you know.
STEP #3 – Defining the feature and target values
Okay, so we have the data set.
But we have to tweak it a bit — so it can be processed by numpy’s linear regression function.
The next required step is to break the dataframe into:
input (x) values: this will be the hours column
and output (y) values: and this is the test_results column
polyfit requires you to define your input and output variables in 1-dimensional format. For that,
you can use pandas Series.
Let’s type this into the next cell of your Jupyter notebook:
x = student_data.hours
y = student_data.test_results
Okay, the input and output — or, using their fancy machine learning names,
the feature and target — values are defined.
At this step, we can even put them onto a scatter plot, to visually understand our dataset.
It’s only one extra line of code:
plt.scatter(x,y)
And I want you to realize one more thing here: so far, we have done zero machine learning… This
was only old-fashioned data preparation.
STEP #4 – Machine Learning: Linear Regression (line fitting)
We have the x and y values… So we can fit a line to them!
The process itself is pretty easy.
Type this one line:
model = np.polyfit(x, y, 1)
This executes the polyfit method from the numpy library that we have imported before. It needs
three parameters: the previously defined input and output variables (x, y) — and an integer,
too: 1. This latter number defines the degree of the polynomial you want to fit.
Using polyfit, you can fit second, third, etc… degree polynomials to your dataset, too. (That’s not
called linear regression anymore — but polynomial regression. Anyway, more about this in a later
article…)
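(Just as an illustration, and not part of this tutorial’s solution: a second degree fit would only change that last parameter.)
model_quadratic = np.polyfit(x, y, 2)   # fits y = a*x**2 + b*x + c instead of a line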
But for now, let’s stick with linear regression and linear models – which will be a first degree
polynomial. So you should just put: 1.
When you hit enter, Python calculates every parameter of your linear regression model and
stores it into the model variable.
This is it, you are done with the machine learning step!
But how does this line fitting actually work? numpy’s polyfit uses the so-called ordinary least
squares (OLS) method. To see its logic, let’s take a data point from our dataset.
x = 24
In the original dataset, the y value for this datapoint was y = 58. But when you fit a simple linear
regression model, the model itself estimates only y = 44.3. The difference between the two is
the error for this specific data point.
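Here’s a quick sketch of that error calculation, assuming the model variable we created in the previous step:
x_point = 24
y_actual = 58
y_estimated = model[0] * x_point + model[1]   # the model's estimate (a*x + b)
error = y_actual - y_estimated                # the error for this data point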
So the ordinary least squares method has these 4 steps:
1) Let’s calculate all the errors between all data points and the model.
2) Let’s square each of these error values!
3) Then sum all these squared values!
4) Find the line where this sum of the squared errors is the smallest possible value.
That’s OLS and that’s how line fitting works in numpy polyfit’s linear regression solution.
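Although the article doesn’t walk through it, you can reproduce steps 1) to 3) with a few lines of numpy; np.polyval evaluates the fitted polynomial for every x value:
errors = y - np.polyval(model, x)   # step 1: error for every data point
squared_errors = errors ** 2        # step 2: square each error value
sse = squared_errors.sum()          # step 3: sum of the squared errors
polyfit then finds the coefficients for which this sum is the smallest possible (step 4).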
STEP #5 – Interpreting the results
Okay, so you’re done with the machine learning part. Let’s see what you got!
First, you can query the regression coefficient and intercept values for your model. You just have
to type:
model
Note: Remember, model is a variable that we used at STEP #4 to store the output of np.polyfit(x, y, 1).
The output is:
array([ 2.01467487, -3.9057602 ])
These are the a and b values we were looking for in the linear function formula.
2.01467487 is the regression coefficient (the a value) and -3.9057602 is the intercept
(the b value).
So we finally got our equation that describes the fitted line. It is:
y = 2.01467487 * x - 3.9057602
If a student tells you how many hours she studied, you can predict the estimated results of her
exam. Quite awesome!
You can do the calculation “manually” using the equation.
But there is a simple helper function for it in numpy — it’s called poly1d():
predict = np.poly1d(model)
hours_studied = 20
predict(hours_studied)
The result is: 36.38773723
Note: This is the exact same result that you’d have gotten if you put the hours_studied value in the
place of the x in the y = 2.01467487 * x - 3.9057602 equation.
So from this point on, you can use these coefficient and intercept values – and
the poly1d() method – to estimate unknown values.
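And since a poly1d object accepts whole arrays, too, you can estimate several students at once:
predict(np.array([10, 30, 50]))   # estimated test results for 10, 30 and 50 hours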
And this is how you do predictions by using machine learning and simple linear regression in
Python.
Well, okay, one more thing…
There are a few methods to calculate the accuracy of your model. In this article, I’ll show you
only one: the R-squared (R2) value. I won’t go into the math here (this article has gotten pretty
long already)… it’s enough if you know that the R-squared value is a number between 0 and 1.
And the closer it is to 1, the more accurate your linear regression model is.
Unfortunately, R-squared calculation is not implemented in numpy… so that one should be
borrowed from sklearn (so we can’t completely ignore Scikit-learn after all :-)):
from sklearn.metrics import r2_score
r2_score(y, predict(x))
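STEP #6 – Visualization
Finally, let’s plot the original data points and the fitted regression line on one chart! These four lines do exactly that; put them into one cell and run them:
x_lin_reg = range(0, 51)
y_lin_reg = predict(x_lin_reg)
plt.scatter(x, y)
plt.plot(x_lin_reg, y_lin_reg, c = 'r')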
Here’s a quick explanation:
x_lin_reg = range(0, 51)
This sets the range you want to display the linear regression model over — in our case it’s
between 0 and 50 hours.
y_lin_reg = predict(x_lin_reg)
This calculates the y values for all the x values between 0 and 50.
plt.scatter(x, y)
This plots your original dataset on a scatter plot. (The blue dots.)
plt.plot(x_lin_reg, y_lin_reg, c = 'r')
And this line finally draws the linear regression line — based on
the x_lin_reg and y_lin_reg values that we set in the previous two lines. (c = 'r' means that
the color of the line will be red.)
Nice, you are done: this is how you create linear regression in Python using numpy and polyfit.
This was only your first step toward machine learning
You are done with building a linear regression model!
But this was only the first step. In fact, this was only simple linear regression. But there
is multiple linear regression (where you can have multiple input variables), there
is polynomial regression (where you can fit higher degree polynomials) and many many more
regression models that you should learn. Not to mention the different classification models,
clustering methods and so on…
Here, I haven’t covered the validation of a machine learning model (e.g. when you break your
dataset into a training set and a test set), either. But I’m planning to write a separate tutorial
about that, too.
Anyway, I’ll get back to all these, here, on the blog!
So stay tuned!
Conclusion
Linear regression is the most basic machine learning model that you should learn.
If you understand every small bit of it, it’ll help you to build the rest of your machine learning
knowledge on a solid foundation.
Knowing how to use linear regression in Python is especially important — since that’s the
language that you’ll probably have to use in a real life data science project, too.
This article was only your first step! So stay with me and join the Data36 Inner Circle (it’s free).
If you want to learn more about how to become a data scientist, take my 50-minute
video course: How to Become a Data Scientist. (It’s free!)
Also check out my 6-week online course: The Junior Data Scientist’s First Month video
course.
Cheers,
Tomi Mester