IntroToPythonAndR Chapter

This chapter introduces the programming languages R and Python, focusing on their essential features for data analysis. It covers installation, basic commands, and the use of RStudio as an Integrated Development Environment (IDE) for R, including how to install packages and perform operations with vectors. The chapter also discusses logical operations, functions, and statistical functions available in R for data analysis tasks.

Uploaded by

Cindy Noadje

See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/344898430

Introduction to Programming R and Python Languages

Chapter · January 2017


DOI: 10.4018/978-1-68318-016-6.ch002

CITATIONS: 0    READS: 1,477

2 authors:

Rui Sarmento Vera Costa

79 PUBLICATIONS 255 CITATIONS


University of Porto
24 PUBLICATIONS 153 CITATIONS

All content following this page was uploaded by Rui Sarmento on 28 April 2021.


Introduction to Programming
R and Python Languages

INTRODUCTION
This chapter introduces the basics of the two languages we use for the data analysis tasks in this book.
We first introduce some features of R and then present the necessary features of Python. We do not
cover every feature of either language, only the essential characteristics the reader needs to be aware of
in order to progress through the later stages of this book.

TOOLS
As previously stated, besides focusing on the statistical tasks later in this book, we provide practical
procedures and examples in both R and Python. There are many sources of information about these
languages, so we give only a brief summary of the characteristics of each. We start with R;
information about Python is available further on in this chapter.

R
R is a powerful programming language, used in statistical tasks, data analysis, numerical analysis and
others. The main characteristics of R are:
• Powerful
• Stable
• Free
• Programmable
• Open Source Software
• Directed to the Visualization of Data

On the downside, R might not initially suit everyone, since it relies on commands typed at the command
line. We address this later in this chapter to make the reader’s life easier.
First, the reader will need to install R for his/her operating system (OS). R is available for Mac, Windows,
and Linux on the CRAN website¹. Figure 13 shows an overview of the website to download R.

¹ https://cran.r-project.org

Figure 13 Overview of the website to download R

How to Use
The R installation comes with a set of executables, including a GUI (Graphical User Interface) executable
(on Windows, for example, it is usually named RGui.exe). Figure 14 shows this GUI.

Figure 14 Example of an R executable GUI

In Figure 14, the reader can see some information, including the version of R installed on the computer.
Some commands are also listed for exploring the help available on R. Figure 15 shows the use of the
commands help.start() and q(). The command help.start() opens a browser window with the HTML
version of the R manuals, as shown in Figure 16. The q() command quits the R application.

Figure 15 Example of some R commands

Figure 16 R manual (Internet browser version)

In the manual browser, several links give the reader access to documentation that can provide further
assistance. Searching the manuals for an answer will probably be needed at some point with R, or with
any other language, so the reader should not be afraid to explore this documentation.

A Session with R
With the RGui open, the reader can try several commands, including expressions. For example, try a
simple arithmetic operation. Type the following expression in the R console and press Enter:
> 3 + 2 * 5
Figure 17 shows the result in the R console. The line following the input expression states that the
result is 13, because the multiplication is evaluated before the addition.

Figure 17 Example of mathematics operations with R’s console

Additionally, Figure 17 shows how to store the result of an expression in an object. In this
example, we stated that x equals three squared with the following expression:
> x <- 3^2
The result, 9, is stored in the object x. The reader can name this object anything he/she likes, as long as
the name contains no white space. For example, my_object_x would work nicely, but my object x would
give a syntax error. The reader should also remember that R is case sensitive when naming objects.
Hence, for example, an object called X is completely distinct from the object x.
Figure 17 also shows how to use the stored object x, this time to obtain the square root of its value with
the expression sqrt(x). The reader has probably noticed that, when the result was stored in x, R did not
print it. To see the value, just type the object name x and press Enter. R then prints the value
previously stored in x.

Installing RStudio
By now, the reader is probably asking himself/herself whether there is a better way to work with R
commands. Typing one command at a time and pressing Enter after each one quickly becomes tedious.
Thus, we suggest using an Integrated Development Environment (IDE) to work efficiently with R.
Several IDEs and GUIs are available nowadays, for example the RCommander GUI and the RStudio IDE.
We will proceed with the RStudio IDE². Figure 18 shows an overview of the site to download RStudio.

Figure 18 Overview of the site to download RStudio IDE

After installation, the reader should start the RStudio program and will immediately notice four
windows on the screen, as shown in Figure 19. The upper left window is where the reader writes the
commands he/she wishes R to run. In Figure 19, we entered the same code used earlier with the R
console.
The lower left window contains the R console. This is where the results of the commands appear. The
upper right window shows the objects in the environment; the only object available at the moment, x, is
listed there together with its value after running the code we provided. The reader might be asking how
to run the code. There are two choices: clicking the button named Run or the button named Source, both
in the upper left window. These buttons behave differently. With the Run button, the code is executed
one line at a time. The Run button also executes only the part of the code selected with the mouse. The
reader could try, for example, to select only the line sqrt(x) and click Run. Only this line would be
executed. Clicking Source, instead, runs all the code in the upper left window at once.
Finally, the lower right window has several tabs that support different aspects of working with R. Here,
it is possible to access the R manual and search the guides by keyword. This is also the window that
presents plots or charts.

² https://www.rstudio.com/home/

Figure 19 RStudio screen

Installing Packages
Installing packages is something the reader will do throughout this book. Although the initial R
installation already comes with many packages, the community has developed many more. Some tasks
need these additional packages, so they must be installed.
In RStudio, the Tools menu includes an option to install packages. Figure 20 shows this option.

Figure 20 Tools to install packages in RStudio


A small window then appears, in which the reader should write the name of the package to install.
Figure 21 shows this window.

Figure 21 Install packages window (RStudio)

Please notice that, as the reader types a package name, RStudio suggests matching packages, and the
reader should select the ones he/she needs. As an example, we installed the package “StatRank”. The
console, in the lower left window, now shows the status of the package installation:

> install.packages("StatRank")
Installing package into ‘C:/Users/Rui Sarmento/Documents/R/win-library/3.0’
(as ‘lib’ is unspecified)

There is a binary version available (and will be
installed) but the source version is later:
         binary source
StatRank  0.0.4  0.0.6

also installing the dependencies ‘evd’, ‘truncdist’

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.0/evd_2.3-0.zip'
Content type 'application/zip' length 1176785 bytes (1.1 Mb)
opened URL
downloaded 1.1 Mb

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.0/truncdist_1.0-1.zip'
Content type 'application/zip' length 26454 bytes (25 Kb)
opened URL
downloaded 25 Kb

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.0/StatRank_0.0.4.zip'
Content type 'application/zip' length 147850 bytes (144 Kb)
opened URL
downloaded 144 Kb

package ‘evd’ successfully unpacked and MD5 sums checked
package ‘truncdist’ successfully unpacked and MD5 sums checked
package ‘StatRank’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
C:\Users\Rui Sarmento\AppData\Local\Temp\RtmpgBJthk\downloaded_packages

As we selected the (recommended) option to install dependencies, RStudio downloaded the needed files
from the Internet and installed all the required packages, including the package’s dependencies, which
are packages themselves.

Vectors
Vectors are a typical structure in programming. The reader can store several values of the same type in a
vector. Imagine a train with several coaches: the train is the vector, each coach is a position in the
vector, and each coach holds a stored value. For example, we represent a vector of integer values from 1
to 10 this way:

> vector <- 1:10
> vector
[1] 1 2 3 4 5 6 7 8 9 10

If the reader applies an operation to the vector, R applies it to every position of the vector. To add 2
to all the elements in the vector, we would simply do:
> vector + 2
[1] 3 4 5 6 7 8 9 10 11 12

The reader can also apply operations between vectors. For example, he/she can add another vector to the
previous one. Imagine we wanted to add the following vectors:
> vector2 <- 3:12
> vector2
[1] 3 4 5 6 7 8 9 10 11 12

> vector + vector2
[1] 4 6 8 10 12 14 16 18 20 22

The reader must keep in mind that the vectors should have the same length. Otherwise, R still performs
the addition by recycling the shorter vector, that is, by repeating its values until it matches the length
of the longer one, and it produces a warning when the longer length is not a multiple of the shorter one.
We will return to this later and explain in more detail what happens with vectors of different lengths.
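
Recycling can be pictured with a short sketch in Python, the second language used in this book. It only illustrates the rule and is not how R is implemented: the helper name recycle_add is hypothetical, and the sketch leaves out the warning R issues when the longer length is not a multiple of the shorter one.

```python
from itertools import cycle, islice


def recycle_add(a, b):
    """Add two sequences element-wise, repeating the shorter one (hypothetical helper)."""
    n = max(len(a), len(b))
    # cycle() repeats a sequence endlessly; islice() cuts it back to n elements
    left = islice(cycle(a), n)
    right = islice(cycle(b), n)
    return [x + y for x, y in zip(left, right)]


# A vector of length 10 plus a vector of length 2: the short one is reused five times
result = recycle_add(list(range(1, 11)), [1, 2])
print(result)
```

Here the values 1 and 2 are added alternately to the ten elements, which mirrors what vector + c(1,2) would do in R.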

Type
A vector can store values of different types. The most used types are:
• Character
• Logical
• Numeric
• Complex

With the function mode(), we can check the type of a vector:
> mode(vector)
[1] "numeric"

As we expected, our vector object is a numeric vector with integers from 1 to 10 stored in it. An example
of other types of vectors could be:
> char.vector <- c("String1","String2","String3")
> mode(char.vector)
[1] "character"

In the previous example, we used another function to create a vector, the c() function. This function
combines its arguments into a vector of any type. For instance, to create a numeric vector we could do:
> num.vector <- c(12.5,5.64,7.84)
> mode(num.vector)
[1] "numeric"

Length
Sometimes it is convenient to know the length of a vector, that is, the number of elements it holds. This
can be obtained with the function length(). Some examples using this function are:
> length(vector)
[1] 10

> length(char.vector)
[1] 3

> length(num.vector)
[1] 3

Indexing
We can access the elements of a vector by using indexes. For example, to access the first element of the
previously stated vector (char.vector) we would write the following command:
> char.vector[1]
[1] "String1"

To retrieve a sequence of positions, for example positions 1 through 2, we can do it in several ways:
> char.vector[1:2]
[1] "String1" "String2"

> char.vector[c(1,2)]
[1] "String1" "String2"

To retrieve the first position of the vector char.vector and then the third, we would write:
> char.vector[c(1,3)]
[1] "String1" "String3"

Vector Names
We can also name the elements or positions of a vector. For example, with our vector num.vector of
length 3, we could issue the following command:

> names(num.vector) <- c("Math Grade", "French Grade", "German Grade")
> num.vector
  Math Grade French Grade German Grade
       12.50         5.64         7.84

With the previous example, we added information about each stored element to the vector num.vector.
Following the previous indexing procedures, we can also retrieve a vector’s elements through their
name(s):

> num.vector["Math Grade"]
Math Grade
      12.5

> num.vector[c("Math Grade","German Grade")]
  Math Grade German Grade
       12.50         7.84

Logical Operations with Vectors


R supports logical operations on vectors, which makes it easy to filter their elements. As an example, if
we need to know which grades in the vector are above ten, we could do:

> num.vector[num.vector > 10]
Math Grade
      12.5

We can also do logical operations with intervals. For example, if we need to know the grades in our
numeric vector that are above six AND below ten, we would do:

> num.vector[num.vector < 10 & num.vector > 6]
German Grade
        7.84

Other operators, like the logical OR, are also possible. If we needed to know the grades above 10 OR
below six, we would do:

> num.vector[num.vector > 10 | num.vector < 6]
  Math Grade French Grade
       12.50         5.64
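
For readers who will also follow the Python examples in this book, the same three filters can be sketched in plain Python. This is an illustration under assumptions: the dictionary grades is a hypothetical stand-in for num.vector and its names, and dictionary comprehensions play the role of R's logical indexing.

```python
# Hypothetical stand-in for the named R vector num.vector
grades = {"Math Grade": 12.5, "French Grade": 5.64, "German Grade": 7.84}

# Grades above ten
above_ten = {name: g for name, g in grades.items() if g > 10}

# Grades above six AND below ten (Python allows chained comparisons)
between = {name: g for name, g in grades.items() if 6 < g < 10}

# Grades above ten OR below six
either = {name: g for name, g in grades.items() if g > 10 or g < 6}

print(above_ten)
print(between)
print(either)
```

Each comprehension keeps only the entries whose value satisfies the condition, just as the logical vector inside the brackets does in R.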

Functions
If the reader has been following our first examples, he/she has already used some functions. Remember
sqrt(), length(), or even mode()? Those are functions.
Functions are useful because they save the programmer from rewriting the same code every time he/she
wants to use it again. The great thing about new libraries or packages is that they generally come with a
set of functions that provide predefined operations. In other words, a function takes inputs, runs some
internal procedures on them, and gives the output the user desires. Have a look at the following example
of a function in R:

add <- function(x,y){
  x+y
}

This is the declaration of a function in R. The name of the function we programmed is add(). If we look
closely, we can see that this function has two possible inputs, x and y, and the internal instruction is to add
these two inputs. An example of use of this function would be:

> add(x=2,y=2)
[1] 4

Evidently, in this example, we wish to add two plus two, which are respectively the inputs x and y of the
function. The result we obtain in this case is correct and equal to 4.

Statistical Functions
R has a variety of included functions that we might use in our statistical tasks. Some of them are:
• max
• min
• mean
• sd
• summary
• and many others

Examples of those functions with our numeric vector num.vector would provide the following results:
> max(num.vector)
[1] 12.5

> min(num.vector)
[1] 5.64

> mean(num.vector)
[1] 8.66

> sd(num.vector)
[1] 3.502742

> summary(num.vector)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   5.64    6.74    7.84    8.66   10.17   12.50

Some of these functions have self-explanatory names. Others, like sd (standard deviation) and summary,
are explained further in this book (in the descriptive chapter).

Another useful function we will use later in this book is the table() function. Suppose we have
information about the grades of students in several Ph.D. courses. We have the following vectors:

> students <- c("John","Mike","Vera","Sophie","Anna","Vera","Vera","Mike","Anna")
> courses <- c("Math","Math","Math","Research","Research 2","Research","Research 2","Computation","Computation")
> grades <- c(13,13,14,16,16,13,17,10,14)

If we want to count the number of grades we have for each student, we can do:

> table(students)
students
  Anna   John   Mike Sophie   Vera
     2      1      2      1      3

Additionally, we can cross two vectors by creating a contingency table. For this we can do:

> table(students, courses)
        courses
students Computation Math Research Research 2
  Anna             1    0        0          1
  John             0    1        0          0
  Mike             1    1        0          0
  Sophie           0    0        1          0
  Vera             0    1        1          1

The results show the courses each student is taking and also how many students we have for each course
in this available data.
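
As a rough Python counterpart, the two counts can be sketched with collections.Counter from the standard library. This is only an illustration: the variable names per_student and crossed are our own, and the result is a set of counts rather than the formatted table that R prints.

```python
from collections import Counter

students = ["John", "Mike", "Vera", "Sophie", "Anna", "Vera", "Vera", "Mike", "Anna"]
courses = ["Math", "Math", "Math", "Research", "Research 2", "Research",
           "Research 2", "Computation", "Computation"]

# One-way count: how many grades each student has
per_student = Counter(students)

# Two-way count: how often each (student, course) pair occurs
crossed = Counter(zip(students, courses))

print(per_student["Vera"])          # grades recorded for Vera
print(crossed[("Vera", "Math")])    # Vera's entries in Math
print(crossed[("John", "Research")])  # pairs that never occur count as 0
```

A Counter returns 0 for a missing key, which corresponds to the zero cells of the contingency table.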

Factors
When we have a character vector that holds categorical data, and a large amount of it, it is
advantageous to store it in a compressed fashion. For example, we can transform the character vector
courses into a factor with the following command:

> courses.factors <- factor(courses)

This command results in the following:

> courses.factors
[1] Math        Math        Math        Research    Research 2
[6] Research    Research 2  Computation Computation
Levels: Computation Math Research Research 2

Printing the factor also outputs its levels. These levels are the unique values of the transformed
variable.
The following function lists the levels of a factor:
> levels(courses.factors)
[1] "Computation" "Math" "Research" "Research 2"

Data frames
Another interesting data structure available in R is the data frame. Data frames can be viewed as tables
whose columns are vectors, possibly of different types. For example, if we wish to transform our previous
vectors students, courses, and grades into a data frame, we would use the function data.frame() like this:

> my.dataframe <- data.frame(student=students, course=courses, grade=grades)
> my.dataframe
  student      course grade
1    John        Math    13
2    Mike        Math    13
3    Vera        Math    14
4  Sophie    Research    16
5    Anna  Research 2    16
6    Vera    Research    13
7    Vera  Research 2    17
8    Mike Computation    10
9    Anna Computation    14

With the data.frame() function we can, therefore, create a dataframe with the names of each column and
the respective values which in this case were our previously created vectors.

How to Edit
There is another way to create or edit a data frame: the function edit(). We can write something like the
following:
> my.dataframe <- edit(my.dataframe)

After this command is entered, a window opens in the RStudio IDE, as in Figure 22. The new window
allows editing the content of the data frame. Because edit() returns the edited copy rather than changing
the original object, the result is assigned back to my.dataframe. The reader can also start a new data
frame this way. If we wanted an empty data frame we could do:

> my.empty.dataframe <- data.frame()
> my.empty.dataframe <- edit(my.empty.dataframe)

A new window would appear, this time different from Figure 22. In this new window, an empty table
with no values or named variables is available for us to write values in its cells. As we write the name
of each variable, RStudio asks for the type of the variable we wish to input. The options are numeric or
character. When we finish entering character variables, R transforms them into factors.

Figure 22 Example of an RStudio’s data frame edit window

Indexing
There are several ways of reaching a value inside a data frame. As an example, imagine we wanted to
list all students in the data frame. We could do it with either of the following commands:

> my.dataframe$student
[1] John Mike Vera Sophie Anna Vera Vera Mike Anna
Levels: Anna John Mike Sophie Vera

> my.dataframe[,1]
[1] John Mike Vera Sophie Anna Vera Vera Mike Anna
Levels: Anna John Mike Sophie Vera

In the first example, as we know the column name, we used the name of our data frame, the symbol $,
and the name of the column to retrieve the entire column. If we did not know the name of the column,
we could write the second command, which is the basis of data frame indexing. Inside the brackets, the
element before the comma indicates the selected rows of the data frame. As we can see, it is empty,
which means we are selecting every row. After the comma, the value 1 indicates we wish to output the
column with index 1. Figure 23 illustrates this explanation. The Levels line in the output appears
because, in the R version used here, data.frame() stores character columns as factors; since R 4.0.0,
character columns remain character vectors unless stringsAsFactors = TRUE is passed, and no Levels
line is printed.

[Figure 23 content: the command my.dataframe[ ,1], with the position before the comma labeled "Selected Row(s)" and the position after the comma labeled "Selected Column(s)"]

Figure 23 Schema of data frames indexing example

Indexing can become even more powerful in R. As the reader might have realized, our last commands
retrieve vectors. If we wish to retrieve a particular element of these vectors, we can add another index
inside brackets like this:
> my.dataframe$student[1]
[1] John
Levels: Anna John Mike Sophie Vera

> my.dataframe[,1][1]
[1] John
Levels: Anna John Mike Sophie Vera

The previous commands will give us the first element of the obtained vectors.

Filters
As we did with vectors, we can use R’s powerful filtering features to extract the results we need from
our data frame. Consider the following examples:

- Are there grades above 14?
> my.dataframe$grade > 14
[1] FALSE FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE

- Which students have grades above 14?
> my.dataframe$student[my.dataframe$grade > 14]
[1] Sophie Anna Vera
Levels: Anna John Mike Sophie Vera

The first command outputs TRUE or FALSE for each row, answering whether that grade is above 14.
The second command gives us the students who had those grades.

Indexing and filtering can also be used to edit a data frame. As an example, imagine we want to change
Vera’s Math grade from 14 to 16. The following commands would be appropriate:

> my.dataframe
  student      course grade
1    John        Math    13
2    Mike        Math    13
3    Vera        Math    14
4  Sophie    Research    16
5    Anna  Research 2    16
6    Vera    Research    13
7    Vera  Research 2    17
8    Mike Computation    10
9    Anna Computation    14

> my.dataframe[3,3] <- 16
> my.dataframe
  student      course grade
1    John        Math    13
2    Mike        Math    13
3    Vera        Math    16
4  Sophie    Research    16
5    Anna  Research 2    16
6    Vera    Research    13
7    Vera  Research 2    17
8    Mike Computation    10
9    Anna Computation    14

Or we could use the following command if we do not know the column or row indexes:

> my.dataframe$grade[my.dataframe$student=="Vera" & my.dataframe$course == "Math"] <- 16

If we prefer not to write these commands, remember the edit() function we talked about before.
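
The idea of editing by condition rather than by position can also be sketched in plain Python. This is an illustration under assumptions: the list rows is a hypothetical, shortened stand-in for my.dataframe, with each row stored as a dictionary.

```python
# Hypothetical, shortened stand-in for my.dataframe: one dictionary per row
rows = [
    {"student": "John", "course": "Math", "grade": 13},
    {"student": "Mike", "course": "Math", "grade": 13},
    {"student": "Vera", "course": "Math", "grade": 14},
    {"student": "Vera", "course": "Research", "grade": 13},
]

# Change the grade only where both conditions hold, without knowing the row index
for row in rows:
    if row["student"] == "Vera" and row["course"] == "Math":
        row["grade"] = 16

print(rows[2])
```

Only the row that matches both the student and the course is changed; Vera's Research grade stays as it was.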

Useful Functions
There are some useful functions we can use with our data frames. Consider the following list:
• nrow – Gives the data frame’s number of rows
• ncol – Gives the data frame’s number of columns
• colnames – Gives the data frame’s column names
• rownames – Gives the data frame’s row names
• mode – Gives the storage mode of the object (a data frame is stored as a list)
• class – Generic function that can be used to check whether the object we are working with is a data frame
• summary – Gives a summary of each column of the data frame

Examples of outputs with the previous functions are:

> nrow(my.dataframe)
[1] 9

> ncol(my.dataframe)
[1] 3

> colnames(my.dataframe)
[1] "student" "course" "grade"

> rownames(my.dataframe)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9"

> mode(my.dataframe)
[1] "list"

> class(my.dataframe)
[1] "data.frame"

> summary(my.dataframe)
   student          course      grade
 Anna  :2   Computation:2   Min.   :10.00
 John  :1   Math       :3   1st Qu.:13.00
 Mike  :2   Research   :2   Median :14.00
 Sophie:1   Research 2 :2   Mean   :14.22
 Vera  :3                   3rd Qu.:16.00
                            Max.   :17.00

Matrices
Matrices are different from data frames in R. They can only store elements of the same type, usually
numeric. They are useful for storing two-dimensional data and can be seen as vectors with two
dimensions. The function matrix() creates a matrix. We use the following code to do this:

> my.matrix <- matrix(c(12,13,14,10,12,15,16,12), nrow=2, ncol=4)
> my.matrix
     [,1] [,2] [,3] [,4]
[1,]   12   14   12   16
[2,]   13   10   15   12

The first input is the values we wish the matrix to have, the second input is the number of rows, and the
third input is the number of columns. Note that R fills the matrix column by column, which is why 12
and 13 form the first column.

Nevertheless, there is an easier way to enter the data of a matrix: the function data.entry(). The
following commands illustrate it:
> my.matrix <- matrix(,2,4)
> data.entry(my.matrix)

With these commands, a new window opens. In the cells of this window, we can input the values for
our 2x4 matrix. Figure 24 shows this window.

Figure 24 Example of an RStudio’s matrix edit window

Matrix Indexing
Matrices are indexed in the same way as data frames, with two indexes inside the brackets: the row and
the column. Consider the following examples:

> my.matrix[1,]
[1] 12 14 12 16

> my.matrix[1,4]
[1] 16

> my.matrix[,4]
[1] 16 12

The first example gives the first row of the matrix. The second example gives the value in the first
row and fourth column. The third example gives the fourth column.

Row and Column Names

Similar to data frames, we can name rows and columns with the functions rownames() and colnames().
Please check the following examples:

> rownames(my.matrix) <- c("Vera","Mike")
> colnames(my.matrix) <- c("W1","W2","W3","W4")
> my.matrix
     W1 W2 W3 W4
Vera 12 14 12 16
Mike 13 10 15 12

Then, we can use the names we chose to retrieve values in the matrix. For example:

- What was Vera’s grade in work 4?
> my.matrix["Vera","W4"]
[1] 16

Importing and Exporting Data with R


There are several possible ways to import data with R. We will explain one of them, reading CSV
(comma-separated values) files, but others are also possible, like reading data from a database or an
Internet URL. Later in this chapter, we will also explain how to export data to Excel.

Read CSV files

We can read the data from a CSV file by using the function read.csv(). However, before opening a file
with this function, we should set the working directory of R. In RStudio, this is done through the
Session menu. Figure 25 shows where the reader should click.

Figure 25 Setting of the working directory of R (RStudio)

After clicking Choose Directory, the user can select the directory where the CSV file is. For example,
consider a test.csv file with the following content:

student,course,grade
John,Math,13
Mike,Math,13
Vera,Math,14
Sophie,Research,16
Anna,Research 2,16
Vera,Research,13
Vera,Research 2,17
Mike,Computation,10
Anna,Computation,14

With the following command, we read the CSV file (test.csv) into a data frame named csv.file:

> csv.file <- read.csv("test.csv")
> csv.file
  student      course grade
1    John        Math    13
2    Mike        Math    13
3    Vera        Math    14
4  Sophie    Research    16
5    Anna  Research 2    16
6    Vera    Research    13
7    Vera  Research 2    17
8    Mike Computation    10
9    Anna Computation    14
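
For comparison, reading the same content in Python can be sketched with the csv module from the standard library. This is a sketch under assumptions: the text is held in the string csv_text and wrapped in io.StringIO so that the example runs without a file on disk; with a real file, open("test.csv", newline="") would take the place of the StringIO object.

```python
import csv
import io

# Same content as test.csv, kept in a string so the sketch is self-contained
csv_text = """student,course,grade
John,Math,13
Mike,Math,13
Vera,Math,14
Sophie,Research,16
Anna,Research 2,16
Vera,Research,13
Vera,Research 2,17
Mike,Computation,10
Anna,Computation,14
"""

# DictReader uses the first line as column names and yields one dict per row
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Every value is read as text, so the grade column is converted explicitly
for row in rows:
    row["grade"] = int(row["grade"])

print(len(rows))
print(rows[0])
```

Unlike read.csv(), the csv module does not guess column types, which is why the grades are converted to integers by hand.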

Export to Excel
First, install the xlsx package. With this package, the reader can write to Excel files. Figure 26 shows
this installation:

Figure 26 Installing xlsx package in RStudio

After installing the package, the reader just has to load it with the function library(). The following
code writes the data frame to an Excel file named my_excel_file.xlsx:

> library(xlsx) #load the package
> write.xlsx(x = my.dataframe, file = "my_excel_file.xlsx", sheetName = "Sheet 1", row.names = FALSE)

With the function write.xlsx(), a new xlsx file appears in the reader’s working directory. This file now
contains our familiar students’ grades data. The reader can check the file by opening it with Excel; the
result is shown in Figure 27.

Figure 27 Example of an xlsx file opened in Excel

The reader might have noticed that we used a new function, library(), for the first time. This function
has one input, the name of the package we wish to load before using its available functions. The
function we used from this package was write.xlsx().

PYTHON
Python is a general-purpose programming language that can be used in almost any application the
reader might want. The key features of Python are very similar to those of R:
• Powerful
• Stable
• Free
• Programmable
• Open Source Software

On the downside, Python might not initially suit everyone for data analysis tasks, because it requires
the user to select specific packages carefully. The reader will have to choose those that are appropriate
for his/her purposes. We address this in this chapter to make the reader’s life easier.
There are several Python distributions nowadays. Distributions differ according to the area in which the
language is used, and each typically includes the libraries needed for certain tasks.
First, the reader will need to install the Anaconda Python distribution for his/her operating system
(OS). Anaconda is available for Mac, Windows, and Linux.

Installing Anaconda
Anaconda is a distribution that bundles a set of libraries aimed at the Data Analysis, Statistics, and
Machine Learning areas, among others. It includes several libraries we will need further in this book.
The reader should follow the installation procedures on the Anaconda website³, presented in Figure 28.

Figure 28 Overview of the site to download Anaconda Python’s distribution

Python’s Spyder IDE


Following Anaconda’s installation, the reader should look for the Spyder IDE, which comes with the
Anaconda package. This IDE provides efficient ways of working with Python and will be of great help in
the tasks ahead in this book. Thus, we avoid using Python’s GUI and entering one command at a time,
as we initially had to do with the R GUI and its console. The reader will immediately notice three
windows on the screen, as shown in Figure 29. The left window is where the commands should be
written. In Figure 29, we inserted code similar to the code used earlier with the R console.
The console appears in the lower right window. This is where the results of the commands appear. The
upper right window shows the environment objects. The reader might be asking how to run the code.
There are several choices, listed in the Run menu (see Figure 30), and they behave differently. The
reader can execute one line at a time or only the parts of the code selected with the mouse. There is
also the option to run all the code at once, among other options.

³ https://www.continuum.io/downloads

Figure 29 Spyder screen

Figure 30 Options to run the code menu

Finally, in the upper right window, several tabs support different aspects of working with Python.
Here, the reader has access to Python’s manual and can also search the guides by keyword. This is an
interesting feature that allows the programmer to learn more about the functions of each module.

Importing Packages
48

If the reader has read the R part of this chapter, he/she might have noticed that we used the library()
function to load packages. Python is similar: we use the import keyword to load a library,
which makes all of its functions available afterwards. For example, the reader might want to inspect
the following example:

import math as math


x=math.pow(3,2)

In this example, we used the import keyword to load the math module/library. We also used the keyword as to
give the module a name of our choice (here, the same name, math). After this, when the reader wants to call
any function of the module, he/she prefixes it with that name, as in the previous example, where we used the
pow() function from the math module.
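
The common forms of the import statement can be sketched as follows. This is only an illustration of the idea described above; the alias m is an arbitrary name chosen for this example and does not appear elsewhere in the chapter.

```python
# Three ways to reach functions of the standard math module.
import math                 # plain import: call functions as math.<name>
import math as m            # alias: "m" is an arbitrary short name chosen here
from math import sqrt       # import a single function directly into the session

x = math.pow(3, 2)          # 3 raised to the power 2, returned as a float
y = m.pow(3, 2)             # same function, reached through the alias
root = sqrt(x)              # no module prefix needed after "from ... import"

print(x, y, root)
```

A short alias is mostly a matter of convenience: it saves typing when the module name is long, which is why numpy and pandas are usually imported as np and pd.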

Save a Variable
The reader might already have noticed that we use the “equal” symbol (=) to assign a value or expression to a
variable. In the previous example, we assigned the result of the expression math.pow(3,2) to the variable x.

Use a Variable in instructions


Please check the following command:
#call to sqrt function
math.sqrt(x)

The reader might find this very similar to R language. To calculate the square root of x, we used the x
variable previously set, with the previous code.

List Variables in Session


In the Spyder IDE, the upper right window lists all the variables in the current session. Please check Figure
30 and note that x is the variable listed after we have run the previous commands in this
chapter.

Delete Variables
By using the features of the Spyder IDE, the reader can delete any variable stored in memory, as shown in
Figure 31. Right-clicking a variable in the variable explorer brings up a variety of
options. Among others, the reader can select the option to remove the variable from memory.

Figure 31 Delete variables in memory (Spyder)
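
The same listing and deletion of variables can also be done from code, without the variable explorer. The following is a minimal sketch; the variable named temporary is hypothetical and exists only for this example.

```python
# Listing and deleting variables from code instead of the variable explorer.
x = 9.0
temporary = "scratch value"   # hypothetical variable, used only in this example

# dir() with no arguments returns the names defined in the current scope;
# names starting with an underscore are internal and are skipped here.
user_names = [name for name in dir() if not name.startswith("_")]
print(user_names)

del temporary                 # remove the variable from memory
still_defined = "temporary" in dir()
print(still_defined)
```

After del, any later use of the deleted name raises a NameError, just as if the variable had never been created.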

Arrays
The module array defines an object type, which can compactly represent an array of basic values:
characters, integers, floating point numbers. Arrays are sequence types and behave very much like lists,
except that the type of objects stored in them is constrained.
To declare an array in Python we can use the following code:

import array as array


my_array = array.array('i',(range(1,11)))

Typing the name of the array in the console then displays its contents:

my_array
Out[10]: array('i', [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

If the reader needs to perform an operation on the array, Python can apply it to every position of
the array with a list comprehension. Imagine we wanted to add 2 to all elements in the array. Then, we would do:

[x+2 for x in my_array]

The result of this operation would be the following list:


[x+2 for x in my_array]
Out[11]: [3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

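Additionally, if we wish to add two vectors, we can do it in a variety of ways. Our favorite is to use the
numpy module and its add() function, like this:

import numpy as np
# my_array2 is not defined earlier in the text; here we assume a second
# integer array with the same length as my_array
my_array2 = array.array('i', range(11, 21))
new_array = np.add(my_array, my_array2)
new_array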

The reader might have noticed that the same function can also add 2 to every element of the array, as we
did previously. With this new function we would do:

import numpy as np
new_array2 = np.add(my_array, 2)
new_array2

As expected, the result of the operation holds the same values as the previous operation with my_array:
Out[18]: array([ 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], dtype=int32)
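
The difference between the two approaches can be sketched as follows. The example assumes numpy is installed; the variable names shifted, doubled and same_shifted are chosen for this illustration only. Note that the + operator applied directly to two array.array objects concatenates them, as it does with lists, which is one reason to prefer numpy for element-wise arithmetic.

```python
import array
import numpy as np

my_array = array.array('i', range(1, 11))

# np.add accepts any array-like input and returns a NumPy ndarray.
shifted = np.add(my_array, 2)           # adds 2 to every element
doubled = np.add(my_array, my_array)    # element-wise sum of two equal-length arrays

# Once the data is a NumPy array, the + operator does the same element-wise work.
as_numpy = np.array(my_array)
same_shifted = as_numpy + 2

print(shifted.tolist())
print(doubled.tolist())
```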

Type
The type is specified at object creation time by passing a type code, which is a
single character, to the array function. There are several possible type codes. The reader should check the
array module documentation for the full list of values accepted by this parameter.
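
As a brief illustration, the sketch below uses two of the type codes defined by the array module, 'i' for signed integers and 'd' for double-precision floating point numbers, and shows that the declared type is enforced. The variable names are chosen for this example only.

```python
import array

ints = array.array('i', [1, 2, 3])        # 'i': signed integers
floats = array.array('d', [1.5, 2.5])     # 'd': double-precision floats

print(ints.typecode, floats.typecode)

# The type is enforced: appending a float to an integer array fails.
try:
    ints.append(2.5)
    rejected = False
except TypeError:
    rejected = True
print(rejected)
```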

Length
Sometimes it is convenient to know the length of an array. This can be obtained with the function
len(). For example:

len(my_array)
Out[18]: 10

Indexes
Indexing of arrays in Python is very similar to vector indexing in R. Nonetheless, keep in mind
that indexes in R start at 1, whereas in Python the indexes of structures start at 0. Therefore, to extract the
fourth position of our previously created array we would do:

my_array[3]
Out[19]: 4

Remember, due to the differences with indexation, if we were to retrieve the first position of our Python
array, we would do:
my_array[0]
Out[20]: 1

Consider the following instructions, which create a sequence of strings (strictly speaking, a Python list) and
select part of it:

char_array = ['String1','String2','String3']
char_array
char_array[0:2]

We wish to output the first two elements in the array with the last command. The output would be:
char_array[0:2]
Out[35]: ['String1', 'String2']

Please keep in mind that Python slices differently from R. The reader might have noticed that, with
the previous command, we selected positions 0 and 1 of the array even though we
wrote char_array[0:2], i.e., from position 0 to position 2, excluding this last position.
If we needed the values in the first and third positions, we could add a step of 2 to the slice with the
following command:

char_array[0::2]

The output of this command would be:


char_array[0::2]
Out[38]: ['String1', 'String3']
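
The general form of a slice is start:stop:step, where the stop position is excluded and any of the three parts may be omitted. The following sketch gathers the cases discussed above and adds two further ones (negative indexes and an omitted stop) for illustration; the variable names are chosen for this example.

```python
char_array = ['String1', 'String2', 'String3']

first_two = char_array[0:2]      # positions 0 and 1; position 2 is excluded
every_other = char_array[0::2]   # start at 0, go to the end, step of 2
last_item = char_array[-1]       # negative indexes count from the end
from_second = char_array[1:]     # an omitted stop means "until the end"

print(first_two, every_other, last_item, from_second)
```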

Functions
The great thing about libraries or packages is that they generally come with a set of functions that
provide predefined operations. In simple words, functions have inputs, and with those inputs, some
internal procedures take place to give the output the user desires. Have a look at the following
Python pseudo-code for the definition of a function:

def functionname( parameters ):
    #instructions inside the function
    return [expression]

An example of a function declaration would be:

def add(x,y):
    return x+y

This is the declaration of a function in Python. Please keep in mind that Python requires the code to be
indented consistently: the body of the function is indented after the “:” character. The name of
the function we programmed is add(). If we look closely, we can see that this function has two
inputs, x and y, and that its internal instruction is to add these two inputs. An example of the use of this
function would be:
add(x=2,y=2)
Out[17]: 4

Evidently, in this example, we wish to add two plus two which are respectively the inputs x and y of the
function. The result we obtain in this example is equal to 4.
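
Functions can be called with arguments given by name or by position. The sketch below illustrates both; the default value for y and the docstring are additions made for this example and are not part of the function declared above.

```python
def add(x, y=0):
    """Return the sum of x and y; y defaults to 0 (a default added for this sketch)."""
    return x + y

by_keyword = add(x=2, y=2)    # arguments passed by name, as in the text
by_position = add(2, 2)       # arguments passed by position
with_default = add(5)         # y falls back to its default value

print(by_keyword, by_position, with_default)
```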

Useful Functions
There are several functions related to data analysis and statistics that we will use throughout this book.
The difference from R is that the majority of those functions come in packages dedicated to data and
numeric analysis, statistics, and other areas. We will explain more of those functions throughout this book and
its data analysis tasks.

Dataframes

How to Create
Imagine we had the following lists of students, courses and grades already created in Python:

students = ["John","Mike","Vera","Sophie","Anna","Vera","Vera","Mike","Anna"]
courses = ["Math","Math","Math","Research","Research 2","Research","Research 2",
           "Computation","Computation"]
grades = [13,13,14,16,16,13,17,10,14]

We wish to create a data frame with these values. Therefore, we write the following commands:

import pandas as pd
my_grades_dataframe = pd.concat(
    [pd.DataFrame(students, columns=['student']),
     pd.DataFrame(courses, columns=['course']),
     pd.DataFrame(grades, columns=['grade'])],
    axis=1)

The previous command first transforms each of the lists into a one-column data frame and then concatenates
them side by side (axis=1), using the functions available in Python’s pandas module.
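
An alternative way to build the same data frame, shown here only as a sketch and assuming pandas is installed, is to pass a dictionary to the DataFrame constructor. Each key becomes a column name and each list becomes the content of that column.

```python
import pandas as pd

students = ["John", "Mike", "Vera", "Sophie", "Anna", "Vera", "Vera", "Mike", "Anna"]
courses = ["Math", "Math", "Math", "Research", "Research 2", "Research", "Research 2",
           "Computation", "Computation"]
grades = [13, 13, 14, 16, 16, 13, 17, 10, 14]

# Each dictionary key becomes a column name and each list becomes that column.
my_grades_dataframe = pd.DataFrame({
    'student': students,
    'course': courses,
    'grade': grades,
})

print(my_grades_dataframe.shape)
```

Both approaches give a data frame with nine rows and three columns; the dictionary form is simply shorter when all columns are already available as lists of equal length.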

How to Edit
By using the features of the Spyder IDE, the reader can easily edit a data frame after its creation. In
the variable explorer in the upper right window, we can right-click the data frame we wish to
edit, as shown in Figure 32:

Figure 32 Editing a data frame in Spyder

After clicking edit the window of Figure 33 appears:



Figure 33 Edit data frame window (Spyder)

As the reader might expect, this window is well suited to editing data frames. By selecting
a cell in the table, the reader can change its value and hit the OK button. The data frame will then be stored
in its new version, according to the changes the reader made.

Indexing
There are several possible ways of reaching a value inside a data frame structure. As an example, imagine
we wanted to list all students in the data frame. We could do it by writing down one of the following
commands:
my_grades_dataframe['student']
Out[66]:
0 John
1 Mike
2 Vera
3 Sophie
4 Anna
5 Vera
6 Vera
7 Mike
8 Anna
Name: student, dtype: object

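# select the first column by position (older pandas versions also accepted my_grades_dataframe[[0]])
my_grades_dataframe.iloc[:, [0]]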
Out[68]:
student
0 John
1 Mike
2 Vera
3 Sophie
4 Anna
5 Vera
6 Vera
7 Mike
8 Anna

In the first example, as we know the column name, we wrote it inside brackets after the name of our data
frame to retrieve the entire column. If we did not know the name of the
column, we could write the second command, which selects the column by its position and is the basis of the
indexing of data frame columns.

If we wish to know the value of a particular cell in the data frame, we can use the iloc
indexer, which selects by position. (Older pandas versions offered the ix indexer for this purpose; it was
deprecated in pandas 0.20 and removed in pandas 1.0.) For example, to retrieve the data frame’s value in the
third row and third column we would write this command:

my_grades_dataframe.iloc[2,2]

The output would be Vera’s Math grade, which is 14:

my_grades_dataframe.iloc[2,2]
Out[69]: 14
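
pandas offers both position-based and label-based access to a cell. The following is a minimal sketch, assuming a recent pandas version; the small data frame named df holds only the first three rows of our example and is used for illustration.

```python
import pandas as pd

# A shortened version of the grades data frame, used only in this sketch.
df = pd.DataFrame({
    'student': ["John", "Mike", "Vera"],
    'course': ["Math", "Math", "Math"],
    'grade': [13, 13, 14],
})

by_position = df.iloc[2, 2]       # third row, third column (positions start at 0)
by_label = df.loc[2, 'grade']     # row label 2, column named 'grade'
scalar = df.at[2, 'grade']        # .at is a scalar-only variant of .loc

print(by_position, by_label, scalar)
```

Selecting by label is usually easier to read, while selecting by position is convenient when the column names are not known in advance.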

Filters
Python’s pandas module has powerful filtering features to extract the results we need from our data
frame. Consider the following example:

- Which students have grades above 14, and in which courses?

#select the grades > 14


my_grades_dataframe[my_grades_dataframe['grade']>14]
Out[71]:
   student      course  grade
3   Sophie    Research     16
4     Anna  Research 2     16
6     Vera  Research 2     17
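
Filters can also be combined. The sketch below, which assumes pandas is installed and rebuilds the grades data frame under the hypothetical name df, uses the & operator to require two conditions at once and the isin() method to match any value from a list. Each condition must be wrapped in parentheses when combined with &.

```python
import pandas as pd

df = pd.DataFrame({
    'student': ["John", "Mike", "Vera", "Sophie", "Anna", "Vera", "Vera", "Mike", "Anna"],
    'course': ["Math", "Math", "Math", "Research", "Research 2", "Research",
               "Research 2", "Computation", "Computation"],
    'grade': [13, 13, 14, 16, 16, 13, 17, 10, 14],
})

high = df[df['grade'] > 14]                                      # one condition
vera_high = df[(df['student'] == 'Vera') & (df['grade'] > 14)]   # both conditions must hold
research = df[df['course'].isin(['Research', 'Research 2'])]     # any value from the list

print(vera_high)
```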

Using the appropriate commands, the reader can also use indexing and filtering to edit a data
frame. As an example, imagine we wish to change Vera’s Math grade from 14 to 16. The following
commands would be appropriate:

# iloc selects by position (the ix indexer of older pandas versions has been removed)
my_grades_dataframe.iloc[2,2] = 16
my_grades_dataframe
Out[72]:
   student       course  grade
0     John         Math     13
1     Mike         Math     13
2     Vera         Math     16
3   Sophie     Research     16
4     Anna   Research 2     16
5     Vera     Research     13
6     Vera   Research 2     17
7     Mike  Computation     10
8     Anna  Computation     14

If writing these commands feels like too much work, please remember that the reader can also edit the
data frame with the Spyder editing feature we talked about before.

Useful Functions
There are some useful functions regarding data frames with Python. For example, the info() function
retrieves, among other information, the number of rows, columns and the memory usage of the data
structure:

my_grades_dataframe.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 3 columns):
student 9 non-null object
course 9 non-null object
grade 9 non-null int64
dtypes: int64(1), object(2)
memory usage: 296.0+ bytes

Pandas data frames also have a describe() method, which is ideal for seeing basic statistics about the
dataset's numeric columns. For example:

my_grades_dataframe.describe()
Out[76]:
grade
count 9.000000
mean 14.222222
std 2.223611
min 10.000000
25% 13.000000
50% 14.000000
75% 16.000000
max 17.000000
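
A few other attributes and methods are often useful for a first look at a data frame. The following sketch assumes pandas is installed and rebuilds the edited grades data (with Vera’s Math grade set to 16) under the hypothetical name df; the selection of methods is ours and is not exhaustive.

```python
import pandas as pd

df = pd.DataFrame({
    'student': ["John", "Mike", "Vera", "Sophie", "Anna", "Vera", "Vera", "Mike", "Anna"],
    'course': ["Math", "Math", "Math", "Research", "Research 2", "Research",
               "Research 2", "Computation", "Computation"],
    'grade': [13, 13, 16, 16, 16, 13, 17, 10, 14],
})

n_rows, n_cols = df.shape                 # number of rows and columns
first_rows = df.head(3)                   # first three rows of the data frame
mean_grade = df['grade'].mean()           # mean of a single numeric column
counts = df['student'].value_counts()     # number of rows per student

print(n_rows, n_cols, mean_grade)
print(counts)
```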

Matrices
Matrices are also possible in Python. The reader should use the numpy package to create a
matrix with a simple procedure. Consider the following example:

my_matrix = np.matrix('0 0 0 0; 0 0 0 0')


my_matrix
Out[78]:
matrix([[0, 0, 0, 0],
[0, 0, 0, 0]])

This time, we created a 2x4 matrix of zeros.

Insert Data in a Matrix


Using Spyder’s editing options, as previously stated, we could change the values in the matrix. Please
check Figure 34:

Figure 34 Editing matrices in Spyder

Imagine we wish to change the value in the second row and fourth column; then we would have the
following Figure 35:

Figure 35 Matrix edition window (Spyder)

We have changed the value to 5, as the previous figure shows. Nonetheless, we could also do it by
writing the following command:
my_matrix[1,3] = 5

Matrices Indexes
In the previous example, we used indexes to change the value of a matrix cell. Matrix indexes
work like those of data frames: they start at 0 and are two-dimensional. For example, consider
the following commands:

my_matrix[0,]
Out[83]: matrix([[0, 0, 0, 0]])

my_matrix[0,3]
Out[84]: 0

my_matrix[:,3]
Out[94]:
matrix([[0],
[5]])

The first example would give the first row of the matrix. The second example gives the value of the first
row and fourth column. The third example will give the reader all the values of the fourth column.
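
NumPy’s documentation no longer recommends the matrix class and suggests regular arrays instead. As a sketch of that alternative, the same 2x4 structure can be built with np.zeros and indexed in the same way; the main visible difference is that selecting a row or a column returns a one-dimensional array rather than a matrix. The variable names are chosen for this example.

```python
import numpy as np

my_matrix = np.zeros((2, 4), dtype=int)   # 2 rows, 4 columns, filled with zeros
my_matrix[1, 3] = 5                       # second row, fourth column

first_row = my_matrix[0, :]        # all columns of the first row
cell = my_matrix[0, 3]             # first row, fourth column
fourth_column = my_matrix[:, 3]    # all rows of the fourth column

print(first_row, cell, fourth_column)
```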

Importing and Exporting Data with Python

Read CSV files


Reading data from CSV files is also a great feature of Python. We can obtain a data frame from a CSV file.
For example, consider a file named test.csv with the following content:

student,course,grade
John,Math,13
Mike,Math,13
Vera,Math,14
Sophie,Research,16
Anna,Research 2,16
Vera,Research,13
Vera,Research 2,17
Mike,Computation,10
Anna,Computation,14

First, before reading the previous data from a file, it is necessary to change the working directory to the
directory where our test.csv file is. To do this, please check Figure 36. We can browse a working
directory in the folder icon in the upper right corner of the Spyder IDE.

Figure 36 Changing the working directory (Spyder)

Then, with the following code, it is possible to import the data to the data frame:
import pandas as pd
my_dataframe = pd.read_csv('test.csv')
my_dataframe

Out[26]:
   student       course  grade
0     John         Math     13
1     Mike         Math     13
2     Vera         Math     14
3   Sophie     Research     16
4     Anna   Research 2     16
5     Vera     Research     13
6     Vera   Research 2     17
7     Mike  Computation     10
8     Anna  Computation     14
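
To experiment with read_csv without creating a file or changing the working directory, the CSV content can be supplied as text. The following sketch assumes pandas is installed and uses only the first three data rows of test.csv; the variable csv_text is introduced for this example.

```python
import io
import pandas as pd

# The first lines of test.csv, written here as a string for illustration.
csv_text = """student,course,grade
John,Math,13
Mike,Math,13
Vera,Math,14
"""

# io.StringIO wraps the text in a file-like object, which read_csv accepts
# in place of a file name.
my_dataframe = pd.read_csv(io.StringIO(csv_text))

print(my_dataframe)
print(my_dataframe.dtypes)
```

The first line of the content is taken as the header, and the grade column is read as integers without any extra instruction.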

Export to Excel
Python’s pandas package has a great function for this task. The function to_excel provides a
way to store data frames in Excel files. The following command:

import pandas as pd
my_dataframe.to_excel('my_excel_file_python.xlsx', sheet_name='Sheet1')

will produce an Excel file named 'my_excel_file_python.xlsx'. The result is represented in Figure
37:

Figure 37 Excel file output (Python)

Connecting to other languages



Python’s versatility as a general-purpose language allows the use of other languages within its programming
instructions. One of these languages is R. Further in this book we will use this feature of Python
to execute and exemplify some statistical tasks. The rpy2 module delivers just what is expected from a
connection with another language, specifically the R language.

To proceed with the installation of this package, some installation stages are necessary, and the reader
should also install R on his/her computer. Then, the reader should download the package for his/her OS.
For Windows, the packages are available on a website 4. The selected .whl file
(rpy2-2.8.1-cp35-cp35m-win_amd64.whl) was appropriate for the installed Python version and the 64-bit
Windows version.

Then, within Anaconda’s console, the following command was entered:

pip install rpy2-2.8.1-cp35-cp35m-win_amd64.whl

Figure 38 illustrates the input of the previous command and the successful installation of the package
rpy2 in its version 2.8.1.

Figure 38 Installing rpy2 module package in Python

Following the installation procedure, the usual importation of the new module is now possible. For
example, to call the new module in a piece of code, the programmer would write:
import rpy2 as rpy2
from rpy2.robjects.packages import importr
import rpy2.robjects as ro

CONCLUSIONS

This chapter gives the reader an introduction to, and context for, the programming tasks of this book. With
this chapter, we attempt to introduce the reader to simple programming tasks and to the
use of features that will be applied elsewhere in this book. Although the book is organized with
increasing complexity of materials, the reader will encounter an eminently practical book with examples
throughout.
Additionally, in this chapter, we provided a brief summary of the syntax of the languages we are focusing on.
We introduced the reader to their consoles, GUIs and IDEs, for both R and Python. We stress that, in this
chapter, we covered only a small part of the existing material regarding both programming languages.
Nonetheless, we believe that it is possible for the reader to gather information from other sources, and we
tried to point to them in this chapter. Consulting manuals and other information as we go is a
necessary procedure when learning programming languages.
We will further explore both languages, and the reader will gain a broader view of programming and
statistics by the end of the following chapters.

4
http://www.lfd.uci.edu/~gohlke/pythonlibs/#rpy2

The key concepts presented in this chapter include the programming of:
• Vectors
• Dataframes
• Matrices
• Functions
and the installation of:
• R
• RStudio
• Anaconda Python’s Distribution

Additionally, the reader learned basic operation concepts with the IDEs of both languages, RStudio for R and
Spyder for Python.
