
PW1 2


Analyzing large data sets with the R package

Objective: to acquire skills in statistical data processing and graphics using the R
programming language.
Objectives: to get acquainted with the interface and functionality of R, to learn how to
process, analyze, and visualize various types of data using the R package.

Methodological recommendations

1. Familiarize yourself with the R program and RStudio.


R is a programming language for statistical processing, data analysis, and graphics, as well as a free
and open-source computing environment as part of the GNU project. R supports a wide range of statistical
and numerical methods and powerful additional functional and analytical capabilities (built-in package
library). Packages are libraries for specific functions or special applications. R comes with a basic set
of packages. As of 2019, there are more than 12,000 packages available.
R is widely used in the social sciences, statistics, economics, insurance, sociology, finance,
physics, etc.
R is available for all operating systems, including Linux, Mac OS, and Windows.
R is a matrix, object-oriented programming language. This means that, in theory, anything can
be saved as an R object. Each object has its own class that describes what this object contains and what
each function can do with this data. For example, plot(x) produces one result if x is a regression and
another if it is a vector.
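A minimal sketch of this class-dependent behaviour. The data are made up for illustration, and summary() is used instead of plot() so that the result can be read in the console; plot() selects its behaviour from the class of its argument in the same way.

```r
# Hypothetical data, used only for illustration
v <- c(1, 2, 3, 4, 5)
w <- c(2.1, 3.9, 6.2, 7.8, 10.1)

fit <- lm(w ~ v)   # a regression object

class(v)           # "numeric"
class(fit)         # "lm"

# The same generic function gives different results for different classes
summary(v)         # minimum, quartiles, median, mean, maximum of the vector
summary(fit)       # residuals, coefficient table, R-squared of the model
```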
The R package can be downloaded and installed absolutely free of charge. To do this, visit
the CRAN website (https://cran.r-project.org) and download the installation package. After running it,
you need to select the appropriate installation parameters (language, package components, and other
settings). After the program starts, a dialog box opens - the console window, where all commands and
the results of their execution are displayed (Fig. 165).
Figure 165. The dialog box (console) for starting the R package

To make it easier for the user to work with the R package, an interface program called RStudio
was created. Working with R and with RStudio is almost identical; however, RStudio provides
many convenient tools and has a more pleasant and understandable interface. To use it, you
must first install the R package and then download the RStudio installer from:
https://www.rstudio.com/products/rstudio/download/#download
After launching the program, a window consisting of 4 blocks will open (Fig. 166).
The RStudio environment consists of windows:
1) script (for writing code and preparing it for launch);
2) console (displays all commands and the results of their execution);
3) Workspace (Environment) - to display all objects and their descriptions. The History tab
shows the history of commands entered by the user;
4) Files (displays the working directory - folder), Plots (for visualizing all charts),
Packages (shows all installed library packages), Help (for reference information).
Figure 166. The RStudio startup dialog box

For example, to use the R package as a regular calculator, we can write mathematical
expressions in the script window and press the Run button or use the Ctrl + Enter key combination
(Figure 167).

Figure 167. Performing simple calculations in RStudio

After executing the code, the calculation result appeared in the console window, and the
History tab simultaneously displayed all the commands that the user entered.
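A few expressions of the kind that can be typed in the script window and run line by line. The particular numbers are arbitrary and serve only as an illustration.

```r
2 + 3 * 4      # 14: multiplication is performed before addition
(2 + 3) * 4    # 20: parentheses change the order of operations
2^10           # 1024: exponentiation
sqrt(144)      # 12: square root
10 %/% 3       # 3: integer division
10 %% 3        # 1: remainder of the division
```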
All functions are entered in the same way.
If you don't know what kind of function you are working with, you should call up the help.
To do this, in the script window, enter a question mark before the function name and click the Run
button. All the necessary reference information will be displayed in the Help window (tab) (Figure
168).

Figure 168. Launching help in RStudio

The user can create variables. To do this, enter the variable name in the console
window, then the assignment sign (<-) and the value assigned to it. For example, y <- 100
(the form y = 100 gives the same result) (Fig. 169).

Figure 169. Creating a variable in RStudio


The Environment window displays the result of creating the variable. You can also perform
calculations (y*2) and create expressions (x = y^2) (Figure 170).

Figure 170. Simple calculations in RStudio
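The steps described above can be sketched as follows. The variable names y and x follow the text; the comments show the values each line produces.

```r
y <- 100   # create a variable; it appears in the Environment window
y * 2      # 200: the result is printed but not stored
x <- y^2   # store the result of an expression in a new variable
x          # 10000
```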

In R, every command is a function call, and a function can itself be passed as an argument to another function.


Functions can be easily combined.
The assignment symbol is "<-". You can also use the alternative "=". That is, the following
two expressions are equivalent:
> a <- 2
> a = 2
Arguments are given in parentheses.
Usually, it is better to use quotation marks for names, but it is not always
necessary.
"#" is used for comments.
Commands are separated by a semicolon ";" or a carriage return character. If you want to place
more than one expression on one line, you must use the ";" separator.
R is case sensitive: "a" and "A" are two different objects, so all functions and arguments
must be entered exactly as they are defined, including the case of each letter.
Traditionally, the dot character "." has been preferred to the underscore character "_" in names,
although both are allowed. An object name cannot begin with an underscore.
There are special values in R:
NA: Not Available (a missing value);
NaN: Not a Number, for example, the indeterminate form 0/0;
Inf: Infinity;
-Inf: Minus infinity.
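A short illustration of how these special values arise and how they can be detected. The vector v is a made-up example.

```r
0 / 0                   # NaN
1 / 0                   # Inf
-1 / 0                  # -Inf

v <- c(1, NA, 3)        # a vector with one missing value
is.na(v)                # FALSE TRUE FALSE
mean(v)                 # NA: a missing value propagates into the result
mean(v, na.rm = TRUE)   # 2: missing values are removed before averaging

is.nan(0 / 0)           # TRUE
is.finite(1 / 0)        # FALSE
```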
You can exit R using q(). The argument "no", as in q("no"), means that the session is not to be saved.
The R package works with the following data types: logical, integer, real, complex,
character, and list.
The analytical capabilities of the R package are due to the ability to work with various objects,
in particular:
1. Vectors are the most basic of R objects, and can contain only one type of data. You can
create a vector by using the c() function, which combines several elements of the same type. You can
also create a sequence using the : operator or the seq() function. For example, 1:5 creates a vector
of the numbers from 1 to 5. The seq() function allows you to specify the interval between numbers.
You can repeat a pattern using the rep() function. You can also create a numeric vector of a given
length filled with zeros by using the numeric() function, a character vector of empty strings with
character(), or a logical vector of FALSE values with logical().
2. Factors are similar to vectors, but with a defined set of levels. A factor represents a nominal
or ordinal scale. It is used, for example, to represent Y in classification models. The factor() function
transforms a vector into a factor. A factor can also be made ordered using the option ordered = TRUE or
the function ordered().
3. Matrices are similar to vectors, but have two dimensions (rows and columns). A matrix is a
two-dimensional set of elements of the same type (table). One way to create a matrix is to use
the matrix() function. You pass it a vector and the number of rows or columns, and you can tell R how
to fill the matrix with the data (by default, by columns).
The cbind() and rbind() functions combine vectors into a matrix by column or by row. The
dimensions of the matrix can be obtained using the dim() function. Alternatively, the nrow() and ncol()
functions return the number of rows and columns, respectively.
4. Arrays are similar to matrices, but can have more than two dimensions. An array is a
multidimensional set of elements of the same type. An array must be rectangular in all
dimensions. The elements that make up the array must be of the same type, but not necessarily
of numeric type.
5. A list is a vector of R objects, that is, a collection of arbitrary R objects. The list() function
creates a list; unlist() transforms a list into a vector. Lists are convenient for storing either
data of the same kind that correspond to different iterations (for example, many models) or
heterogeneous data that are semantically connected (for example, different statistical characteristics
of a single model).
6. A data.frame is a two-dimensional data set (table). It is similar to a matrix, but does not
require all columns to be identical in type: columns in a data.frame can contain data of different
types, although there can be only one data type within each column. This is because a data.frame
is a list of vectors (columns) of the same length. Therefore, functions intended for lists can also
be applied to a data.frame.
7. Formulas are a special form of expressing relationships between variables in an equation.
Formulas are used when building models to specify the functional relationship between
variables. The dot symbol "." on the right-hand side of a formula stands for all other available variables.
8. A class is a data type for a variable, and a variable of this type is an object (an instance
of the class). Classes are attached to objects as attributes. All objects in R have their own class, type,
and dimension.
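The object types listed above can be sketched in a few lines each. All names and values below are hypothetical and chosen only to keep the example short.

```r
# 1. Vectors
v1 <- c(1.5, 2.5, 3.5)              # combine values of one type
v2 <- 1:5                           # 1 2 3 4 5
v3 <- seq(0, 1, by = 0.25)          # 0.00 0.25 0.50 0.75 1.00
v4 <- rep(c("a", "b"), times = 2)   # "a" "b" "a" "b"
v5 <- numeric(3)                    # 0 0 0

# 2. Factors
f <- factor(c("low", "high", "low"),
            levels = c("low", "high"), ordered = TRUE)
levels(f)                           # "low" "high"

# 3. Matrices
m <- matrix(1:6, nrow = 2)          # filled by columns
dim(m)                              # 2 3
rbind(1:3, 4:6)                     # a 2 x 3 matrix built row by row

# 4. Arrays
a <- array(1:24, dim = c(2, 3, 4))  # three dimensions

# 5. Lists
l <- list(id = 1L, label = "model A", values = v1)
unlist(list(1, 2, 3))               # 1 2 3

# 6. Data frames
d <- data.frame(name = c("x", "y"), value = c(10, 20),
                stringsAsFactors = FALSE)
sapply(d, class)                    # name: "character", value: "numeric"

# 7. Formulas
fm <- value ~ name
class(fm)                           # "formula"
```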
All objects support naming the elements they contain. Names are given as character strings
enclosed in double (") or single (') quotes.
Similarly to vectors, matrices and data.frames have such properties as rownames and colnames,
which allow you to change the names of rows and columns.
To delete names, you can assign the special value NULL.
Converting data types is done through a group of functions whose names begin with as.
(for example, as.numeric() and as.character()).
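A brief sketch of naming and type conversion. The objects v and m are hypothetical examples.

```r
v <- c(a = 1, b = 2, c = 3)    # a named vector
names(v)                       # "a" "b" "c"

m <- matrix(1:4, nrow = 2)
rownames(m) <- c("r1", "r2")
colnames(m) <- c("c1", "c2")
m["r2", "c1"]                  # 2: access an element by its row and column names

names(v) <- NULL               # delete the names of the vector
names(v)                       # NULL

as.numeric("3.14")             # 3.14
as.character(42)               # "42"
as.integer(TRUE)               # 1
```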
Indexing in R is an effective and powerful tool for working with data.
Indexes can be numeric, boolean, and textual.
Three operators are used for indexing:
[ - selects elements of a vector/list/array, etc.;
$ - selects one element from a data.frame/list by its name;
[[ - selects a single element from a vector/list/array, etc., and discards names if they exist.
The indexing features allow you to change the position of items and duplicate them. To delete
elements by index value, a minus sign is added before them. To add a new element (column/row in the
data.frame) to the list, a new name or numeric index is used.
If you access an index that does not exist in a vector, the special value NA is returned.
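The indexing rules above can be tried on small objects. The vector v, the data.frame d, and the list l are hypothetical examples.

```r
v <- c(a = 10, b = 20, c = 30)
v[2]                       # numeric index: element b (the name is kept)
v[c(TRUE, FALSE, TRUE)]    # logical index: elements a and c
v["c"]                     # character index
v[[2]]                     # 20: the name is discarded
v[-1]                      # all elements except the first
v[c(3, 1, 1)]              # change the order and duplicate elements
v[5]                       # NA: there is no fifth element

d <- data.frame(id = 1:3, score = c(70, 85, 90))
d$score                    # one column selected by name
d$passed <- d$score >= 75  # a new name adds a new column
d[-2, ]                    # the table without its second row

l <- list(x = 1, y = "two")
l[["y"]]                   # "two"
l$z <- 3                   # a new name adds a new list element
```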

2. Working in RStudio: graphical features.


Before you start working in RStudio, you need to import data. To enter data, in addition to the
manual mode, you can use the Import Dataset button in the Environment window and specify what type of
file to import and where from. We remind you that the file name must be in English and consist of one
word (Fig. 171).

Figure 171. Importing data into RStudio


The result is a data table containing information about the age and salary of the
bank's customers (Figure 172).

Figure 172. Input data for analysis in RStudio

If you want to name the columns or change their names, go to the Untitled1 script window and
run the command:
colnames(Salary) <- c('Age_years', 'Wage_th_UAH')
As a result, we get new names for the table columns (Figure 173).

Figure 173. Changing the names of table columns in RStudio
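The renaming step can be reproduced without the original file. The data.frame below is a hypothetical stand-in for the imported Salary table; its values and the initial column names V1 and V2 are made up.

```r
# Hypothetical stand-in for the imported table (values are made up)
Salary <- data.frame(V1 = c(25, 40, 58), V2 = c(4.5, 7.2, 9.8))

colnames(Salary)    # "V1" "V2"
colnames(Salary) <- c("Age_years", "Wage_th_UAH")
colnames(Salary)    # "Age_years" "Wage_th_UAH"

head(Salary)        # the table with its new column names
```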


The R package (RStudio) has powerful visualization tools. Simple charts can be built with the
built-in plot() function. For more advanced graphics in R, you can install the ggplot2 package,
that is, run the install.packages("ggplot2") command or find the package on the Packages tab and
click the Install button.
In ggplot2, any type of chart is the result of the interaction of a number of elements:
1) the data set (data);
2) the mapping of variables to visual properties (aesthetics, aes);
3) geometric object (geom);
4) statistical transformation (stat);
5) coordinate system (coord);
6) guide;
7) panels (facet);
8) artistic design (theme).
Visualize the original information space. To do this, use the command:
plot(Salary$Age_years,Salary$Wage_th_UAH) (Fig. 174).

Figure 174. Building a chart in RStudio

The user can also change the parameters of the scatter plot: the point type, color, and
line thickness. To do this, run the command:
plot(Salary$Age_years,Salary$Wage_th_UAH,col='red',lwd=2) (Fig. 175).
Figure 175. Editing a graph in RStudio

To change the axis labels and give the scatter plot a title, run the following command:
plot(Salary$Age_years, Salary$Wage_th_UAH, col = 'red', lwd = 2,
     xlab = 'Age_of_Bank_Customers', ylab = 'Salary_thousand_hryvnias',
     main = 'Dependence_of_income_on_age') (Fig. 176).

Figure 176. Finished visualization in RStudio

The user can also change the type of chart. For example, let's turn a scatter plot into a bar chart.
To do this, in a ggplot2 call you need to change the geometric object, for example:
ggplot(Salary, aes(x = Age_years)) + geom_bar() + coord_polar().
So, the source data has been imported and visualized.

3. Statistical analysis of data in the R package.

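First, let's check the original information space for a normal distribution. R has four built-in
functions for working with the normal distribution (density, distribution function, quantile
function, and random number generation, respectively):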
dnorm(x, mean, sd);
pnorm(x, mean, sd);
qnorm(p, mean, sd);
rnorm(n, mean, sd).
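A short sketch of what each of the four functions returns. The first three calls use the standard normal distribution; the sample size and the parameters in the rnorm() call are arbitrary, and set.seed() is added only to make the random sample reproducible.

```r
dnorm(0)         # density at 0, about 0.3989
pnorm(1.96)      # probability that a value is <= 1.96, about 0.975
qnorm(0.975)     # the 0.975 quantile, about 1.96

set.seed(1)      # fix the random number generator state
r <- rnorm(1000, mean = 50, sd = 10)   # 1000 random values
mean(r)          # close to 50
sd(r)            # close to 10
```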

Here is a description of the parameters used in the previously mentioned functions:
x - a vector of numbers;
p - a vector of probabilities;
n - the number of observations (sample size);
mean - the mean of the distribution. This value is zero by default;
sd - standard deviation. This value defaults to 1.
So, to check the original information space for conformity to the normal distribution law, it is
necessary to determine the general statistical characteristics (minimum, maximum, median, mean,
first and third quartiles) and the standard deviation for each data series. To do
this, use the following commands (Figure 177).

1) summary(Salary)
2) sd(Salary$Age_years)
   sd(Salary$Wage_th_UAH)

Figure 177. Calculation of the characteristics of the distribution center
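The same commands can be tried on a small made-up table. The Salary data.frame below is a hypothetical stand-in, not the data set used in the figures. Base R has no built-in function for the statistical mode, so the last lines show one common idiom for it as an assumption-level illustration.

```r
# Hypothetical stand-in for the Salary table (values are made up)
Salary <- data.frame(
  Age_years   = c(22, 31, 38, 45, 59),
  Wage_th_UAH = c(3.5, 5.0, 6.5, 8.0, 9.5)
)

summary(Salary)          # minimum, quartiles, median, mean, maximum per column
sd(Salary$Age_years)     # standard deviation
var(Salary$Age_years)    # variance, equal to the squared standard deviation

# One possible way to find the most frequent value (mode)
ages <- c(30, 30, 45)
names(which.max(table(ages)))   # "30"
```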

Now we can check the normality of the data distribution. First, for the "Age" distribution
series, we run the following commands and visualize the histogram of the distribution (Figure 178):
dnorm(Salary$Age_years, mean = 38.61, sd = 13.03452)
y <- dnorm(Salary$Age_years, mean = 38.61, sd = 13.03452)
hist(y, col = 'blue', main = "Normal Distribution")
Figure 178. Checking the normality of the distribution for the "Age" series

We apply the same steps to the "Salary" column (Figure 179):

dnorm(Salary$Wage_th_UAH, mean = 6.712, sd = 2.463514)
b <- dnorm(Salary$Wage_th_UAH, mean = 6.712, sd = 2.463514)
hist(b, col = 'red', main = "Normal Distribution")

Figure 179. Checking the normality of the distribution for the "Salary" series
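As a complementary check, which is not part of the procedure shown in the figures, the raw values themselves can be tested with the Shapiro-Wilk test from base R. The sketch below uses simulated data as a stand-in for Salary$Age_years; the sample size of 100 and the seed are assumptions made only for illustration.

```r
set.seed(42)   # make the simulated sample reproducible

# Simulated stand-in for Salary$Age_years (not the real data)
age <- rnorm(100, mean = 38.61, sd = 13.03)

res <- shapiro.test(age)   # null hypothesis: the data are normally distributed
res$statistic              # the W statistic; values near 1 are consistent with normality
res$p.value                # a large p-value gives no evidence against normality
```

A quantile-quantile plot built with qqnorm(age) followed by qqline(age) gives a visual version of the same check: points lying close to the line are consistent with a normal distribution.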

The results of the test (see Figures 178 and 179) suggest that the distribution series conform
to the normal distribution law. So, we can proceed to the regression analysis in the R package.
Regression analysis is a widely used statistical tool to establish a model of the relationship
between two variables. One of these variables is called the independent predictor variable, whose value is
collected through experiments. The other variable is called the response variable (dependent
variable), whose value is inferred from the predictor variable.
In a linear regression, these two variables are related through an equation in which the
exponent (power) of both variables is 1. Graphically, a linear relationship
is a straight line. A nonlinear relationship, in which the exponent of any variable is not equal to
1, creates a curve.
The general mathematical equation for linear regression:

y = ax + b,
where y is the response variable (dependent variable);
x is the predictor variable (independent);
a and b are constants called coefficients.

In order to build a model of the relationship (influence) and establish the dependence between
the age of bank customers and their salary level, we use the linear regression function lm () in the R
package. Syntax of the function: lm(formula,data).
The independent predictor variable is age, while the dependent variable is the salary of the
bank's clients. Therefore, we assign the age variable the value "x", and the salary variable the value
"y". To do this, perform the following steps:
x <- Salary$Age_years
x
y <- Salary$Wage_th_UAH
y
As a result, we get the corresponding entry in the console window (Figure 180).

Figure 180. Assigning values of x and y to variables

Let's determine the correlation coefficient, which will characterize the relationship
between the variables. To do this, execute the command cor.test(x=Salary$Age_years,
y=Salary$Wage_th_UAH) (Fig. 181).
Figure 181. Calculation of the correlation coefficient
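The structure of the cor.test() result can be explored on a small made-up sample. The vectors x and y below are hypothetical and do not reproduce the values shown in the figure.

```r
# Hypothetical data (not the Salary table)
x <- c(20, 30, 40, 50, 60)         # ages
y <- c(3.1, 3.9, 5.2, 5.8, 7.0)    # wages, thousand UAH

ct <- cor.test(x, y)   # Pearson correlation by default
ct$estimate            # the correlation coefficient, equal to cor(x, y)
ct$p.value             # significance of the correlation
ct$conf.int            # 95 % confidence interval for the coefficient
```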

The calculation (see Fig. 181) shows that there is a fairly strong direct relationship between
the variables (r = 0.779), i.e., as the age of bank customers increases, their income increases.
Using the source data, we perform a one-factor regression analysis and obtain
the result shown in Fig. 182:
relation <- lm(y~x)
print(relation)

Fig. 182. Results of the regression analysis

The obtained values of the intercept and of the coefficient for the independent variable give
a model that looks like this:

Y = 1.0258 + 0.1473x.
To get more detailed (complete) output, use the
print(summary(relation)) function (Fig. 183).

Figure 183. Detailed results of the regression analysis

The output (see Fig. 183) contains descriptive statistics of the model residuals, as well as
the estimated coefficients (the intercept and the coefficient for the variable x), their standard
errors, t-statistics, and significance levels. The coefficient of determination is
0.607, i.e. about 61 % of the variation in Y is explained by the influence of X. In general, these
results indicate that the model is adequate.
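In a one-factor regression the coefficient of determination equals the squared correlation coefficient, which agrees with the values reported here (0.779 squared is about 0.607). A minimal sketch on hypothetical data shows where these quantities are stored in the fitted model; the vectors x and y are made up and are not the Salary data.

```r
# Hypothetical data (not the Salary table)
x <- c(20, 30, 40, 50, 60)
y <- c(3.1, 3.9, 5.2, 5.8, 7.0)

relation <- lm(y ~ x)
coef(relation)           # intercept 1.12 and slope 0.097 for these data

s <- summary(relation)
s$coefficients           # estimates, standard errors, t values, p-values
s$r.squared              # coefficient of determination, about 0.995 here
all.equal(s$r.squared, cor(x, y)^2)   # TRUE: R-squared equals r squared
```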
Using the resulting regression model, it is possible to make a forecast, namely, by setting the
value of the independent variable, we can determine how the dependent variable will change.
To do this, use the predict() function. The syntax of the function: predict(object,
newdata).
So, let's find the predicted value of the salary for the bank's clients if their ages are 19 and 65.
To do this, create a new data.frame (Figure 184):
new_Age_years <- data.frame(x = c(19,65))
new_Age_years
Figure 184. Creating a new data.frame

We make the forecast using the predict(relation, new_Age_years) command (Fig. 185).

Fig. 185. Results of forecasting


In this way, the forecast values were obtained for the specified conditions. That is, if the age of
the bank's customers is 19 and 65, their approximate income will be UAH 3,823.7 and UAH
10,597.62, respectively.
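The forecast can be checked by hand from the model Y = 1.0258 + 0.1473x. Because the printed coefficients are rounded, the manual result differs slightly from the predict() output quoted above. The second part of the sketch repeats the predict() call on hypothetical data (the vectors x and y are made up) to show that the column name in the new data.frame must match the predictor name used in the formula.

```r
# Manual forecast from the rounded coefficients reported in the text
b0 <- 1.0258              # intercept
b1 <- 0.1473              # coefficient for x
b0 + b1 * c(19, 65)       # 3.8245 and 10.6003 thousand UAH

# The same workflow on hypothetical data
x <- c(20, 30, 40, 50, 60)
y <- c(3.1, 3.9, 5.2, 5.8, 7.0)
relation <- lm(y ~ x)

new_Age_years <- data.frame(x = c(19, 65))   # the column must be named x
predict(relation, new_Age_years)             # 2.963 and 7.425 for these data
```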

Objectives:
1. Using the R package (RStudio):
1.1. Create a symbolic (text) vector containing the names of the disciplines studied in the
current semester.
1.2. Create a list that will contain information about the days of the week (the list should
contain three elements: the numbers of the days (1 - 7), the names of the days of the week, and
separately defined days when students have classes).
1.3. Create a vector that will display the comparison of the ECTS grading scale, which is
commonly used to assess the quality of student achievement, and the point scale that corresponds to
it.
1.4. Create a data.frame that will display the largest cities in Ukraine with a population of more
than 500,000 people as of 2019.
1.5. Find a solution:
2. Using statistical data and the graphical capabilities of the R package, build 3 different
types of charts and be able to edit them.
Write up the results in the form of a laboratory report.

Tasks for independent study

Questions for students' independent work:


1. Describe the analytical (statistical) capabilities of the R package.
2. Describe the graphical script editors and IDEs of the R package.
3. Provide a comparative description of the functionality of the R package and other
specialized software products of the EMB.
4. What R add-ons and libraries do you know?
5. What are the advantages of the R package in the work of a business analyst?
Task 1. There is a data.frame (structure):
df <- data.frame(var1 = c(11, 21, 31), var2 = c(12, 22, 32), var3 = c(13, 23, 33),
                 var4 = c(14, 24, 34), row.names = c("case1", "case2", "case3"))

df
      var1 var2 var3 var4
case1   11   12   13   14
case2   21   22   23   24
case3   31   32   33   34

Select the values var1, var2, var3 for case1.
Select the values of all variables for case2 that are greater than 22.
Select the variable names for columns 1 and 3.
Add a column named Y with values -1, 0, 1.
Delete the row case2.
Raise the values of the second column to the third power.

Task 2. Using your own information space and the analytical capabilities of the R
package, check the initial data for a normal distribution law, build a single-factor (multifactor)
regression model and a forecast.
