PW1 2
Objective: to acquire skills in statistical data processing and graphics using the R
programming language.
Objectives: to get acquainted with the interface and functionality of R, to learn how to
process, analyze, and visualize various types of data using the R package.
Methodological recommendations
To make R easier to work with, an interface program called RStudio was created. Working in R and
in RStudio is almost identical, but RStudio provides many convenient tools and has a more pleasant
and understandable interface. To use it, you must first install R itself and then download the
RStudio installer:
https://www.rstudio.com/products/rstudio/download/#download.
After launching the program, a window consisting of 4 blocks will open (Fig. 166).
The RStudio environment consists of windows:
1) script (for writing code and preparing it for launch);
2) console (displays all commands and the results of their execution);
3) Workspace (Environment) - to display all objects and their descriptions. The History tab
shows the history of commands entered by the user;
4) Files (displays the working directory - folder), Plots (for visualizing all charts),
Packages (shows all installed library packages), Help (for reference information).
Figure 166. The RStudio startup dialog box
For example, to use R as an ordinary calculator, type arithmetic expressions in the script window
and press the Run button or use the Ctrl + Enter key combination (Figure 167).
After the code is executed, the result of the calculation appears in the console window, and the
History tab shows all the commands that the user has entered.
Function calls are entered and run in the same way.
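As a minimal sketch of this calculator-style use, the lines below can be typed in the script window and run one at a time. The expressions and variable names are illustrative assumptions and are not the ones shown in Figure 167.

```r
# Illustrative arithmetic; these expressions are assumptions, not those in Figure 167
sum_result  <- 2 + 3 * 4   # multiplication is evaluated before addition
root_result <- sqrt(144)   # square root via a built-in function
pow_result  <- 2^10        # exponentiation

# Print each result to the console
print(sum_result)
print(root_result)
print(pow_result)
```

Storing each result in a name makes it visible in the Environment tab as well as in the console.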
If you don't know what kind of function you are working with, you should call up the help.
To do this, in the script window, enter a question mark before the function name and click the Run
button. All the necessary reference information will be displayed in the Help window (tab) (Figure
168).
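A short sketch of calling up help, using the built-in sd() function as an arbitrary example (the choice of function is an assumption made for illustration):

```r
# Two equivalent ways to open the help page for the built-in sd() function
?sd
help("sd")

# args() returns the argument list of a function; print() shows it in the console
print(args(sd))
```

In RStudio the first two calls open the page in the Help tab; args() is a quick way to recall the arguments without leaving the console.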
The user can create variables. To do this, enter the variable name in the console window, then
the assignment operator (<-) and the value to be assigned. For example, y <- 100
(Fig. 169).
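A minimal sketch of creating and inspecting variables; the second variable, z, is a hypothetical addition for illustration:

```r
# Assign values with the <- operator
y <- 100
z <- y / 4   # a variable can be computed from another one (z is illustrative)

# Show the stored values
print(y)
print(z)

# List the objects currently defined in the workspace
print(ls())
```

The same objects appear in the Environment tab of RStudio after the lines are run.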
If you want to name the columns or change them, go to the Untitled*1 script window and
run the command:
colnames(Salary) <- c('Age_years', 'Wage_th_UAH')
As a result, we get new names for the table columns (Figure 173).
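The renaming step can be tried on a small stand-in table. The data frame below is a hypothetical substitute for Salary: its values and its initial column names V1 and V2 are invented for illustration.

```r
# Hypothetical stand-in for the Salary table (values are invented)
Salary <- data.frame(V1 = c(23, 35, 47, 58),
                     V2 = c(4.1, 6.3, 8.0, 9.6))

# Rename both columns in one assignment
colnames(Salary) <- c('Age_years', 'Wage_th_UAH')

# Check the new column names
print(colnames(Salary))
```

The vector on the right-hand side must contain exactly one name per column.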
The user can also change the appearance of the scatter plot: the point type, color, and line
thickness. To do this, run the command (Fig. 175):
plot(Salary$Age_years, Salary$Wage_th_UAH, col = 'red', lwd = 2)
Figure 175. Editing a graph in RStudio
To change the axis labels and add a title to the scatter plot, run the following command (Fig. 176):
plot(Salary$Age_years, Salary$Wage_th_UAH, col = 'red', lwd = 2,
     xlab = 'Age_of_Bank_Customers', ylab = 'Salary_thousand_hryvnias',
     main = 'Dependence_of_income_on_age')
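The effect of the individual plot() arguments can be explored on a small stand-in table. The data below are invented for illustration, and the pch argument (point symbol) is an addition that does not appear in the commands of this manual.

```r
# Hypothetical stand-in for the Salary table (values are invented)
Salary <- data.frame(Age_years   = c(21, 28, 34, 40, 47, 53, 60, 65),
                     Wage_th_UAH = c(3.9, 5.4, 6.1, 7.2, 7.8, 9.1, 9.9, 10.4))

plot(Salary$Age_years, Salary$Wage_th_UAH,
     pch  = 1,                                  # point symbol: open circle
     col  = 'red',                              # point color
     lwd  = 2,                                  # stroke width of the symbol outline
     xlab = 'Age of bank customers',            # x-axis label
     ylab = 'Salary, thousand UAH',             # y-axis label
     main = 'Dependence of income on age')      # plot title
```

Changing one argument at a time and rerunning the call is a simple way to see what each of them controls.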
The user can also change the type of chart. For example, let's turn the scatter plot into a bar
chart. To do this, change the geometric object in the ggplot2 plotting call to geom_bar()
(adding coord_polar() draws the bars in polar coordinates):
... aes(xlab = 'Age of bank customers' = cyl)) + geom_bar() + coord_polar().
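geom_bar() and coord_polar() belong to the ggplot2 package. As an alternative that needs no extra packages, a bar chart can also be drawn with the base-graphics function barplot(). The sketch below uses that swapped-in technique, not the ggplot2 call from the text, and the data are invented for illustration.

```r
# Hypothetical stand-in for the Salary table (values are invented)
Salary <- data.frame(Age_years   = c(21, 28, 34, 40, 47, 53, 60, 65),
                     Wage_th_UAH = c(3.9, 5.4, 6.1, 7.2, 7.8, 9.1, 9.9, 10.4))

# One bar per customer: bar height is the salary, bar label is the age
mids <- barplot(Salary$Wage_th_UAH,
                names.arg = Salary$Age_years,
                col  = 'red',
                xlab = 'Age of bank customers',
                ylab = 'Salary, thousand UAH',
                main = 'Salary by age')

# barplot() returns the x positions of the bar centers
print(mids)
```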
So, the source data has been downloaded and visualized.
3. Statistical analysis of data in the R package.
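First, let's check whether the source data follow a normal distribution. R has four built-in
functions for working with the normal distribution:
dnorm(x, mean, sd) - the probability density at x;
pnorm(x, mean, sd) - the cumulative probability up to x;
qnorm(p, mean, sd) - the quantile for probability p;
rnorm(n, mean, sd) - n random values from the distribution.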
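A minimal sketch of the four functions on the standard normal distribution (mean 0, standard deviation 1). The argument values and variable names are chosen only for illustration.

```r
# Standard normal distribution: mean = 0, sd = 1
d <- dnorm(0, mean = 0, sd = 1)       # density at x = 0
p <- pnorm(1.96, mean = 0, sd = 1)    # probability that a value is <= 1.96
q <- qnorm(0.975, mean = 0, sd = 1)   # value below which 97.5 % of the distribution lies

set.seed(1)                           # make the random draw reproducible
r <- rnorm(5, mean = 0, sd = 1)       # five random values

print(d)
print(p)
print(q)
print(r)
```

pnorm() and qnorm() are inverses of each other, so q is close to 1.96 and p is close to 0.975.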
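The mean and standard deviation required by these functions are obtained from the descriptive
statistics of the data:
1) summary(Salary)
2) sd(Salary$Age_years)
sd(Salary$Wage_th_UAH)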
Now we can check the normality of the data distribution. First, for the "Age" distribution
series, we run certain commands and visualize the histogram of the distribution (Figure 178):
dnorm(Salary$Age_years, mean = 38.61, sd = 13.03452)
y <- dnorm(Salary$Age_years, mean = 38.61, sd = 13.03452)
hist(y, col = 'blue', main = "Normal Distribution")
Figure 178. Checking the normality of the distribution for the "Age" series
The results of the test (see Figures 178 and 179) suggest that the distribution series conform
to the normal distribution law. So, we can proceed to the regression analysis in the R package.
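The histogram above is drawn from the density values returned by dnorm(). A commonly used complementary check, sketched here as an assumption rather than as the method of this manual, is to plot the histogram of the observations themselves with a fitted normal curve and to run the Shapiro-Wilk test with shapiro.test(). The data are simulated, not the real "Age" series.

```r
# Simulated ages (NOT the real data); the parameters only mimic the "Age" series
set.seed(42)
age <- rnorm(100, mean = 38.61, sd = 13.03)

# Histogram of the observations on the density scale
hist(age, freq = FALSE, col = 'lightblue',
     main = 'Age: histogram with fitted normal curve')

# Overlay the normal density with the sample mean and standard deviation
curve(dnorm(x, mean = mean(age), sd = sd(age)), add = TRUE, lwd = 2)

# Shapiro-Wilk test: a small p-value is evidence against normality
sw <- shapiro.test(age)
print(sw)
```

If the bars follow the curve and the p-value is not small, the data give no evidence against normality.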
Regression analysis is a widely used statistical tool to establish a model of the relationship
between two variables. One of these variables is called the independent predictor variable, whose value is
collected through experiments. The other variable is called the response variable (dependent
variable), whose value is inferred from the predictor variable.
In linear regression, these two variables are related through an equation in which the exponent
(power) of both variables is 1. Graphically, a linear relationship is a straight line. A nonlinear
relationship, in which the exponent of some variable is not equal to 1, produces a curve.
The general mathematical equation for linear regression is:
y = ax + b,
where y is the response (dependent) variable; x is the predictor (independent) variable;
a and b are constants called coefficients (a is the slope and b is the intercept).
To build a model of the relationship (influence) and establish the dependence between the age of
bank customers and their salary level, we use the linear regression function lm() in the R
package. Syntax of the function: lm(formula, data).
The independent predictor variable is age, while the dependent variable is the salary of the
bank's clients. Therefore, we assign age to the variable x and salary to the variable y, and print
both to check them. To do this, perform the following steps:
x <- Salary$Age_years
x
y <- Salary$Wage_th_UAH
y
As a result, we get the corresponding entry in the console window (Figure 180).
Let's determine the correlation coefficient, which will characterize the relationship
between the variables. To do this, execute the command cor.test(x=Salary$Age_years,
y=Salary$Wage_th_UAH) (Fig. 181).
Figure 181. Calculation of the correlation coefficient
The calculation (see Fig. 181) shows a fairly strong direct relationship between the variables
(r = 0.779): as the age of bank customers increases, so does their income.
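The object returned by cor.test() can also be stored and queried directly. The sketch below uses simulated data (an assumption, not the Salary table), so its numbers differ from those in Fig. 181.

```r
# Simulated data (NOT the Salary table): ages 19-65 and a roughly linear salary
set.seed(7)
x <- round(runif(50, min = 19, max = 65))
y <- 1 + 0.15 * x + rnorm(50, sd = 1.5)

# Pearson correlation test (the default method of cor.test)
ct <- cor.test(x = x, y = y)

print(ct$estimate)   # correlation coefficient r
print(ct$p.value)    # p-value for the hypothesis that the true correlation is 0
print(ct$conf.int)   # 95 % confidence interval for r
```

A positive estimate indicates a direct relationship; the p-value shows whether it is statistically distinguishable from zero.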
Using the source data, we perform a one-factor regression analysis and obtain
the result shown in Fig. 182:
relation <- lm(y~x)
print(relation)
The estimated intercept (free term) and the coefficient of the independent variable give the
following model:
Y = 1.0258 + 0.1473x.
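The coefficients can be extracted from the fitted model with coef() and assembled into an equation of the same form. The sketch uses simulated data (an assumption), so the printed coefficients differ from 1.0258 and 0.1473.

```r
# Simulated data (NOT the Salary table)
set.seed(7)
x <- round(runif(50, min = 19, max = 65))
y <- 1 + 0.15 * x + rnorm(50, sd = 1.5)

# Fit the one-factor linear model y = intercept + slope * x
relation <- lm(y ~ x)

# Extract the two coefficients by name
coefs     <- coef(relation)
intercept <- coefs[["(Intercept)"]]
slope     <- coefs[["x"]]

# Print the model in the same form as in the text
cat(sprintf("Y = %.4f + %.4fx\n", intercept, slope))
```

The slope is the estimated change in Y for a one-unit increase in x.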
To get more detailed (complete) output, use the print(summary(relation)) command (Fig. 183).
The output (see Fig. 183) contains descriptive statistics of the model residuals, as well as the
estimates of the intercept and of the coefficient of X with their standard errors, t-statistics,
and significance levels. The coefficient of determination is 0.607, i.e. 61 % of the variation in Y
is explained by X. Overall, these results indicate that the model is adequate.
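In a one-factor linear regression the coefficient of determination equals the squared correlation coefficient, which agrees with the values in the text: 0.779^2 is approximately 0.607. The sketch below checks this identity on simulated data (an assumption, not the Salary table).

```r
# Simulated data (NOT the Salary table)
set.seed(7)
x <- round(runif(50, min = 19, max = 65))
y <- 1 + 0.15 * x + rnorm(50, sd = 1.5)

relation <- lm(y ~ x)
s <- summary(relation)

r2 <- s$r.squared           # coefficient of determination
print(r2)
print(cor(x, y)^2)          # squared correlation coefficient: the same number

# Coefficient table: estimate, standard error, t value, p-value
print(s$coefficients)

# The values reported in the text are consistent with each other
print(round(0.779^2, 3))
```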
Using the resulting regression model, it is possible to make a forecast, namely, by setting the
value of the independent variable, we can determine how the dependent variable will change.
To do this, use the predict() function. The syntax of the function: predict(object, newdata).
So, let's find the predicted salary for bank clients aged 19 and 65.
To do this, create a new data.frame (Figure 184):
new_Age_years <- data.frame(x = c(19, 65))
new_Age_years
Figure 184. Creating a new data.frame
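A sketch of the forecasting step with predict(). The model is fitted on simulated data (an assumption), so the predict() output differs from the real results. The last lines repeat the calculation by hand with the coefficients reported in the text, Y = 1.0258 + 0.1473x.

```r
# Simulated data (NOT the Salary table)
set.seed(7)
x <- round(runif(50, min = 19, max = 65))
y <- 1 + 0.15 * x + rnorm(50, sd = 1.5)
relation <- lm(y ~ x)

# The column must be named x, because the model was fitted as y ~ x
new_Age_years <- data.frame(x = c(19, 65))

# Predicted salaries for ages 19 and 65 from the simulated model
forecast <- predict(relation, newdata = new_Age_years)
print(forecast)

# Manual forecast with the coefficients reported in the text
manual <- 1.0258 + 0.1473 * c(19, 65)
print(manual)
```

If the column in newdata has a different name from the predictor in the formula, predict() cannot match it to the model, so the name x is required here.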
Tasks:
1. Using the R package (RStudio):
1.1. Create a symbolic (text) vector containing the names of the disciplines studied in the
current semester.
1.2. Create a list that will contain information about the days of the week (the list should
contain three elements: the day numbers (1 - 7), the names of the days of the week, and separately
the days on which students have classes).
1.3. Create a vector that will display the comparison of the ECTS grading scale, which is
commonly used to assess the quality of student achievement, and the point scale that corresponds to
it.
1.4. Create a data.frame that will display the largest cities in Ukraine with a population of more
than 500,000 people as of 2019.
1.5. Find a solution:
2. Using statistical data and the graphical capabilities of the R package, build 3 different
types of charts and be able to edit them.
Write up the results in the form of a laboratory report.
df
      var1 var2 var3 var4
case1   11   12   13   14
case2   21   22   23   24
case3   31   32   33   34
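One possible way to build a data frame that prints like the table above is sketched below; the manual does not say how df was created, so the construction is an assumption.

```r
# Build the table column by column and label the rows case1..case3
df <- data.frame(var1 = c(11, 21, 31),
                 var2 = c(12, 22, 32),
                 var3 = c(13, 23, 33),
                 var4 = c(14, 24, 34),
                 row.names = c('case1', 'case2', 'case3'))

print(df)

# Access a single cell by row and column name
print(df['case2', 'var3'])
```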
Task 2. Using your own dataset and the analytical capabilities of the R package, check whether
the source data follow a normal distribution, and build a single-factor (or multifactor)
regression model and a forecast.