R Poisson
Contents
Introduction
Loading and Transforming Data
Data Description
Estimating Regressions
Results Interpretation and Prediction
Marginal Effect
R-squared
Introduction
In this example, I build a Poisson regression model that predicts the number of banking outlets
that a given commercial bank has in a given region. I also test the hypothesis that state-owned
banks have statistically significantly more outlets per region, on the assumption that the
managers of these banks target revenue more aggressively than profit.
The original file “outlets.csv” contains 2017 data on banks from bank.gov.ua and macroeconomic
indicators from ukrstat.gov.ua. It has 30 columns and 87 rows (a header row plus 86 data rows):
1. Name is the bank id. The original bank names were in Cyrillic, which R does not always handle
well, so I replaced them with integer ids. There are 83 banks (ids 1 to 83) in the sample. The
row with Name id = 84 is the summary (Total) row, the row with id = 85 holds the Population,
and the last row, with id = 86, holds the regional GDP.
2. Group shows whether the bank is private, foreign or state-owned (for ids 1 to 83; ids 84, 85
and 86 are the Total, Population and GDP rows for the regions).
3. Total is the total number of outlets in Ukraine for each bank.
4. Then come 25 columns named after regions, e.g. vin is Vinnytsya, vol is Volyn and so on. Each
cell under a region name (in rows where Name id is less than or equal to 83) is the number of
outlets that the bank with that Name id has in that region. For example, cell E4 says that the
bank with id = 3 has 5 outlets in the Volyn region.
5. The last two columns AC and AD are Assets and Profit of the banks.
Loading and Transforming Data
First, load the packages used in this example:
library(sandwich)
library(reshape2)
library(MASS)
* If any of them is not installed yet, run install.packages("library_name") or see the detailed
description in the “R Installation” tutorial.
Then specify the folder that contains the data file “outlets.csv” and where the results will be
stored (use double backslashes in Windows paths):
folder = "Q:\\NaUKMA\\2020\\Eco\\R\\Poisson_Banking_Outlets"
setwd(folder)
setwd sets this folder as the working directory, so we no longer need to specify the full path
to the file. For example, we can read the csv file and check its first and last rows as
follows:
dat = read.csv("outlets.csv")
head(dat)
tail(dat, 5)
* If you cannot read the file, check the separator: it may be a comma or something else, e.g. a
semicolon. You can open the csv file in Notepad to see which character separates the columns.
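The separator issue in the note above can be illustrated with a small sketch. The temporary file and its three rows are hypothetical and only stand in for a semicolon-separated version of the data; sep is a standard argument of read.csv.

```r
# Hypothetical example: write a small semicolon-separated file, then read it back
tmp = tempfile(fileext = ".csv")
writeLines(c("Name;Group;Total", "1;state;10", "2;private;4"), tmp)

wrong = read.csv(tmp)             # default sep = ",": the whole line lands in one column
right = read.csv(tmp, sep = ";")  # correct separator: three columns

ncol(wrong)
ncol(right)
```

If the file also uses a decimal comma, read.csv2 (which defaults to sep = ";" and dec = ",") may be the more convenient choice.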
Now dat variable is a data.frame (note that it is not a data.table – we are not using data.table package in
this example) with the same column names as in the csv file.
str(dat)
You should see that the columns are either integer (int), numeric (num) or character (chr).
For the regression, we want the data in the so-called long form, where one observation represents
one bank in one specific region. Right now, the data are in the so-called wide form, where one
observation (one row) holds all regions of a bank. We can transform from wide to long form
using the melt function from the reshape2 package.
First, via subset(dat, select = names(dat)[c(1, 4:28)]) we select the first column (Name) and
columns 4 to 28, i.e. the region columns, omitting Group, Total, Assets and Profit. Second, we
apply head(…, -3) to this output, which drops the last 3 rows (Total, Population and GDP).
Finally, melt transforms this block into long format, stored as outlets. There we can check, for
example, that the fifth bank has one outlet in the Dnipro region, since
outlets[outlets$variable == "dni" & outlets$Name == 5,] returns a row with value equal to 1.
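The three steps described above can be sketched on a toy table. This is an illustration, not the original code: the toy dat has only three banks and three regions with hypothetical values, so the region columns are 4:6 instead of 4:28, and id.vars = "Name" is my assumption about how melt was called.

```r
library(reshape2)

# Toy stand-in for outlets.csv (all values hypothetical): 3 banks, 3 regions,
# followed by the Total, Population and GDP rows
dat = data.frame(
  Name   = 1:6,
  Group  = c("private", "foreign", "state", "", "", ""),
  Total  = c(6, 9, 30, 45, NA, NA),
  vin    = c(1, 2, 10, 13, 1.6, 0.09),
  vol    = c(0, 5, 8, 13, 1.0, 0.06),
  dni    = c(5, 2, 12, 19, 3.2, 0.35),
  Assets = c(20, 80, 300, NA, NA, NA),
  Profit = c(1, -3, 5, NA, NA, NA)
)

# Keep Name plus the region columns (4:6 here; 4:28 in the real file),
# drop the last three rows, then melt to one row per bank and region
wide = head(subset(dat, select = names(dat)[c(1, 4:6)]), -3)
outlets = melt(wide, id.vars = "Name")

# Check one cell: bank 2 in the vol region
outlets[outlets$variable == "vol" & outlets$Name == 2, ]
```

melt names the new columns variable (the region code) and value (the number of outlets), which is why these two names appear in the rest of the example.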
The next step is to merge the banks’ assets and profit into this long form, matching rows by Name.
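A minimal sketch of this merge, assuming it is done with base merge on the Name column. The two toy tables and the name bank_info are hypothetical; with the real data, the bank-level table could be taken as subset(head(dat, -3), select = c("Name", "Assets", "Profit")).

```r
# Toy long-form data and bank-level data (values hypothetical)
outlets = data.frame(
  Name     = rep(1:3, times = 2),
  variable = rep(c("vin", "vol"), each = 3),
  value    = c(1, 2, 10, 0, 5, 8)
)
bank_info = data.frame(Name = 1:3, Assets = c(20, 80, 300), Profit = c(1, -3, 5))

# Every (bank, region) row receives the Assets and Profit of its bank
outlets = merge(outlets, bank_info, by = "Name")
outlets
```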
Finally, we need to add population and GDP. First, we select these two rows and transpose them.
Then we convert the result to a data.frame, rename its columns, create a new column from the row
names (the region names) and, in the end, remove the row names, so that the result can be merged
with the long form by region.
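The sequence of steps for population and GDP might look as follows. The toy dat and the object name macro are hypothetical, and I assume the two rows are the last two rows of the table, as described in the introduction.

```r
# Toy wide data: the last two rows hold Population and GDP per region (hypothetical)
dat = data.frame(
  Name = c(1, 2, 85, 86),
  vin  = c(1, 2, 1.6, 0.09),
  vol  = c(0, 5, 1.0, 0.06),
  dni  = c(5, 2, 3.2, 0.35)
)

macro = t(tail(dat, 2)[, -1])           # select the two rows and transpose: regions become rows
macro = as.data.frame(macro)            # matrix -> data.frame
names(macro) = c("population", "gdp")   # rename columns
macro$variable = rownames(macro)        # region code as a regular column
rownames(macro) = NULL                  # remove row names
macro
```

Because the region column is called variable, as in the melted table, the result can then be attached with merge(outlets, macro, by = "variable").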
Next, we introduce dummy variables for state and foreign ownership, omitting the private one
to avoid perfect multicollinearity. After that, we keep only the columns needed for the regression.
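One way to create the two dummies is sketched below. The Group labels "private", "foreign" and "state" are my assumption about how the column is coded (check unique(dat$Group) in the real file), and Group is assumed to have been merged into outlets beforehand.

```r
# Group labels here are hypothetical; the real file may code them differently
outlets = data.frame(Group = c("private", "foreign", "state", "private"))

outlets$foreign = ifelse(outlets$Group == "foreign", 1, 0)
outlets$state   = ifelse(outlets$Group == "state", 1, 0)

# Private banks have foreign = 0 and state = 0: they are the base category
outlets
```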
outlets = subset(outlets, select = c("value", "Assets", "Profit", "population", "gdp", "foreign", "state"))
head(outlets)
Data Description
Here you can run different functions that we have discussed during the course, e.g.:
summary(outlets)
hist(outlets$value, breaks = c(0, 1, 5, 10, 20, 30, 40, 50, 100, max(outlets$value)), freq = TRUE)
This draws a histogram of the number of outlets (you can add axis labels and a title).
Estimating Regressions
Let’s now estimate the Poisson regression using the glm function with the family parameter set to
"poisson". We can check the results via the summary function. However, the standard errors it
reports, and hence the p-values, are not robust to violations of the Poisson assumption that the
variance equals the mean. We therefore recalculate robust standard errors and p-values and
collect them in the table r_est, which we also save to a csv file.
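The estimation and the recalculation could be sketched as follows. This is not the original code: the data are simulated stand-ins (all values hypothetical), the formula assumes that all six regressors kept in outlets enter the model, and the HC0 estimator from vcovHC in the sandwich package is my assumption about what "robust" means here. The name "Robust SE" and the 1.96 multiplier (a 95% interval) are also assumptions.

```r
library(sandwich)

# Simulated stand-in for the outlets data (hypothetical values)
set.seed(1)
n = 200
outlets = data.frame(
  Assets = runif(n, 1, 100), Profit = rnorm(n), population = runif(n, 1, 4),
  gdp = runif(n, 0.05, 0.4), foreign = rbinom(n, 1, 0.3), state = rbinom(n, 1, 0.1)
)
outlets$value = rpois(n, exp(0.2 + 0.01 * outlets$Assets + 0.3 * outlets$population))

# Poisson regression
m1 = glm(value ~ Assets + Profit + population + gdp + foreign + state,
         data = outlets, family = "poisson")

# Heteroskedasticity-robust (HC0) standard errors
robust_se = sqrt(diag(vcovHC(m1, type = "HC0")))

# Coefficients, robust SE, two-sided p-values and 95% confidence limits
r_est = cbind(
  Estimate    = coef(m1),
  "Robust SE" = robust_se,
  "Pr(>|z|)"  = 2 * pnorm(abs(coef(m1) / robust_se), lower.tail = FALSE),
  LL          = coef(m1) - 1.96 * robust_se,
  UL          = coef(m1) + 1.96 * robust_se
)
r_est
```

Comparing the "Robust SE" column with the standard errors shown by summary(m1) gives a quick impression of how much the correction matters.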
write.csv(r_est, "outlets_coefficients.csv")
r_est
Estimate is the column of coefficients, Pr(>|z|) holds the p-values, and the last two columns are
the lower and upper limits of the confidence intervals of the coefficients. The p-values are
interpreted in the usual way.
Results Interpretation and Prediction
Our hypothesis that the state variable has a positive, statistically significant effect on the
number of outlets is not supported: the coefficient on state is statistically significant at the
10% level, but its sign is negative, meaning that the relationship runs in the opposite
direction. A possible explanation is that the sample may include the state-owned bank
Ukreximbank, which has very few outlets.
We can predict the average number of outlets per region by constructing an average private bank
and using it for prediction:
#Create average bank for average region
#Predict average numbers of outlets in the average region for average bank
phat
The result, 1.6, means that such a bank will have on average 1.6 outlets per region.
Let’s assume that a foreign investor wants to establish a new bank with assets of 50 million and
an expected first-year profit of -2 million. How many outlets will such a bank open in a region
with a population of 3 million people and GDP = 0.1 (billion)? We can represent this bank as a
one-row data.frame and make a prediction for it:
#Another example: not average but some specific bank in specific region
my_bank = data.frame(Assets = 50, Profit = -2, population = 3, gdp = 0.1, foreign = 1, state = 0)
Most likely, such a bank will open on average 7.6 outlets in a region of this type.
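Both predictions can be reproduced with predict. The sketch below is not the original code: it fits m1 on simulated stand-in data (all values hypothetical, so the numbers will differ from 1.6 and 7.6), and it assumes that ave_bank is built from the means of the continuous regressors with both ownership dummies set to 0. The names phat_my_bank and phat_manual are mine.

```r
# Simulated stand-in for the outlets data (hypothetical values)
set.seed(1)
n = 200
outlets = data.frame(
  Assets = runif(n, 1, 100), Profit = rnorm(n), population = runif(n, 1, 4),
  gdp = runif(n, 0.05, 0.4), foreign = rbinom(n, 1, 0.3), state = rbinom(n, 1, 0.1)
)
outlets$value = rpois(n, exp(0.2 + 0.01 * outlets$Assets + 0.3 * outlets$population))
m1 = glm(value ~ Assets + Profit + population + gdp + foreign + state,
         data = outlets, family = "poisson")

# Average private bank in the average region: means of the continuous
# regressors, both ownership dummies set to 0
ave_bank = data.frame(
  Assets = mean(outlets$Assets), Profit = mean(outlets$Profit),
  population = mean(outlets$population), gdp = mean(outlets$gdp),
  foreign = 0, state = 0
)
phat = predict(m1, ave_bank, type = "response")

# The specific new foreign bank from the text
my_bank = data.frame(Assets = 50, Profit = -2, population = 3, gdp = 0.1,
                     foreign = 1, state = 0)
phat_my_bank = predict(m1, my_bank, type = "response")

# predict(type = "response") equals exp() of the linear predictor
phat_manual = exp(sum(c(1, unlist(my_bank)) * coef(m1)))
c(phat, phat_my_bank, phat_manual)
```

type = "response" matters here: without it, predict returns the linear predictor, i.e. the logarithm of the expected number of outlets.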
Marginal Effect
The marginal effect of a regressor in a Poisson regression is its coefficient multiplied by the
predicted number of outlets, i.e. exp(x'b) * b_j. We compute it for the average private bank and
for the specific new bank:
#Marginal effects: by how much the number of outlets in the average region for average bank
#changes when x increases by 1
ME = exp(sum(cbind(1, ave_bank) * r_est[, 1])) * r_est[, 1][-1]
ME
write.csv(ME, "outlets_ME.csv")
The ME interpretation is the same as for other models. For example, for our specific new bank, if
assets increase by 1 (i.e. by 1 million), the bank will open approximately 0.26 additional
outlets per region.
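For the specific new bank, the same formula can be applied with my_bank in place of ave_bank. The sketch below uses simulated stand-in data (hypothetical values, so the result will not be 0.26), takes the coefficients from coef(m1), which hold the same values as r_est[, 1], and adds a finite-difference check for Assets. The names ME_my_bank, bumped and fd are mine.

```r
# Simulated stand-in for the outlets data (hypothetical values)
set.seed(1)
n = 200
outlets = data.frame(
  Assets = runif(n, 1, 100), Profit = rnorm(n), population = runif(n, 1, 4),
  gdp = runif(n, 0.05, 0.4), foreign = rbinom(n, 1, 0.3), state = rbinom(n, 1, 0.1)
)
outlets$value = rpois(n, exp(0.2 + 0.01 * outlets$Assets + 0.3 * outlets$population))
m1 = glm(value ~ Assets + Profit + population + gdp + foreign + state,
         data = outlets, family = "poisson")

my_bank = data.frame(Assets = 50, Profit = -2, population = 3, gdp = 0.1,
                     foreign = 1, state = 0)

# Marginal effects at my_bank: predicted count times each slope coefficient
b = coef(m1)
ME_my_bank = exp(sum(c(1, unlist(my_bank)) * b)) * b[-1]
ME_my_bank

# Numerical check for Assets: slope of the prediction for a tiny change in Assets
h = 1e-6
bumped = my_bank
bumped$Assets = bumped$Assets + h
fd = (predict(m1, bumped, type = "response") -
      predict(m1, my_bank, type = "response")) / h
c(formula = unname(ME_my_bank["Assets"]), finite_difference = unname(fd))
```

The marginal effect is a derivative, so it describes a one-unit change only approximately, and for the 0/1 dummies foreign and state it is better read as a rough guide than as an exact difference.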
R-squared
We will calculate the McFadden R-squared (we need to run a regression on a constant only, i.e.
specify the right-hand side of the formula as 1, extract the log-likelihood values of both models
and use them in the formula) and the usual R-squared based on the sum of squared residuals:
#R-squared
#McFadden R-squared
m1_const = glm(value ~ 1, data=outlets, family="poisson")
Both R-squared values are fairly high: around 75% of the variation in the data is explained by the model.
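The two measures described in this section can be sketched as follows, again on simulated stand-in data (hypothetical values, so the results will not match the 75% reported above). I assume that the "usual" R-squared means one minus the ratio of the residual sum of squares to the total sum of squares; the names mcfadden_r2 and usual_r2 are mine.

```r
# Simulated stand-in for the outlets data (hypothetical values)
set.seed(1)
n = 200
outlets = data.frame(
  Assets = runif(n, 1, 100), Profit = rnorm(n), population = runif(n, 1, 4),
  gdp = runif(n, 0.05, 0.4), foreign = rbinom(n, 1, 0.3), state = rbinom(n, 1, 0.1)
)
outlets$value = rpois(n, exp(0.2 + 0.01 * outlets$Assets + 0.3 * outlets$population))
m1 = glm(value ~ Assets + Profit + population + gdp + foreign + state,
         data = outlets, family = "poisson")

# McFadden R-squared: compare log-likelihoods of the full and constant-only models
m1_const = glm(value ~ 1, data = outlets, family = "poisson")
mcfadden_r2 = 1 - as.numeric(logLik(m1)) / as.numeric(logLik(m1_const))

# Usual R-squared: 1 - residual sum of squares / total sum of squares
ss_res = sum((outlets$value - fitted(m1))^2)
ss_tot = sum((outlets$value - mean(outlets$value))^2)
usual_r2 = 1 - ss_res / ss_tot

c(McFadden = mcfadden_r2, usual = usual_r2)
```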