The Serials Librarian

From the Printed Page to the Digital Age

ISSN: 0361-526X (Print) 1541-1095 (Online) Journal homepage: https://www.tandfonline.com/loi/wser20

Applying Statistical Methods to Library Data Analysis

Yongming Wang & Jia Mi

To cite this article: Yongming Wang & Jia Mi (2019) Applying Statistical Methods to Library Data
Analysis, The Serials Librarian, 76:1-4, 195-200, DOI: 10.1080/0361526X.2019.1590774

To link to this article: https://doi.org/10.1080/0361526X.2019.1590774

Published with license by Taylor & Francis Group, LLC
© 2019 Yongming Wang and Jia Mi

Published online: 07 May 2019.


Applying Statistical Methods to Library Data Analysis

Yongming Wang and Jia Mi
Presenters

ABSTRACT
This article describes two projects at The College of New Jersey Library that
used statistical methods and the programming language R to analyze
electronic usage data. Project one utilized simple linear regression to
model eight years of monthly database search and full text usage data,
from 2010 to 2017. In project two, Pearson correlation was applied
to analyze the relationship between a journal’s impact factor and the
journal’s full text downloads. It was found that there is a moderate positive
relationship between these two variables.

KEYWORDS
Inferential statistics; data analytics; linear regression; Pearson correlation; electronic resources usage

Introduction
Due to user preferences for the electronic format, libraries of all sizes have embraced e-resources
(ER) as an essential part of a library collection. Thus, applying usage statistics to library collection
development decision making has become even more important. However, we do not see many
successful examples of utilizing ER usage data. Although there are plenty of case studies and
examples presented at professional conferences, and published articles about library data analytics
in recent years, most of them deal with simple data analysis and data visualization created with tools
such as Microsoft Excel, Tableau, and LibInsight. Although these tools are good for visualizing
multiple-year usage trends, they make it difficult to explore the relationships among different
variables in a scientific and mathematical way.
In our pilot data analytics project, we used mathematical statistics to analyze our library’s ER
usage data in order to develop in-depth understanding of the relationship among different data
elements. Specifically, we chose two classical statistical methods: linear regression and Pearson
correlation. Our goals were: (1) to model the ER usage data mathematically and scientifically; (2)
to quantify the usage trend and to predict the future; (3) to identify the relationship between
different data factors; and (4) to position our pioneering work as something that will help
integrate evidence-based decisions in collection management and other library operations in the
future.
A variety of statistical software is available, such as SAS, SPSS, MATLAB, and R. We chose
R for our project. R is both a programming language and an environment for statistical computing
and graphics, with a rich set of packages, including its standard packages and many additional
packages, to handle a variety of statistical computing and graphics tasks. R is freely available
under the GNU General Public License. In our first project, we used an R package called “astsa,”
developed by Shumway and Stoffer. The astsa package is mainly used for time series data analysis,
but it is also useful in linear regression. In our second project, we computed the Pearson
correlation coefficient with the standard R packages. The tool we actually worked in is
RStudio, which is freely available and is among the most popular integrated development environments for R.

CONTACT Jia Mi jmi@tcnj.edu The College of New Jersey Library, 2000 Pennington Road, Ewing, NJ 08628, USA.
Color versions of one or more of the figures in the article can be found online at www.tandfonline.com/wser.

Project one: Simple linear regression

In statistics, linear regression is a linear approach to modelling the relationship between an
independent variable and a dependent variable by fitting a linear equation to observed data. The
independent variable is also called an explanatory variable, and the dependent variable is also called
a response variable. Simple linear regression has one independent variable and one
dependent variable, while multiple linear regression has two or more independent variables
and one dependent variable.
For simple linear regression, the linear equation is as follows:

Y = a + bX (Y: dependent variable; X: independent variable; a: intercept; b: slope)

In Figure 1, each dot represents a pair of observed data (x, y), and the straight line is the fitting
line of equation Y = a + bX.
In this project, we used EBSCO’s Academic Search Premier (ASP) for our data analysis. We
collected monthly search and full text (FT) data for the period from January 2010 to December 2017,
a total of ninety-six months’ worth of data for each. See Figure 2 for sample data.
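
As a minimal sketch of what fitting Y = a + bX involves, the R snippet below simulates a small data set and fits a line with lm(). The numbers are invented for illustration and are not the library's usage data. The snippet also recomputes the slope and intercept from the least-squares formulas b = cov(X, Y) / var(X) and a = mean(Y) − b × mean(X), to show that both routes give the same estimates.

```r
# Simulated data for illustration only; not the ASP usage data.
set.seed(1)
x <- 1:20
y <- 5 + 2 * x + rnorm(20, sd = 1)

# Fit the simple linear regression Y = a + bX.
fit <- lm(y ~ x)
a_hat <- unname(coef(fit)[1])   # intercept
b_hat <- unname(coef(fit)[2])   # slope

# The same estimates from the least-squares formulas.
b_manual <- cov(x, y) / var(x)
a_manual <- mean(y) - b_manual * mean(x)

print(c(intercept = a_hat, slope = b_hat))
print(all.equal(c(a_hat, b_hat), c(a_manual, b_manual)))
```

The fitted slope should land close to the value of 2 used to generate the data, with the difference reflecting the simulated noise.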

Please see the code for this project in Appendix A. Following is the result and the interpretation:
Result:
Call:
lm(formula = asp_search_8years ~ time(asp_search_8years))

Residuals:
     Min       1Q   Median       3Q      Max
-16844.0  -9324.9    201.1   7714.0  21578.1

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)
(Intercept)              2769587.2   912962.7   3.034  0.00312 **
time(asp_search_8years)    -1367.6      453.3  -3.017  0.00329 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10260 on 94 degrees of freedom
Multiple R-squared: 0.08827, Adjusted R-squared: 0.07857
F-statistic: 9.101 on 1 and 94 DF, p-value: 0.003286

Interpretation:
The linear regression equation: Y = a + bX

Figure 1. An illustration of the linear regression equation.


Figure 2. Project one data sample.

In our case, Y is the monthly search count, X is the time in years, a (intercept) = 2769587.2, and b (slope) = −1367.6.
Therefore: Search = 2769587.2 − 1367.6 × Year
Report: The linear regression analysis indicated that the time factor (month/year) was significantly
related to the ASP search data. As time increases by one year, the monthly search count decreases
by about 1,368 (R-square = 0.09, adjusted R-square = 0.08, F[1, 94] = 9.10, p < .05).
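
To illustrate how the fitted equation can be read, the sketch below plugs year values into it. The coefficients are copied from the regression output above. The helper name predict_search is hypothetical and is not part of the original code. Because the series was built with frequency = 12, time() measures time in years with monthly fractions, so the slope is the change in the monthly search count per year.

```r
# Coefficients copied from the regression output for ASP searches.
intercept <- 2769587.2
slope     <- -1367.6

# Hypothetical helper: fitted monthly search count at a given time (in years).
predict_search <- function(year) intercept + slope * year

# time() on a monthly ts counts in years, so months appear as fractions.
month_times <- time(ts(1:3, frequency = 12, start = c(2010, 1)))
print(as.numeric(month_times))

print(predict_search(2010))                         # fitted value at January 2010
print(predict_search(2011) - predict_search(2010))  # change over one year
```

With an R-squared of about 0.09, values computed this way describe the average trend only, and individual months can deviate from it considerably.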
Applying the same mechanism to ASP FT download data, we got the following result:
Call:
lm(formula = asp_ft_8years ~ time(asp_ft_8years))

Residuals:
  Min    1Q Median    3Q   Max
-8358 -4209    -81  3162 10944

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)          1222385.0   424927.5   2.877  0.00497 **
time(asp_ft_8years)     -603.5      211.0  -2.860  0.00522 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4774 on 94 degrees of freedom
Multiple R-squared: 0.08006, Adjusted R-squared: 0.07027
F-statistic: 8.18 on 1 and 94 DF, p-value: 0.005218
Result Interpretation: FT = 1222385.0 − 603.5 × Year
Report: The linear regression analysis indicated that the time factor (month/year) was significantly
related to ASP full text downloads. As time increases by one year, the monthly full text downloads
decrease by about 604 (R-square = 0.08, adjusted R-square = 0.07, F[1, 94] = 8.18, p < .05).
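
The figures in a regression summary are tied together by simple arithmetic, which the following sketch reproduces from the rounded values printed in the full text output. It uses only base R. Because the inputs are rounded, the results are expected to match the printed summary only approximately.

```r
# Values copied from the full text download regression output.
estimate  <- -603.5
std_error <- 211.0
df_resid  <- 94

t_value <- estimate / std_error             # t = estimate / standard error
f_stat  <- t_value^2                        # with one predictor, F = t^2
r_sq    <- f_stat / (f_stat + df_resid)     # R^2 = F / (F + residual df)
p_value <- 2 * pt(-abs(t_value), df_resid)  # two-sided p-value

print(round(c(t = t_value, F = f_stat, R2 = r_sq, p = p_value), 4))
```

The p-value of about .005 is well below .05, while the R-squared of about 0.08 indicates that the time trend accounts for only a small share of the month-to-month variation.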

Figure 3. Scatterplot of search and download data.

Figure 3 shows the scatterplots with fitting lines for both search and FT downloads data.

Project two: Pearson correlation coefficient

Our second project is the application of the Pearson correlation coefficient to the library’s online
subscription journals. We wanted to find out whether there is any relationship between the journal’s
impact factor and the journal’s usage, which we define as full text downloads. If there is indeed
a relationship between these two variables, then what is the degree of the relationship?
But first, let us introduce the Pearson correlation coefficient (r), which measures the degree of
linear relationship between two continuous variables. A continuous variable is one that can be
measured along a continuous scale, such as temperature or a person’s age and weight. The r value
ranges from negative one to positive one (−1 ≤ r ≤ 1). The following degrees of relationship,
based on the r value, are a commonly used rule of thumb:

● −0.99 to −0.6, strong negative relationship
● −0.59 to −0.3, moderate negative relationship
● −0.29 to 0.29, weak relationship (0 means no relationship)
● 0.3 to 0.59, moderate positive relationship
● 0.6 to 0.99, strong positive relationship
● Special cases: 0: no relationship; −1: perfect negative; +1: perfect positive
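
The ranges in the list above can be expressed as a small lookup function. This is only a sketch: the name describe_r is hypothetical, and it assumes the cut-offs are applied to the absolute value of r at 0.3 and 0.6, which closes the small gaps between the listed ranges (for example, between 0.29 and 0.3).

```r
# Hypothetical helper: map a correlation coefficient to the labels in the list.
# Assumption: cut-offs are applied to |r| at 0.3 and 0.6.
describe_r <- function(r) {
  stopifnot(is.numeric(r), length(r) == 1, r >= -1, r <= 1)
  if (r == 0) return("no relationship")
  direction <- if (r > 0) "positive" else "negative"
  size <- abs(r)
  if (size == 1) return(paste("perfect", direction))
  if (size < 0.3) return("weak relationship")
  strength <- if (size < 0.6) "moderate" else "strong"
  paste(strength, direction, "relationship")
}

print(describe_r(0.36))
print(describe_r(-0.75))
print(describe_r(0.1))
```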

An important caveat: correlation does not imply causation. In other words, there is
not necessarily a cause-and-effect relationship between the two variables.
In our project, we collected the five-year average of full text downloads for the library’s subscribed
online journals. Currently, The College of New Jersey Library directly subscribes to about 800
electronic journals. By searching the Journal Citation Reports from Clarivate Analytics, we found
that 390 of the 800 journals have an impact factor. See Figure 4 for sample data.
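
For readers who want to see what the coefficient computes, the sketch below calculates Pearson's r from its definition (the covariance divided by the product of the two standard deviations) and compares it with R's built-in cor(). The data are simulated for illustration, and the variable names impact_factor and downloads are placeholders rather than the library's actual data.

```r
# Simulated values for illustration only; not the library's journal data.
set.seed(42)
impact_factor <- runif(50, min = 0.5, max = 10)
downloads     <- 100 + 30 * impact_factor + rnorm(50, sd = 120)

# Pearson's r from its definition: covariance over the product of standard deviations.
r_manual <- cov(impact_factor, downloads) / (sd(impact_factor) * sd(downloads))

# The same value from R's built-in function.
r_builtin <- cor(impact_factor, downloads, method = "pearson")

print(c(manual = r_manual, builtin = r_builtin))
```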

Please see the code in Appendix B. Following is the result and interpretation:
Result:
Pearson's product-moment correlation
data: impact$`Impact Factor` and impact$`FT Download`
t = 7.5775, df = 387, p-value = 2.627e-13 (= 0.0000000000002627)
alternative hypothesis: true correlation is not equal to 0 (the null hypothesis is rejected)
95 percent confidence interval:
 0.2696482 0.4430372
sample estimates:
      cor
0.3594411 (r = 0.36, a moderate positive relationship)
(r is statistically significant at p < .001)
Interpretation: There is a moderate positive relationship (r = 0.36) between the journal’s downloads
and the journal impact factor. This relationship is statistically significant (p < .001). In other
words, if there were truly no correlation between the two variables, a sample correlation at least
this strong would occur by chance less than 0.1% of the time.
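
The test statistic and confidence interval in the output can be reproduced from r and the degrees of freedom alone. The sketch below does so with base R, using values copied from the output. cor.test() reports df = n − 2, so a df of 387 corresponds to 389 pairs entering the test, and its confidence interval is based on Fisher's z transformation.

```r
# Values copied from the cor.test() output.
r  <- 0.3594411
df <- 387
n  <- df + 2   # cor.test() uses df = n - 2

# t statistic for testing whether the true correlation is zero.
t_value <- r * sqrt(df / (1 - r^2))
p_value <- 2 * pt(-abs(t_value), df)

# 95% confidence interval through Fisher's z transformation.
z     <- atanh(r)
se_z  <- 1 / sqrt(n - 3)
ci_95 <- tanh(z + c(-1, 1) * qnorm(0.975) * se_z)

print(round(t_value, 4))
print(signif(p_value, 4))
print(round(ci_95, 4))
```

Because the whole interval lies above zero, it is consistent with rejecting the null hypothesis of no correlation.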

Figure 5 is the scatterplot for the 390 pairs of data of journal impact factor and full text
downloads.

Figure 4. Project two data sample.

Figure 5. Scatterplot of impact factor and download data.


Conclusion
By applying the statistical methods to library data analysis, we are able to discover some relationships
among different variables, and to gain accurate trend information through the mathematical model.
Hopefully, this approach can be used to predict the trend in our future projects. These two projects
are only preliminary, and they have some limitations. For instance, our data set is relatively small
and the data themselves are fairly neat and clean, so we did not go through the complex data cleaning
and preparation process that other projects might require. Another limitation is that we used linear
regression rather than the more widely used methods of time series analysis, such
as the Autoregressive Integrated Moving Average (ARIMA) model.
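
As a rough indication of what the ARIMA alternative might look like, the sketch below fits a model with the arima() function from R's standard stats package. It was not part of the two projects. The series is simulated, and the model order (1, 0, 0) is an arbitrary choice for illustration, not a recommendation for the ASP data.

```r
# Simulated monthly series for illustration only; not the ASP usage data.
set.seed(123)
simulated <- as.numeric(arima.sim(model = list(ar = 0.6), n = 96, sd = 50))
usage <- ts(1000 + simulated, frequency = 12, start = c(2010, 1))

# Fit an ARIMA(1,0,0) model; the order is an arbitrary choice for this sketch.
fit <- arima(usage, order = c(1, 0, 0))

# Forecast the next twelve months.
forecast_12 <- predict(fit, n.ahead = 12)$pred

print(fit)
print(forecast_12)
```

In practice the model order would be chosen by inspecting the data, for example through autocorrelation plots, rather than fixed in advance.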
For a possible future project, we would like to collect library user data and conduct correlation
analysis of user data versus usage data, which we hope will reveal additional information about our
electronic resources and how our users engage with them. Also, we plan to use multiple linear
regression methods to see if we can discover more accurate relationships among variables and to
model the trends.

Disclosure statement
No potential conflict of interest was reported by the authors.

Appendix A
Code for Project One: Simple Linear Regression
# Import the data
library(readxl)
ASP_SEARCH_8years <- read_excel("~/TCNJ/conference/NASIG/data/ASP_SEARCH-8years.xlsx", col_types = c("numeric"))
View(ASP_SEARCH_8years)  # opens the data viewer in RStudio
ASP_FT_8years <- read_excel("~/TCNJ/conference/NASIG/data/ASP_FT-8years.xlsx", col_types = c("numeric"))

# Make monthly time series objects starting in January 2010
asp_search_8years <- ts(data = ASP_SEARCH_8years, frequency = 12, start = c(2010, 1))
asp_ft_8years <- ts(data = ASP_FT_8years, frequency = 12, start = c(2010, 1))

# Draw the plot, run the linear regression, and add the fitted line
library(astsa)
par(mfrow = c(2, 1))  # two plots stacked in one window
tsplot(asp_search_8years, type = "o", xlab = "Year/Month", main = "ASP Search 8 Years")
fit <- lm(asp_search_8years ~ time(asp_search_8years))
abline(fit)
summary(fit)

Appendix B
Code for Project Two: Pearson Correlation Coefficient
library(readxl)
impact <- read_excel("~/TCNJ/conference/NASIG/data/journal/impact.xlsx")
View(impact)   # opens the data viewer in RStudio
pairs(impact)  # scatterplot matrix of all columns
plot(impact$`Impact Factor`, impact$`FT Download`)  # see scatterplot: Impact_Factor-Download_scatterplot.jpeg
# Pearson correlation test between impact factor and full text downloads
res <- cor.test(impact$`Impact Factor`, impact$`FT Download`, method = "pearson")
res