Applying Statistical Methods To Library Data Analysis
To cite this article: Yongming Wang & Jia Mi (2019) Applying Statistical Methods to Library Data
Analysis, The Serials Librarian, 76:1-4, 195-200, DOI: 10.1080/0361526X.2019.1590774
ABSTRACT
This article describes two projects at The College of New Jersey Library that used statistical
methods and the programming language R to analyze electronic resource usage data. Project one
used simple linear regression to model eight years (2010 to 2017) of monthly database search and
full text usage data. In project two, Pearson correlation was applied to analyze the relationship
between a journal’s impact factor and its full text downloads. A moderate positive relationship
was found between these two variables.
KEYWORDS
Inferential statistics; data analytics; linear regression; Pearson correlation; electronic resources usage
Introduction
Due to user preferences for the electronic format, libraries of all sizes have embraced e-resources
(ER) as an essential part of a library collection. Thus, applying usage statistics to library collection
development decision making has become even more important. However, we do not see many
successful examples of using ER usage data. Although plenty of case studies and examples have been
presented at professional conferences and published in articles about library data analytics
in recent years, most of them deal with simple data analysis and data visualization created with tools
such as Microsoft Excel, Tableau, and LibInsight. These tools are good for visualizing multi-year
usage trends, but they make it difficult to explore the relationships among different variables in
a rigorous, quantitative way.
In our pilot data analytics project, we used mathematical statistics to analyze our library’s ER
usage data in order to develop an in-depth understanding of the relationships among different data
elements. Specifically, we chose two classical statistical methods: linear regression and Pearson
correlation. Our goals were: (1) to model the ER usage data mathematically and scientifically; (2)
to quantify the usage trend and to predict the future; (3) to identify the relationship between
different data factors; and (4) to position our pioneering work as something that will help
integrate evidence-based decisions in collection management and other library operations in the
future.
A variety of statistical software is available, such as SAS, SPSS, MATLAB, and R. We chose
R for our project. R is both a programming language and an environment for statistical computing
and graphics, with a rich set of standard and add-on packages that handle a wide range of
statistical computing and graphics tasks. R is freely available
under the GNU General Public License. In our first project, we used an R package called “astsa”
developed by Shumway and Stoffer. The astsa package is mainly used for time series data analysis,
but it is also useful for linear regression. In our second project, we computed the Pearson
correlation coefficient using the standard R package. We worked in RStudio, a freely available
and widely used integrated development environment (IDE) for R.
CONTACT Jia Mi jmi@tcnj.edu The College of New Jersey Library, 2000 Pennington Road, Ewing, NJ 08628, USA.
Color versions of one or more of the figures in the article can be found online at www.tandfonline.com/wser.
Published with license by Taylor & Francis Group, LLC
© 2019 Yongming Wang and Jia Mi
In Figure 1, each dot represents a pair of observed data (x, y), and the straight line is the fitted
line with equation Y = a + bX.
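As a minimal sketch of how such a fitted line is obtained, the ordinary least-squares estimates of a and b can be computed in closed form and compared with the output of R's lm(). The data and variable names below are synthetic and illustrative; they are assumptions, not the library's data.

```r
# Synthetic (x, y) pairs for illustration only; not the library's data.
set.seed(1)
x <- 1:50
y <- 10 + 2 * x + rnorm(50, sd = 5)

# Closed-form ordinary least-squares estimates for Y = a + bX.
b <- cov(x, y) / var(x)        # slope
a <- mean(y) - b * mean(x)     # intercept

# The same estimates from lm(); the two sets of values should agree.
fit <- lm(y ~ x)
print(c(a = a, b = b))
print(coef(fit))
```

The slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the point of means.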
In this project, we used EBSCO’s Academic Search Premier (ASP) for our data analysis. We
collected monthly search and full text (FT) data for the period from January 2010 to December 2017,
a total of ninety-six months’ worth of data for each. See Figure 2 for sample data.
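One plausible way to hold ninety-six monthly observations in R is a base ts object with a monthly frequency, so that time() is expressed in calendar years. The sketch below uses placeholder counts; the name usage_ts and the values are hypothetical and are not the authors' data.

```r
# Placeholder monthly counts for illustration only; not the authors' data.
set.seed(42)
monthly_counts <- round(20000 - 110 * (0:95) + rnorm(96, sd = 2000))

# Monthly series starting January 2010; frequency = 12 puts time() in years.
usage_ts <- ts(monthly_counts, start = c(2010, 1), frequency = 12)

length(usage_ts)         # number of monthly observations
range(time(usage_ts))    # first and last time points, in years
head(time(usage_ts), 3)  # time advances by 1/12 of a year per month
```

With this representation, a regression on time(usage_ts) yields a slope measured in units per year rather than per month.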
Please see the code for this project in Appendix A. Following is the result and the interpretation:
Result:
Call:
lm(formula = asp_search_8years ~ time(asp_search_8years))
Residuals:
Min 1Q Median 3Q Max
−16844.0 −9324.9 201.1 7714.0 21578.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2769587.2 912962.7 3.034 0.00312 **
time(asp_search_8years) −1367.6 453.3 −3.017 0.00329 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 10260 on 94 degrees of freedom
Multiple R-squared: 0.08827, Adjusted R-squared: 0.07857
F-statistic: 9.101 on 1 and 94 DF, p-value: .003286
Interpretation:
The linear regression equation: Y = a + bX
In our case, Y is the monthly search count, X is the time (in years), a (intercept) = 2769587, and
b (slope) = −1367.
Therefore: Search = 2769587 − 1367 × Year
Report: The linear regression analysis indicated that the time factor (month/year) was significantly
related to the ASP search data. As the time increases by one year, the monthly search count
decreases by about 1,367 (R-square = 0.09, adjusted R-square = 0.08, F[1, 94] = 9.10, p < .05).
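As a hedged illustration of how the fitted equation can be used, the sketch below plugs a time value into the line using the coefficients printed in the regression output. The helper name predict_search is hypothetical, and extrapolating beyond 2017 assumes the linear trend continues.

```r
# Coefficients copied from the regression output above (rounded as printed).
intercept <- 2769587.2
slope <- -1367.6

# Hypothetical helper: fitted monthly search count at a time given in years.
predict_search <- function(year) intercept + slope * year

predict_search(2010)   # fitted value at the start of the series
predict_search(2018)   # extrapolation just past the observed period
```

With these rounded coefficients the fitted value is about 20,711 searches per month at 2010 and about 9,770 at 2018. Because the slope is printed to one decimal place and multiplied by a value near 2010, these figures are accurate only to within roughly a hundred searches.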
Applying the same mechanism to ASP FT download data, we got the following result:
Call:
lm(formula = asp_ft_8years ~ time(asp_ft_8years))
Residuals:
Min 1Q Median 3Q Max
−8358 −4209 −81 3162 10944
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1222385.0 424927.5 2.877 0.00497 **
time(asp_ft_8years) −603.5 211.0 −2.860 0.00522 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4774 on 94 degrees of freedom
Multiple R-squared: 0.08006, Adjusted R-squared: 0.07027
F-statistic: 8.18 on 1 and 94 DF, p-value: .005218
Result Interpretation: FT = 1222385 − 603 × Year
Report: The linear regression analysis indicated that the time factor (month/year) was significantly
related to ASP full text downloads. As the time increases by one year, the monthly full text
downloads decrease by about 603 (R-square = 0.08, adjusted R-square = 0.07, F[1, 94] = 8.18, p < .05).
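The reported statistics can be cross-checked against one another. For a regression with a single predictor, R-squared equals F / (F + residual degrees of freedom). The sketch below applies that identity to the values printed in the two summaries; the helper name r2_from_f is hypothetical.

```r
# Values copied from the two regression summaries above.
f_search <- 9.101
f_ft <- 8.18
df_resid <- 94

# For a regression with one predictor: R-squared = F / (F + residual df).
r2_from_f <- function(f, df) f / (f + df)

r2_search <- r2_from_f(f_search, df_resid)
r2_ft <- r2_from_f(f_ft, df_resid)
round(c(search = r2_search, full_text = r2_ft), 3)
```

Both values match the printed Multiple R-squared figures (about 0.088 and 0.080). In other words, time explains roughly 8 to 9 percent of the month-to-month variation, so the downward trend is statistically significant but accounts for only a small share of the variability.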
Figure 3 shows the scatterplots with fitting lines for both search and FT downloads data.
An important caveat applies: correlation does not imply causation. In other words, a correlation
between two variables does not necessarily indicate a cause-and-effect relationship between them.
In our project, we collected the five-year average of full text downloads for the library’s subscribed
online journals. Currently The College of New Jersey Library directly subscribes to about 800
electronic journals. A search of the Journal Citation Reports from Clarivate Analytics showed that
390 of these 800 journals have an impact factor. See Figure 4 for sample data.
Please see the code in Appendix B. Following is the result and interpretation:
Result:
Pearson’s product-moment correlation
data: impact$`Impact Factor` and impact$`FT Download`
t = 7.5775, df = 387, p-value = 2.627e-13 ( = 0.0000000000002627)
alternative hypothesis: true correlation is not equal to 0 (Null Hypothesis is rejected.)
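The output as reproduced here does not show the estimated correlation coefficient itself. As a sketch, it can be recovered from the printed t statistic and degrees of freedom using the standard identity r = t / sqrt(t^2 + df). The variable names below are illustrative.

```r
# Values copied from the cor.test output above.
t_stat <- 7.5775
deg_free <- 387

# Identity linking the t statistic of a Pearson test to the coefficient r.
r <- t_stat / sqrt(t_stat^2 + deg_free)
round(r, 2)
```

This gives r of about 0.36, which is consistent with describing the relationship as a moderate positive one.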
Figure 5 is the scatterplot for the 390 pairs of data of journal impact factor and full text
downloads.
Conclusion
By applying statistical methods to library data analysis, we were able to discover some relationships
among different variables and to quantify usage trends through a mathematical model.
We hope this approach can be used to predict trends in our future projects. These two projects
are only preliminary, and they have some limitations. For instance, our data set is relatively small and
the data themselves are fairly neat and clean, so we did not go through the complex data cleaning
and preparation process that other projects might require. Another limitation is that we used linear
regression rather than the more widely used methods of time series analysis, such
as the Autoregressive Integrated Moving Average (ARIMA) model.
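For comparison, a time series model of the kind mentioned above could be fitted with the arima() function in base R's stats package. The following is only a sketch on synthetic data: the series, the object names, and the model orders are arbitrary assumptions chosen for illustration, not a model fitted to the library's data.

```r
# Synthetic monthly series for illustration only; not the library's data.
set.seed(2019)
n <- 96
seasonal <- 3000 * sin(2 * pi * (1:n) / 12)
drift <- cumsum(rnorm(n, mean = -100, sd = 400))
usage_ts <- ts(20000 + drift + seasonal + rnorm(n, sd = 500),
               start = c(2010, 1), frequency = 12)

# Seasonal ARIMA(1,1,0)(0,1,0)[12]; the orders are arbitrary for this sketch.
fit_arima <- arima(usage_ts, order = c(1, 1, 0),
                   seasonal = list(order = c(0, 1, 0), period = 12),
                   method = "ML")

# Forecast the next twelve months, with standard errors.
fc <- predict(fit_arima, n.ahead = 12)
print(fc$pred)
print(fc$se)
```

Unlike a straight-line fit, a model of this kind accounts for seasonality and for the dependence between neighboring months, and its forecasts come with standard errors. In practice the orders would be chosen by inspecting the data rather than fixed in advance.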
For a possible future project, we would like to collect library user data and conduct correlation
analysis of user data versus usage data, which we hope will reveal additional information about our
electronic resources and how our users engage with them. Also, we plan to use multiple linear
regression methods to see if we can discover more accurate relationships among variables and to
model the trends.
Disclosure statement
No potential conflict of interest was reported by the authors.
Appendix A
Code for Project One: Simple Linear Regression
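# Import data
library(readxl)
ASP_SEARCH_8years <- read_excel("~/TCNJ/conference/NASIG/data/ASP_SEARCH-8years.xlsx", col_types = c("numeric"))
View(ASP_SEARCH_8years)
ASP_FT_8years <- read_excel("~/TCNJ/conference/NASIG/data/ASP_FT-8years.xlsx", col_types = c("numeric"))
# Convert the imported column to a monthly time series starting January 2010.
# Assumption: this step is not shown in the original listing. It is inferred from the
# lower-case object name used below and from the regression output, where time is in years.
asp_search_8years <- ts(ASP_SEARCH_8years[[1]], start = c(2010, 1), frequency = 12)
# Draw the time series plot, fit the linear regression, and add the fitted line
library(astsa)
par(mfrow = c(2, 1))
tsplot(asp_search_8years, type = "o", xlab = "Year/Month", main = "ASP Search 8 Years")
fit <- lm(asp_search_8years ~ time(asp_search_8years))
abline(fit)
summary(fit)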
Appendix B
Code for Project Two: Pearson Correlation Coefficient
# Import data
library(readxl)
impact <- read_excel("~/TCNJ/conference/NASIG/data/journal/impact.xlsx")
View(impact)
# Scatterplot matrix of all columns, then impact factor against full text downloads
pairs(impact)
plot(impact$`Impact Factor`, impact$`FT Download`)
# (see scatterplot: Impact_Factor-Download_scatterplot.jpeg)
# Pearson correlation test
res <- cor.test(impact$`Impact Factor`, impact$`FT Download`, method = "pearson")
res