Citation Analysis: A Comparison of Google Scholar, Scopus, and Web of Science
Lokman I. Meho
School of Library and Information Science, Indiana University, 1320 East 10th
St., LI 011, Bloomington, IN 47405; Tel: 812-855-2018; meho@indiana.edu
When faculty members are evaluated, they are judged in part by the impact and
quality of their scholarly publications. While all academic institutions look to
publication counts and venues as well as the subjective opinions of peers, many
hiring, tenure, and promotion committees also rely on citation analysis to obtain a
more objective assessment of an author’s work. Consequently, faculty members try
to identify as many citations to their published works as possible to provide a
comprehensive assessment of their publication impact on the scholarly and
professional communities. The Institute for Scientific Information’s (ISI) citation
databases, which are widely used as a starting point if not the only source for
locating citations, have several limitations that may leave gaps in the coverage of
citations to an author’s work. This paper presents a case study comparing citations
found in Scopus and Google Scholar with those found in Web of Science (the portal
used to search the three ISI citation databases) for items published by two Library
and Information Science full-time faculty members. In addition, the paper presents a
brief overview of a prototype system called CiteSearch, which analyzes combined
data from multiple citation databases to produce citation-based quality evaluation
measures.
Introduction
Citation analysis, along with peer judgment and assessments of publication counts and
venues, is one of the most widely used methods in evaluating the research performance of
scholars (Lewison, 2001; Thomas & Watkins, 1998). Researchers and administrators at
many academic institutions worldwide make use of citation data for hiring, promotion, and
tenure decisions, among others (Wallin, 2005). Citation counts provide researchers and
administrators with a reliable and efficient indicator for assessing the research performance
of authors, projects, programs, institutions, and countries and the relative impact and
quality of their work (Cronin, 1984; van Raan, 2005). The use of citation counts for
evaluating research is based on the assumption that citations are a way of giving credit to
and recognizing the value, quality, and significance of an author’s work (Borgman & Furner,
2002; van Raan, 1996).
Many scholars have argued for, and some against, the use of citations for assessing
research quality (Borgman & Furner, 2002). While proponents have reported the validity
of citation counts in research assessments as well as the positive correlation between
these counts and peer reviews and assessments of publication venues (Aksnes & Taxt,
2004; Glänzel, 1996; Holmes & Oppenheim, 2001; Kostoff, 1996; Martin, 1996; Schloegl &
Stock, 2004; So, 1998; van Raan, 2000), critics claim that citation counting has serious
problems or limitations that impact its validity (MacRoberts & MacRoberts, 1989, 1996;
Seglen, 1998). Important limitations reported in the literature focus on, among other things,
problems associated with the data sources used, especially Web of Science, the
standard and most widely used tool for generating citation data for research assessment
purposes. Critics note that Web of Science: (1) covers mainly English-language journal
articles published in the United States, United Kingdom, and Canada; (2) is limited to
citations from journals and papers indexed in the ISI databases; (3) provides uneven
coverage across research fields; (4) does not count citations from books and other non-ISI
sources; and (5) contains citing errors (e.g., homonyms, synonyms, and inconsistency in the
use of initials and in the spelling of non-English names) (Lewison, 2001; Reed, 1995;
Seglen, 1998). For a detailed summary of potentials and pitfalls of citation analysis for
research assessment, see Wallin (2005).
Although there are many databases and services that could be used to answer the
abovementioned research questions, the current study focuses on comparing Scopus and
Google Scholar with Web of Science. Scopus and Google Scholar were chosen because of
their similarity to Web of Science: they were created primarily for citation searching but
can also be used for bibliographic searching, among other purposes. They were also
chosen because they represent the major competitors to Web of Science in the field of
citation analysis and bibliometrics. Currently, no general, comprehensive databases or
services other than Scopus and Google Scholar pose a major challenge to Web of
Science as a citation analysis tool.
Method
Search Tools
This study compares Scopus and Google Scholar with Web of Science for locating citations
to individual papers and authors. As mentioned earlier, Web of Science, which comprises
the three ISI citation databases, has been the standard tool for a significant portion of all
citation studies worldwide. Its website provides substantial factual information about the
database, including the number of records and lists of journals indexed. It also offers
powerful browsing, searching, sorting, and saving features, as well as export to citation
management software. Coverage in Web of Science goes back to 1945 for Science
Citation Index, 1956 for Social Sciences Citation Index, and 1975 for Arts & Humanities
Citation Index. As of January 2006, the database contained over 35 million records from
approximately 8,700 scholarly journals (including open access ones) and a number of
refereed conference proceedings. Subject coverage in Web of Science spans virtually all
disciplines found in university curricula in the arts, humanities, sciences, and social
sciences. For more details on Web of Science, see Goodman and
Deis (2005) and Jacso (2005a).
Similar to ISI, Elsevier, the producer of Scopus, provides substantial factual information
about the database, including the number of records and lists of journals indexed
(http://www.info.scopus.com/). It also offers powerful browsing, searching, sorting, and
saving features, as well as export to citation management software. Coverage in Scopus
goes back to 1966 (1996 for citations). In 2005, the database contained over 27 million
records from 14,200 titles, broken down as follows: 12,850 academic journals (including
535 open access journals), 750 conference proceedings, and 600 trade publications.
Subject areas covered in Scopus include: Chemistry, Physics, Mathematics, and
Engineering (4,500 titles); Life and Health Sciences (5,900 titles, with 100% MEDLINE
coverage); Social Sciences, Psychology, and Economics (2,700 titles); Biological,
Agricultural, and Environmental Sciences (2,500 titles); and General Sciences (50
titles). For more details on Scopus, see
Goodman and Deis (2005) and Jacso (2005a).
In contrast to ISI and Elsevier, Google does not offer a publisher list, journal list, or any
information about the time-span or the refereed status of records in Google Scholar. This
and other studies, however, have found that Google Scholar covers print and electronic
journals, conference proceedings, books, theses, dissertations, preprints, abstracts, and
technical reports available from major academic publishers, distributors, aggregators,
professional societies, government agencies, and preprint/reprint repositories at
universities, as well as those available across the web (Bauer & Bakkalbasi, 2005; Gardner
& Eng, 2005; Jacso, 2005b; Wleklinski, 2005). Examples of these sources include: The
American Physical Society, Annual Reviews, arXiv.org, Association for Computing
Machinery (ACM), Blackwell, Cambridge Scientific Abstracts (CSA), HighWire Press,
Ingenta, Institute of Electrical and Electronics Engineers (IEEE), Macmillan, Meta Press,
NASA Astrophysics Data System (ADS), National Institutes of Health (NIH), National
Oceanic and Atmospheric Administration (NOAA), Nature Publishing Group, Project MUSE,
PubMed, RePEc (Research Papers in Economics), Sage, Springer, Taylor & Francis,
University of Chicago Press, and Wiley, among others. Although Google Scholar does not
cover material from all major publishers (e.g., American Chemical Society and Elsevier), it
contains citations to articles from ACS and Elsevier when documents from other sources
cite these articles.
Table 1. Items Used in the Study

                        Mostafa   Nisonger
Document Type
  Journal articles           11         28
  Conference papers          22          6
  Reports                     0         15
  Bibliographies              0          5
  e-Journal articles          3          0
  Review articles             1          3
  Books                       1          2
  Chapters in books           0          3
  Other                       2          0
  Total                      40         62
Refereed Status
  Refereed                   16         28
  Not refereed               18         27
  Not applicable              6          7
  Total                      40         62
Publication Year
  Pre-1986                    0          8
  1986-1990                   0          8
  1991-1995                   2         16
  1996-1997                  10          7
  1998-1999                   2          8
  2000-2001                   7          9
  2002-2003                  16          3
  2004-2005                   3          3
  Total                      40         62
Units of Analysis
To compare citations found in Scopus and Google Scholar with those found in Web of
Science, and determine differences between them in terms of citation counts as well as the
source of the citations, their type (e.g., journal article, conference paper), and refereed
status, we used the publication lists of two colleagues from the School of Library and
Information Science at Indiana University, namely Javed Mostafa and Thomas E. Nisonger.
We selected Mostafa and Nisonger because both are highly published and cited
authors who work in considerably different Library and Information Science (LIS) research
areas: Mostafa in intelligent interfaces for information retrieval and filtering, knowledge
discovery, user modeling, and personalized delivery of information; and Nisonger in
collection management and evaluation, bibliometrics, and serials.
As shown below, this wide variety of research areas provided a valuable framework to
make comparisons between Scopus, Google Scholar, and Web of Science. Table 1 shows
detailed information about the items used in this study.
Google Scholar can be searched for citations to an individual item or author in two ways
(a sketch of the query construction appears after this list):

Author search: retrieves items published by the author in question and ranks them by
citation counts. The searcher must click the "Cited by . . ." link to view the documents
that cite each item. When an author name is very common, additional keywords (e.g.,
journal name or title keywords) may be needed to increase precision. It may also be
necessary to search under variations of the author name to account for name changes
and citing styles, such as last-name, first-name last-name, and first-name middle-initial
last-name. All these variations can be ORed in a single search statement, with each
phrase placed between quotation marks. When an accurate author search is not possible,
a title search is recommended (albeit more tedious), especially when an author has
published tens or hundreds of papers.

Title search: uses the title of each item (e.g., journal article, book, book chapter, or
conference paper) published by the author in question. The result is a list of all
documents that cite the item. When a title is too short or ambiguous to identify only the
item in question, the searcher must AND additional keywords with the title search string
to narrow the result set to the most relevant records; these could include the author's
last name, journal name, book or conference title, publisher name, or a combination of
these keywords.
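To make the two strategies concrete, the following minimal Python sketch builds the kinds of query strings described above. The helper names and the example inputs are illustrative assumptions, not part of the study or of any Google Scholar interface.

    # A minimal sketch of the query construction described above; helper names
    # and the example name variants are hypothetical, not from the study.

    def author_query(name_variants):
        """OR together quoted variants of an author's name."""
        return " OR ".join(f'"{v}"' for v in name_variants)

    def title_query(title, *extra_keywords):
        """Quote the item title and AND it with disambiguating keywords
        (most search engines treat adjacent terms as implicitly ANDed)."""
        return " ".join([f'"{title}"', *extra_keywords])

    if __name__ == "__main__":
        print(author_query(["J. Mostafa", "Javed Mostafa", "Mostafa, Javed"]))
        # -> "J. Mostafa" OR "Javed Mostafa" OR "Mostafa, Javed"
        print(title_query("A hypothetical article title", "Mostafa"))
        # -> "A hypothetical article title" Mostafa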
A major disadvantage of Google Scholar is that its records are retrieved in a way that is
very impractical for use with large sets, requiring a tedious process of manually cleaning,
organizing, and classifying the information into meaningful and usable formats. Unlike
Scopus and Web of Science, Google Scholar does not allow re-sorting of the retrieved set
in any way (such as by date, author name, or data source); retrieved sets are usually
rank-ordered by number of citations. Result sets show short entries displaying the title of
the cited article and the name of the author(s); entries that include the link [Cited by . . .]
indicate the number of times the article has been cited, and clicking the link takes users
to the list of citing articles. Other disadvantages of Google Scholar include duplicate
citations (e.g., a citation published in two different forms, such as preprint and journal
article, is counted as two citations). In many cases, the item for which citations are
sought is itself retrieved and counted as a citation.
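Since Google Scholar performs no such screening itself, the Python sketch below illustrates one way a searcher might filter duplicates and self-retrievals, under the assumption that normalized titles identify duplicate forms; the helper names are hypothetical, not from the study.

    import re

    def normalize(title):
        """Lowercase, strip punctuation, and collapse whitespace so that a
        preprint and its journal version compare as equal titles."""
        cleaned = re.sub(r"[^a-z0-9 ]+", " ", title.lower())
        return re.sub(r"\s+", " ", cleaned).strip()

    def screen_citations(citing_titles, cited_title):
        """Drop duplicate citing documents and the cited item itself."""
        target, seen, kept = normalize(cited_title), set(), []
        for t in citing_titles:
            key = normalize(t)
            if key == target or key in seen:  # self-retrieval or duplicate form
                continue
            seen.add(key)
            kept.append(t)
        return kept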
CiteSearch allowed us to automatically: (1) conduct both author and title searches at the
same time; (2) retrieve and merge results from both types of searches; (3) remove
duplicate records; and (4) export results directly into a spreadsheet while parsing data
into identifiable fields (e.g., author, title, journal name, and year of publication); a sketch
of steps (2)-(4) appears after this paragraph. Although all searches were done
automatically, the results for each search were examined twice by a research assistant and
twice again by one of the authors (Meho) to guarantee high precision and recall.
Comparisons between all four sets were made and all errors with the data and the retrieval
system were corrected. To generate accurate Web of Science and Scopus citation data, we
conducted searches for each item published by the two faculty members. We also
conducted cited author searches to enhance recall.
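CiteSearch's source code is not published in this paper; the following Python sketch only illustrates the merge, de-duplication, and export steps it describes. Record structure and field names are assumptions.

    import csv

    # Illustrative only: CiteSearch's actual implementation is not published.
    # Each record is assumed to be a dict with already-parsed fields.

    def merge_and_export(author_hits, title_hits, path):
        """Merge two result sets, drop duplicates by (title, year), and
        write the parsed fields to a CSV file for spreadsheet analysis."""
        merged, seen = [], set()
        for rec in author_hits + title_hits:
            key = (rec["title"].lower(), rec.get("year"))
            if key not in seen:
                seen.add(key)
                merged.append(rec)
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(
                f, fieldnames=["author", "title", "journal", "year"],
                extrasaction="ignore")
            writer.writeheader()
            writer.writerows(merged)
        return merged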
All data collected were entered into an Excel file, where items were coded by document
type (e.g., journal article, review article, conference paper) and by the refereed status of
both the cited and citing item(s), as well as by where the item was cited (in which book,
article, chapter, and so on) and which source was used to identify the citation. The
refereed status of citations found exclusively through Google Scholar was determined
through Ulrich's International Periodicals Directory and the authors' own domain
knowledge.
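As a concrete illustration of this coding scheme, a single coded record might look like the following; the field names and placeholder values are hypothetical, not the study's actual spreadsheet layout.

    # Hypothetical coded record; field names and values are illustrative only.
    coded_citation = {
        "cited_item": "Article A",            # the faculty member's publication
        "citing_item": "Article B",           # the document containing the citation
        "cited_doc_type": "journal article",  # e.g., review article, conference paper
        "citing_doc_type": "conference paper",
        "cited_refereed": True,
        "citing_refereed": False,             # checked against Ulrich's if GS-only
        "found_in": ["Google Scholar"],       # source(s) that surfaced the citation
    }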
Limitations
Although the number and type of records used in this study are larger and more diverse
than those used in similar published studies (e.g., Bauer & Bakkalbasi, 2005; Jacso,
2005a), the primary limitation of the study is still the small size of the sample examined.
Despite this limitation, the study contributes significantly to research, especially because
it is the first to show empirically how the use of multiple sources provides a more
comprehensive picture of an author’s research impact. The study also generates several
important questions for future research (see below). CiteSearch, the search system
developed and used here, should also be very valuable to researchers interested in citation
analysis and bibliometric studies.
Results and Discussion
In this section, two topics are discussed: a comparative analysis of all three databases and
an analysis of the value and quality of citations found through Google Scholar. For the first
topic, only two sets of citations from Google Scholar are used in the analysis here: (1)
citations that overlapped with Scopus and/or Web of Science; and (2) citations found in
refereed journal articles. This decision was made to allow accurate and fair comparisons
among the three databases. As mentioned earlier, both Scopus and Web of Science
index primarily refereed journal articles, whereas Google Scholar indexes several refereed
and non-refereed document types in addition to journal articles. For the second topic, all
citations found through Google Scholar are analyzed to discern their overall value and
quality. Before discussing the results, it should be emphasized that the contents of all three
databases are updated very frequently; therefore, the numbers reported here will likely
have changed by the time this paper is published.
As far as citation counts are concerned, results show that coverage in the three databases
is highly dependent on the subject matter of the faculty member. For example, in Mostafa's
case (whose research focus is in intelligent interfaces for information retrieval and filtering,
knowledge discovery, user modeling, and personalized delivery of information), all three
databases retrieve roughly the same number of citations, whereas in Nisonger's case
(whose research is in collection management and evaluation, bibliometrics, and serials),
the three databases retrieve significantly different results: Web of Science retrieves
almost twice as many citations as either Scopus or Google Scholar.
Table 2 also shows that when all three databases are used to locate citations to an author’s
work, the number of citations increases significantly in comparison to using only one
database. More detail on this is presented in Table 3, which indicates the difference it
makes when broadening the citation sources beyond Web of Science. As with straight counts,
the impact of multi-sourcing of citations is highly dependent on the research area(s) of an
author. In the cases of our two samples, the use of Web of Science and Scopus together
increases Mostafa’s citations by 31.1% and that of Nisonger by 8.7%; the combination of
Web of Science and Google Scholar increases their citations by 25.4% and 19.7%,
respectively. The use of all three databases together increases the number of citations in
scholarly journals by 39.3% in Mostafa’s case and 24.3% in Nisonger’s case.
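These percentages follow from simple set arithmetic: treating each database's citations as a set, the gain from adding databases is the number of new citations in the union divided by the base count. A minimal Python sketch with made-up identifiers (not the study's data):

    def pct_increase(base, *others):
        """Percentage increase in unique citations over the base database
        when citations from other databases are added."""
        union = set(base).union(*others)
        return 100.0 * (len(union) - len(base)) / len(base)

    # Toy data: four WoS citations; Scopus repeats one and adds one new.
    wos = {"c1", "c2", "c3", "c4"}
    scopus = {"c4", "c5"}
    print(f"{pct_increase(wos, scopus):.1f}%")  # 25.0%: one new citation per four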
If we assume that Mostafa is representative of the Information Science field and Nisonger
of Library Science, then one could conclude that: (1) Scopus is much more useful for
Information Science than it is for Library Science in identifying citations not found in Web of
Science; (2) Web of Science is indispensable for both Information Science and Library
Science; and (3) Google Scholar is useful for both fields in locating citations not found in
Web of Science.
Table 4 further supports these conclusions, showing an inverse relationship between
unique and overlapping citations in any two databases. Table 5 likewise supports them,
showing a significantly higher percentage of unique items in Web of Science for Nisonger
than for Mostafa, and the reverse for Scopus.
Table 4. Citation Overlap Among Databases

                                        Mostafa                Nisonger
Source(s)                          Citations  % Overlap   Citations  % Overlap
Web of Science (WoS) + Scopus         160       51.3         188       44.7
WoS + Google Scholar                  153       57.5         207       23.7
WoS + Google Scholar + Scopus         170       36.5         215       16.3
Scopus + Google Scholar               156       53.2         140       70.0
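The paper does not spell out the formula behind the "% overlap" column; assuming it is the share of the combined set found in every source (intersection over union), it could be computed as in this sketch:

    def overlap_pct(*sources):
        """Share of the combined citation set found in every source, i.e.
        |A intersect B ...| / |A union B ...|, as a percentage. That this is
        the table's definition of '% overlap' is an assumption."""
        sets = [set(s) for s in sources]
        union = set().union(*sets)
        common = set.intersection(*sets)
        return 100.0 * len(common) / len(union) if union else 0.0

    # Toy example (made-up identifiers): 3 shared of 6 combined -> 50.0
    print(overlap_pct({"a", "b", "c", "d"}, {"b", "c", "d", "e", "f"}))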
Gardner and Eng (2005) examined the top 100 retrieved records in Google Scholar on the
topic of home schooling and found the following breakdown: 40 journal articles (32 of them
peer-reviewed), 16 books, 15 magazines, seven dissertations, six ERIC documents, five
newspaper articles, three Web sites, two conference papers, and one each of a
monograph, a newsletter, and a government document. In this study, we found relatively
similar results (see
Table 6).
Of the 247 citations found for Mostafa in Google Scholar, 119 (or 48.2%) were in refereed
items. As for Nisonger, 83 (or 74.8%) of his citations in Google Scholar were in refereed
items. This suggests that citations found through Google Scholar are more likely to be in
refereed journals in Library Science than in Information Science, where almost two thirds
of the citations originate from either conference papers or non-refereed materials. The
current authors are examining a much larger and more representative sample to verify
these results.
This study has direct and meaningful implications for faculty members who need
assistance in compiling their own citation records, and for use as a general reference
tool (e.g., for locating citations to a particular paper or book). The study informs reference
and other information specialists of novel ways of identifying citations to an author, paper,
or journal. Until very recently, ISI citation databases were essentially the only practical
sources for locating these references and citations. This study showed that other practical
methods and sources, such as Scopus and Google Scholar, can be used to locate citations
not covered by ISI. Significantly, this study showed that:
1. Web of Science should not be used alone for locating citations to an author or title;
2. Scopus and Google Scholar can help identify a considerable number of valuable
citations not found in Web of Science;
3. Scopus and Google Scholar can help identify a considerable number of citations in
document types not covered by the ISI citation databases;
4. Scopus and Google Scholar may help provide a more comprehensive picture of the
international and interdisciplinary nature of scholarly communication of and among
researchers;
5. Google Scholar has several technical problems that users should be aware of in
order to locate citations accurately and effectively; and
6. The selection of database(s) for locating citations is field-dependent.
This study, furthermore, has significant implications for the wider scholarly community as
researchers begin to adopt the search method used here, along with CiteSearch, the
system developed as part of this study, to identify citation sources in such fields as
business, economics, history, law, medicine, political science, psychology, and sociology.
Given the continuous advances in information technology and improvement in online
access to tens of millions of records through databases and services that provide citation
information, future studies should explore:
- Other sources and searching methods that can and should be used to locate citations
not covered by the ISI citation databases, Scopus, or Google Scholar.
- The differences these sources could make in citation counts and citation traits for
authors, papers, and journals.
- Whether broader sourcing of citations can alter one's relative ranking vis-à-vis others
and, if so, how.
- Which sources of citations provide better coverage of certain subject disciplines than
others.
References
Aksnes, D.W., & Taxt, R.E. (2004). Peer reviews and bibliometric indicators: A
comparative study at a Norwegian university. Research Evaluation, 13(1), 33-41.
Bauer, K., & Bakkalbasi, N. (2005). An examination of citation counts in a new scholarly
communication environment. D-Lib Magazine, 11(9). Retrieved January 25, 2006, from
http://www.dlib.org/dlib/september05/bauer/09bauer.html
Borgman, C.L., & Furner, J. (2002). Scholarly communication and bibliometrics. Annual
Review of Information Science and Technology, 36, 3-72.
Budd, J.M. (2000). Scholarly productivity of U.S. LIS faculty: An update. The Library
Quarterly, 70(2), 230-245.
Cronin, B. (1984). The citation process: The role and significance of citations in scientific
communication. London: Taylor Graham.
Cronin, B., Snyder, H., & Atkins, H. (1997). Comparative citation rankings of authors in
monographic and journal literature: A study of sociology. Journal of Documentation, 53(3),
263-273.
Funkhouser, E.T. (1996). The evaluative use of citation analysis for communications
journals. Human Communication Research, 22(4), 563-574.
Gardner, S., & Eng, S. (2005). Gaga over Google? Scholar in the social sciences. Library
Hi Tech News, 22(8), 42-45.
Goodman, D., & Deis, L. (2005). Web of Science (2004 version) and Scopus. The
Charleston Advisor, 6(3). Retrieved January 25, 2006, from
http://www.charlestonco.com/dnloads/v6n3.pdf
Goodrum, A.A., McCain, K.W., Lawrence, S., & Giles, C.L. (2001). Scholarly publishing in
the Internet age: A citation analysis of computer science literature. Information Processing
& Management, 37(5), 661-675.
Holmes, A., & Oppenheim, C. (2001). Use of citation analysis to predict the outcome of
the 2001 Research Assessment Exercise for Unit of Assessment (UoA) 61: Library and
Information Management. Information Research, 6(2). Retrieved June 15, 2005, from
http://informationr.net/ir/6-2/paper103.html
Jacso, P. (2005b). Google Scholar: The pros and the cons. Online Information
Review, 29(2), 208-214.
MacRoberts, M.H., & MacRoberts, B.R. (1989). Problems of citation analysis: A critical
review. Journal of the American Society for Information Science, 40(5), 342-349.
Martin, B.R. (1996). The use of multiple indicators in the assessment of basic
research. Scientometrics, 36(3), 343-362.
Meho, L.I., & Spurgin, K.M. (in press). Ranking the research productivity of LIS faculty and
schools: An evaluation of data sources and research methods. Journal of the American
Society for Information Science and Technology, 56.
Notess, G.R. (2005). Scholarly web searching: Google Scholar and Scirus. Online, 29(4),
39-41.
Reed, K.L. (1995). Citation analysis of faculty publications: Beyond Science Citation Index
and Social Science [sic] Citation Index. Bulletin of the Medical Library Association, 83(4),
503-508.
Schloegl, C., & Stock, W.G. (2004). Impact and relevance of LIS journals: A scientometric
analysis of international and German-language LIS journals - citation analysis versus
reader survey. Journal of the American Society for Information Science and
Technology, 55(13), 1155-1168.
Seglen, P.O. (1998). Citation rates and journal impact factors are not suitable for
evaluation of research. Acta Orthopaedica Scandinavica, 69(3), 224-229.
So, C.Y.K. (1998). Citation ranking versus expert judgment in evaluating communication
scholars: Effects of research specialty size and individual prominence. Scientometrics, 41(3),
325-333.
Thomas, P.R., & Watkins, D.S. (1998). Institutional research rankings via bibliometric
analysis and direct peer review: A comparative case study with policy
implications. Scientometrics, 41(3), 335-355.
Thomson Corporation. (2006). Web of Science 7.0. Retrieved June 15, 2005, from
http://scientific.thomson.com/support/products/wos7/
Van Hooydonk, G., & Milis-Proost, G. (1998). Measuring impact by a full option method
and the notion of bibliometric spectra. Scientometrics, 41(2), 169-183.
van Raan, A.F.J. (2000). The Pandora's box of citation analysis: Measuring scientific
excellence - the last evil? In B. Cronin & H.B. Atkins (Eds.), The Web of knowledge: A
festschrift in honor of Eugene Garfield (pp. 301-319). Medford, NJ: Information Today.
van Raan, A.F.J. (2005). Fatal attraction: Conceptual and methodological problems in the
ranking of universities by bibliometric methods. Scientometrics, 62(1), 133-143.
Wallin, J.A. (2005). Bibliometric methods: Pitfalls and possibilities. Basic & Clinical
Pharmacology & Toxicology, 97(5), 261-275.
Wleklinski, J.M. (2005). Studying Google Scholar: Wall to wall coverage? Online, 29(3),
22-26.
Zhao, D.Z., & Logan, E. (2002). Citation analysis using scientific publications on the web
as data source: A case study in the XML research area. Scientometrics, 54(3), 449-472.