Exploratory and directed search strategies at a social science data archive
DOI:
https://doi.org/10.29173/iq1087
Keywords:
research data, information search, query log analysis, user behavior, web analytics
Abstract
Researchers need to be able to find, access, and use data to participate in open science. To understand how users search for research data, we analyzed textual queries issued at a large social science data archive, the Inter-university Consortium for Political and Social Research (ICPSR). We collected unique user queries from 988,475 user search sessions over four years (2012-16). Overall, we found that only 30% of site visitors entered search terms into the ICPSR website. We analyzed search strategies within these sessions by extending existing dataset search taxonomies to classify a subset of the 1,554 most popular queries. We identified five categories of commonly-issued queries: keyword-based (e.g., date, place, topic); name (e.g., study, series); identifier (e.g., study, series); author (e.g., institutional, individual); and type (e.g., file, format). While the dominant search strategy used short keywords to explore topics, directed searches for known items using study and series names were also common. We further distinguished exploratory browsing from directed search queries based on their page views, refinements, search depth, duration, and length. Directed queries were longer (i.e., they had more words), while sessions with exploratory queries had more refinements and associated page views. By comparing search interactions at ICPSR to other natural language interactions in similar web search contexts, we conclude that dataset search at ICPSR is underutilized. We envision how alternative search paradigms, such as those enabled by recommender systems, can enhance dataset search.
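The five query categories described above can be illustrated with a minimal rule-based sketch. This is not the authors' classification procedure, which extended existing dataset search taxonomies; the lookup lists, the identifier pattern, and the function names below are all hypothetical and serve only to show how a query might be assigned to a category and how its length in words might be measured.

```python
import re

# Hypothetical lookup lists (assumptions for illustration only); a real
# analysis would draw these from the archive's own metadata.
KNOWN_NAMES = {"general social survey", "national crime victimization survey"}
KNOWN_AUTHORS = {"united states census bureau"}
FILE_TYPES = {"spss", "stata", "sas", "csv", "codebook"}


def classify_query(query: str) -> str:
    """Assign a query to one of the five categories named in the abstract."""
    q = " ".join(query.lower().split())  # normalize case and whitespace
    # Assumed identifier shape: digits, optionally prefixed with "icpsr".
    if re.fullmatch(r"(icpsr\s*)?\d+", q):
        return "identifier"
    if q in KNOWN_NAMES:
        return "name"
    if q in KNOWN_AUTHORS:
        return "author"
    if any(token in FILE_TYPES for token in q.split()):
        return "type"
    # Everything else is treated as a topical keyword query.
    return "keyword"


def word_count(query: str) -> int:
    """Query length in words, the length measure mentioned in the abstract."""
    return len(query.split())


if __name__ == "__main__":
    for example in ["ICPSR 1234", "General Social Survey", "crime stata", "crime"]:
        print(example, "->", classify_query(example), word_count(example))
```

The rules run from most to least specific, so anything that matches no lookup falls back to the keyword category, which the abstract reports as the dominant strategy.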
License
Copyright (c) 2024 Sara Lafia, A.J. Million, Libby Hemphill
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
This license lets others remix, tweak, and build upon your work non-commercially, and although their new works must also acknowledge you and be non-commercial, they don’t have to license their derivative works on the same terms.
The Creative Commons-Attribution-Noncommercial License 4.0 International applies to all works published by IASSIST Quarterly. Authors will retain copyright of the work. Your contribution will be available at the IASSIST Quarterly website when announced on the IASSIST list server.