Exploratory and directed search strategies at a social science data archive

Sara Lafia; A.J. Million; Libby Hemphill

doi:10.29173/iq1087

Authors

Sara Lafia ICPSR, University of Michigan
A.J. Million ICPSR, University of Michigan https://orcid.org/0000-0002-8909-153X
Libby Hemphill ICPSR and UMSI, University of Michigan

DOI:

https://doi.org/10.29173/iq1087

Keywords:

research data, information search, query log analysis, user behavior, web analytics

Abstract

Researchers need to be able to find, access, and use data to participate in open science. To understand how users search for research data, we analyzed textual queries issued at a large social science data archive, the Inter-university Consortium for Political and Social Research (ICPSR). We collected unique user queries from 988,475 user search sessions over four years (2012-16). Overall, we found that only 30% of site visitors entered search terms into the ICPSR website. We analyzed search strategies within these sessions by extending existing dataset search taxonomies to classify a subset of the 1,554 most popular queries. We identified five categories of commonly issued queries: keyword-based (e.g., date, place, topic); name (e.g., study, series); identifier (e.g., study, series); author (e.g., institutional, individual); and type (e.g., file, format). While the dominant search strategy used short keywords to explore topics, directed searches for known items using study and series names were also common. We further distinguished exploratory browsing from directed search queries based on their page views, refinements, search depth, duration, and length. Directed queries were longer (i.e., they had more words), while sessions with exploratory queries had more refinements and associated page views. By comparing search interactions at ICPSR to other natural language interactions in similar web search contexts, we conclude that dataset search at ICPSR is underutilized. We envision how alternative search paradigms, such as those enabled by recommender systems, can enhance dataset search.
