research-article

Experience: Analyzing Missing Web Page Visits and Unintentional Web Page Visits from the Client-side Web Logs

Authors:

Hung-Hsuan Chen

ACM Journal of Data and Information Quality (JDIQ), Volume 14, Issue 2

Article No.: 11, Pages 1 - 17

https://doi.org/10.1145/3490392

Published: 23 March 2022

Abstract

Web logs have been widely used to represent the web page visits of online users. However, we found that the web logs in Chrome’s browsing history record only 57% of users’ visited websites, i.e., roughly 43% of a user’s website visits go unrecorded. Additionally, 5.1% of the visits recorded in the web log occur because of unconscious user actions, i.e., these page visits are not initiated by the users. We created a Google Chrome plugin and recruited users to install it so that we could collect and analyze conscious URL visits, unconscious URL visits, and “missing” URL visits (i.e., visits that are not recorded in the traditional web log). We reported the statistics of these behaviors. We showed that ranking popular website categories based on traditional web logs yields a different order from the rankings obtained when missing visits are included or unintentional visits are excluded. We predicted users’ future behaviors based on three types of training data: all the visits in modern web logs, the intentional visits in web logs, and the intentional visits plus the missing visits. The experimental results indicate that missing visits may contain additional information, and that unintentional visits in web logs may contain more noise than information for user modeling. Consequently, we need to be cautious about the observations and conclusions derived from web log analyses, because web log data can be an incomplete and noisy record of a user’s visited web pages.
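The ranking comparison described in the abstract can be illustrated with a small sketch. It counts visits per website category under three views of the data (logged visits, logged plus missing visits, and logged minus unintentional visits) and compares the resulting rankings with Kendall's tau, one common rank-correlation measure. The abstract does not say which measure the article uses, so that choice is an assumption, and the category names, visit counts, and function names below are hypothetical toy values, not data from the study.

```python
from collections import Counter
from itertools import combinations


def rank_categories(counts):
    """Sort categories from most to least visited (ties broken by name)."""
    return [c for c, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


def kendall_tau(ranking_a, ranking_b):
    """Kendall's tau-a between two tie-free rankings of the same items."""
    pos_a = {item: i for i, item in enumerate(ranking_a)}
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    if pos_a.keys() != pos_b.keys():
        raise ValueError("both rankings must contain the same items")
    concordant = discordant = 0
    for x, y in combinations(ranking_a, 2):
        # A pair is concordant when both rankings order it the same way.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    n_pairs = len(ranking_a) * (len(ranking_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical toy counts (NOT from the article).
logged = Counter({"news": 5, "shopping": 3, "video": 2, "social": 1})
missing = Counter({"video": 4, "social": 3})   # visits absent from the log
unintentional = Counter({"shopping": 2})       # logged, but not user-initiated

rank_logged = rank_categories(logged)
rank_with_missing = rank_categories(logged + missing)
rank_intentional = rank_categories(logged - unintentional)

tau_missing = kendall_tau(rank_logged, rank_with_missing)
tau_intentional = kendall_tau(rank_logged, rank_intentional)
print(rank_logged, rank_with_missing, tau_missing)
print(rank_logged, rank_intentional, tau_intentional)
```

A tau of 1 means the two rankings agree on every pair of categories; values closer to 0 or below indicate that adding missing visits or removing unintentional ones reorders the categories substantially.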

Index Terms

Experience: Analyzing Missing Web Page Visits and Unintentional Web Page Visits from the Client-side Web Logs
1. Applied computing
  1. Electronic commerce
    1. Online shopping
2. Information systems
  1. Information systems applications
    1. Computational advertising
  2. World Wide Web
    1. Web applications
      1. Electronic commerce
    2. Web mining
      1. Web log analysis

Information

Published In

Journal of Data and Information Quality Volume 14, Issue 2

June 2022

150 pages

ISSN:1936-1955

EISSN:1936-1963

DOI:10.1145/3505186

Editor:
Tiziana Catarci
Sapienza University of Rome, Rome, Italy

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 March 2022

Online AM: 08 February 2022

Accepted: 01 October 2021

Revised: 01 August 2021

Received: 01 March 2021

Qualifiers

Research-article
Refereed

Funding Sources

Ministry of Science and Technology of Taiwan
