research-article

Experience: Analyzing Missing Web Page Visits and Unintentional Web Page Visits from the Client-side Web Logs

Published: 23 March 2022

Abstract

Web logs have been widely used to represent the web page visits of online users. However, we found that web logs in Chrome’s browsing history record only 57% of users’ visited websites, i.e., nearly half of a user’s website visits are not recorded. Additionally, 5.1% of the visits recorded in the web log occur because of unconscious user actions, i.e., these page visits are not initiated by users. We created a Google Chrome plugin and recruited users to install it so that we could collect and analyze conscious URL visits, unconscious URL visits, and “missing” URL visits (i.e., visits not recorded in the traditional web log). We reported the statistics of these behaviors. We showed that the ranking of popular website categories based on traditional web logs differs from the rankings obtained when including missing visits or excluding unintentional visits. We predicted users’ future behaviors based on three types of training data: all the visits in modern web logs, the intentional visits in web logs, and the intentional visits in web logs plus the missing visits. The experimental results indicate that missing visits may contain additional information, and that unintentional visits in web logs may contain more noise than information for user modeling. Consequently, observations and conclusions derived from web log analyses should be treated with caution, because web log data can be an incomplete and noisy record of a user’s visited web pages.
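
The category-ranking comparison described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' pipeline: the toy visit list, the category names, the three label values, and the helper names `rank_categories` and `kendall_tau` are all hypothetical, and the choice of Kendall's tau as the rank-agreement statistic is an assumption, since the abstract does not name one.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: (website category, visit kind).
# "missing" visits are those absent from the browser's own log.
VISITS = [
    ("news", "intentional"), ("news", "intentional"), ("news", "missing"),
    ("video", "intentional"), ("video", "unintentional"), ("video", "unintentional"),
    ("shopping", "intentional"), ("shopping", "missing"), ("shopping", "missing"),
    ("search", "intentional"),
]


def rank_categories(visits, kinds):
    """Sort categories by visit count (descending), counting only the given kinds."""
    counts = Counter(cat for cat, kind in visits if kind in kinds)
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(counts, key=lambda c: (-counts[c], c))


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two rankings of the same items (no tie handling)."""
    pos_a = {c: i for i, c in enumerate(rank_a)}
    pos_b = {c: i for i, c in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        sign = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Three views of the same browsing activity.
logged = rank_categories(VISITS, {"intentional", "unintentional"})
intentional_only = rank_categories(VISITS, {"intentional"})
intentional_plus_missing = rank_categories(VISITS, {"intentional", "missing"})

tau_logged_vs_intentional = kendall_tau(logged, intentional_only)
```

In this toy data, unintentional visits push one category to the top of the logged ranking and missing visits lift another in the third view, so the three rankings disagree. With real data, ties and categories that appear in only one view would need explicit handling.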

Published In

Journal of Data and Information Quality  Volume 14, Issue 2
June 2022
150 pages
ISSN:1936-1955
EISSN:1936-1963
DOI:10.1145/3505186

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 March 2022
Online AM: 08 February 2022
Accepted: 01 October 2021
Revised: 01 August 2021
Received: 01 March 2021
Published in JDIQ Volume 14, Issue 2

Author Tags

  1. Clickstream
  2. user behavior
  3. log analysis
  4. user modeling

Qualifiers

  • Research-article
  • Refereed

Funding Sources

  • Ministry of Science and Technology of Taiwan

