DOI: 10.1145/3442381.3450050
research-article

Towards Realistic and Reproducible Web Crawl Measurements

Published: 03 June 2021

Abstract

Accurate web measurement is critical for understanding and improving security and privacy online. Such measurements implicitly assume that automated crawls generalize to typical web user experience. But anecdotal evidence suggests the web behaves differently when seen via well-known measurement endpoints or measurement automation frameworks, for various reasons. Our work improves the state of web privacy and security by investigating how key measurements differ when using naive crawling tool defaults vs. careful attempts to match “real” users across the Tranco top 25k web domains. We find web privacy and security measurements significantly affected by vantage point and browser configuration. We conclude that unless researchers ensure their web measurement tools match real world user experience, the research community is likely missing important signals systematically. For example, we find browser configuration alone causing shifts in 19% of known ad and tracking domains encountered and altering the loading frequency of up to 10% of distinct JavaScript code units executed. We find network vantage point having similar, though less dramatic, effects on the same web metrics. To ensure reproducibility, we carefully document our methodology and publish both our code and collected data.

    Information

    Published In

    WWW '21: Proceedings of the Web Conference 2021
    April 2021
    4054 pages
    ISBN:9781450383127
    DOI:10.1145/3442381
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    WWW '21
    WWW '21: The Web Conference 2021
    April 19 - 23, 2021
    Ljubljana, Slovenia

    Acceptance Rates

    Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

    Article Metrics

    • Downloads (last 12 months): 119
    • Downloads (last 6 weeks): 31
    Reflects downloads up to 03 Oct 2024

    Cited By

    • (2024) Evaluating the Impact of Design Decisions on Passive DNS-Based Domain Rankings. 2024 8th Network Traffic Measurement and Analysis Conference (TMA), pp. 1-11. DOI: 10.23919/TMA62044.2024.10559182. Online publication date: 21-May-2024
    • (2024) Targeted and Troublesome: Tracking and Advertising on Children’s Websites. 2024 IEEE Symposium on Security and Privacy (SP), pp. 1517-1535. DOI: 10.1109/SP54263.2024.00118. Online publication date: 19-May-2024
    • (2024) To Auth or Not To Auth? A Comparative Analysis of the Pre- and Post-Login Security Landscape. 2024 IEEE Symposium on Security and Privacy (SP), pp. 1500-1516. DOI: 10.1109/SP54263.2024.00094. Online publication date: 19-May-2024
    • (2024) The Inventory is Dark and Full of Misinformation: Understanding Ad Inventory Pooling in the Ad-Tech Supply Chain. 2024 IEEE Symposium on Security and Privacy (SP), pp. 1590-1608. DOI: 10.1109/SP54263.2024.00003. Online publication date: 19-May-2024
    • (2024) Aligning agent-based testing (ABT) with the experimental research paradigm: a literature review and best practices. Journal of Computational Social Science 7:2, pp. 1625-1644. DOI: 10.1007/s42001-024-00283-6. Online publication date: 16-May-2024
    • (2023) Analyzing Cyber Security Research Practices through a Meta-Research Framework. Proceedings of the 16th Cyber Security Experimentation and Test Workshop, pp. 64-74. DOI: 10.1145/3607505.3607523. Online publication date: 7-Aug-2023
    • (2023) Detection of Inconsistencies in Privacy Practices of Browser Extensions. 2023 IEEE Symposium on Security and Privacy (SP), pp. 2780-2798. DOI: 10.1109/SP46215.2023.10179338. Online publication date: May-2023
    • (2022) How gullible are web measurement tools? Proceedings of the 18th International Conference on emerging Networking EXperiments and Technologies, pp. 171-186. DOI: 10.1145/3555050.3569131. Online publication date: 30-Nov-2022
    • (2022) Towards Automated Auditing for Account and Session Management Flaws in Single Sign-On Deployments. 2022 IEEE Symposium on Security and Privacy (SP), pp. 1774-1790. DOI: 10.1109/SP46214.2022.9833753. Online publication date: May-2022
