research-article
Open access

Skip blocks: reusing execution history to accelerate web scripts

Published: 12 October 2017
  • Abstract

    With more and more web scripting languages on offer, programmers have access to increasing language support for web scraping tasks. However, in our experience collaborating with data scientists, we learned that two issues still plague long-running scraping scripts: i) When a network or website goes down mid-scrape, recovery sometimes requires restarting from the beginning, which users find frustratingly slow. ii) Websites do not offer atomic snapshots of their databases; they update their content so frequently that output data is cluttered with slight variations of the same information (e.g., a tweet from profile 1 that is retweeted on profile 2 and scraped from both profiles, once with 52 responses and later with 53 responses).
    We introduce the skip block, a language construct that addresses both of these disparate problems. Programmers write lightweight annotations to indicate when the current object can be considered equivalent to a previously scraped object and direct the program to skip over the scraping actions in the block. The construct is hierarchical, so programs can skip over long or short script segments, allowing adaptive reuse of prior work. After network and server failures, skip blocks accelerate failure recovery by 7.9x on average. Even scripts that do not encounter failures benefit; because sites display redundant objects, skipping over them accelerates scraping by up to 2.1x. For longitudinal scraping tasks that aim to fetch only new objects, the second run exhibits an average speedup of 5.2x. Our small user study reveals that programmers can quickly produce skip block annotations.
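    The skip block idea described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the names `SkipStore`, `skip_block`, and `scrape` are hypothetical, an in-memory set stands in for whatever persistent store a real system would use, and the "network failure" is simulated. The sketch assumes a tweet is identified by its author and text, so a changed response count does not make it a new object.

```python
from typing import Callable, Dict, Hashable, List, Optional, Set, Tuple

Tweet = Tuple[str, str, int]  # (author, text, response count)


class SkipStore:
    """Hypothetical record of committed keys (in memory; a real system would persist it)."""

    def __init__(self) -> None:
        self._committed: Set[Tuple[str, Hashable]] = set()

    def skip_block(self, entity: str, key: Hashable, body: Callable[[], None]) -> bool:
        """Run body unless (entity, key) was committed before. Return True if body ran."""
        record = (entity, key)
        if record in self._committed:
            return False  # equivalent object already scraped: skip the block
        body()  # if body raises, the key is not committed, so a rerun retries it
        self._committed.add(record)
        return True


def scrape(store: SkipStore, profiles: Dict[str, List[Tweet]],
           output: List[Tweet], fail_on: Optional[str] = None) -> None:
    """Nested skip blocks: an outer block per profile, an inner block per tweet."""
    for profile, tweets in profiles.items():
        def scrape_profile(profile: str = profile, tweets: List[Tweet] = tweets) -> None:
            if profile == fail_on:
                raise ConnectionError("simulated network failure")
            for author, text, responses in tweets:
                def scrape_tweet(row: Tweet = (author, text, responses)) -> None:
                    output.append(row)
                # Key annotation: the response count is deliberately left out of the key.
                store.skip_block("tweet", (author, text), scrape_tweet)
        store.skip_block("profile", profile, scrape_profile)


profiles = {
    "profile1": [("alice", "hello world", 52)],
    "profile2": [("alice", "hello world", 53), ("bob", "new post", 4)],
}
store = SkipStore()
output: List[Tweet] = []

try:
    scrape(store, profiles, output, fail_on="profile2")  # first run fails mid-scrape
except ConnectionError:
    pass
scrape(store, profiles, output)  # rerun: skips profile1 and the duplicate tweet
print(output)
```

    In this sketch the rerun skips all of profile1 through the outer block, and inside profile2 it skips the retweet through the inner block, so the output holds one row per distinct tweet. Keys committed by inner blocks survive a failure of the enclosing block, which is one possible design choice among several.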

    Published In

    Proceedings of the ACM on Programming Languages, Volume 1, Issue OOPSLA
    October 2017
    1786 pages
    EISSN:2475-1421
    DOI:10.1145/3152284
    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 12 October 2017
    Published in PACMPL Volume 1, Issue OOPSLA

    Author Tags

    1. End-User Programming
    2. Incremental Scraping
    3. Programming By Demonstration
    4. Web Scraping

    Qualifiers

    • Research-article

    Article Metrics

    • Downloads (Last 12 months): 75
    • Downloads (Last 6 weeks): 17
    Reflects downloads up to 12 Aug 2024

    Cited By

    • (2023) ImageEye: Batch Image Processing using Program Synthesis. Proceedings of the ACM on Programming Languages 7:PLDI (686-711). DOI: 10.1145/3591248. Online publication date: 6-Jun-2023
    • (2022) Synthesizing analytical SQL queries from computation demonstration. Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation (168-182). DOI: 10.1145/3519939.3523712. Online publication date: 9-Jun-2022
    • (2022) Informing Housing Policy through Web Automation: Lessons for Designing Programming Tools for Domain Experts. Extended Abstracts of the 2022 CHI Conference on Human Factors in Computing Systems (1-9). DOI: 10.1145/3491101.3503575. Online publication date: 27-Apr-2022
    • (2021) Hindsight logging for model training. Proceedings of the VLDB Endowment 14:4 (682-693). DOI: 10.14778/3436905.3436925. Online publication date: 22-Feb-2021
    • (2021) Web question answering with neurosymbolic program synthesis. Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation (328-343). DOI: 10.1145/3453483.3454047. Online publication date: 19-Jun-2021
    • (2021) DIY assistant: a multi-modal end-user programmable virtual assistant. Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation (312-327). DOI: 10.1145/3453483.3454046. Online publication date: 19-Jun-2021
    • (2020) Structure interpretation of text formats. Proceedings of the ACM on Programming Languages 4:OOPSLA (1-29). DOI: 10.1145/3428280. Online publication date: 13-Nov-2020
    • (2020) Privacy-Preserving Script Sharing in GUI-based Programming-by-Demonstration Systems. Proceedings of the ACM on Human-Computer Interaction 4:CSCW1 (1-23). DOI: 10.1145/3392869. Online publication date: 29-May-2020
    • (2020) Racialized Discourse in Seattle Rental Ad Texts. Social Forces 99:4 (1432-1456). DOI: 10.1093/sf/soaa075. Online publication date: 3-Aug-2020
    • (2019) Barriers to Reproducible Scientific Programming. 2019 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC) (217-221). DOI: 10.1109/VLHCC.2019.8818907. Online publication date: Oct-2019
