DOI: 10.1145/2884781.2884842
research-article

Cross-supervised synthesis of web-crawlers

Published: 14 May 2016

Abstract

A web-crawler is a program that automatically and systematically tracks the links of a website and extracts information from its pages. Due to the different formats of websites, the crawling scheme for different sites can differ dramatically. Manually customizing a crawler for each specific site is time consuming and error-prone. Furthermore, because sites periodically change their format and presentation, crawling schemes have to be manually updated and adjusted. In this paper, we present a technique for automatic synthesis of web-crawlers from examples. The main idea is to use hand-crafted (possibly partial) crawlers for some websites as the basis for crawling other sites that contain the same kind of information. Technically, we use the data on one site to identify data on another site. We then use the identified data to learn the website structure and synthesize an appropriate extraction scheme. We iterate this process, as synthesized extraction schemes result in additional data to be used for re-learning the website structure. We implemented our approach and automatically synthesized 30 crawlers for websites from nine different categories: books, TVs, conferences, universities, cameras, phones, movies, songs, and hotels.
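
The iterative loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the page model (a list of path/text pairs), the exact-string matching, the majority vote over paths, and every function name here are assumptions made only for illustration. Data already known from a hand-crafted crawler is used to locate matching values on a new site, the most frequently matching path becomes that site's extraction scheme, and the values it extracts are fed back as additional known data for the remaining sites.

```python
from collections import Counter

# Assumed page model: a page is a list of (path, text) pairs, where `path`
# stands in for a structural locator such as an XPath expression.
Page = list[tuple[str, str]]


def learn_path(pages, known_values):
    """Return the path whose text most often matches already-known values."""
    votes = Counter(
        path for page in pages for path, text in page if text in known_values
    )
    return votes.most_common(1)[0][0] if votes else None


def extract(pages, path):
    """Apply a learned path to every page and collect the text found there."""
    return {text for page in pages for p, text in page if p == path}


def cross_supervise(known_values, sites, max_rounds=10):
    """Learn one extraction path per site, re-using data found on other sites.

    `known_values` plays the role of output from a hand-crafted crawler.
    `sites` maps a (hypothetical) site name to its list of pages.
    """
    known = set(known_values)
    schemes = {}
    for _ in range(max_rounds):
        before = (len(known), len(schemes))
        for name, pages in sites.items():
            if name in schemes:
                continue  # this site already has an extraction scheme
            path = learn_path(pages, known)
            if path is not None:
                schemes[name] = path
                known |= extract(pages, path)  # new data for later rounds
        if (len(known), len(schemes)) == before:
            break  # fixpoint: no new data and no new schemes
    return schemes, known
```

In this sketch a site that shares no values with the initial data can still be handled in a later round, once another site has contributed overlapping values. A realistic system would need approximate matching and a richer notion of page structure than a single path per site.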

Published In

ICSE '16: Proceedings of the 38th International Conference on Software Engineering
May 2016
1235 pages
ISBN:9781450339001
DOI:10.1145/2884781

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article

Funding Sources

  • European Union

Conference

ICSE '16

Acceptance Rates

Overall Acceptance Rate 276 of 1,856 submissions, 15%

Article Metrics

  • Downloads (last 12 months): 3
  • Downloads (last 6 weeks): 0
Reflects downloads up to 13 Jan 2025

Cited By

  • (2021) Synthesis of web layouts from examples. Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 651-663. DOI: 10.1145/3468264.3468533. Online publication date: 20-Aug-2021
  • (2018) Test migration for efficient large-scale assessment of mobile app coding assignments. Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 164-175. DOI: 10.1145/3213846.3213854. Online publication date: 12-Jul-2018
  • (2018) Programming not only by example. Proceedings of the 40th International Conference on Software Engineering, pp. 1114-1124. DOI: 10.1145/3180155.3180189. Online publication date: 27-May-2018
  • (2017) An elder-centered trip recommendation method leveraging crowd sourcing technologies. 2017 10th International Conference on Ubi-media Computing and Workshops (Ubi-Media), pp. 1-6. DOI: 10.1109/UMEDIA.2017.8074150. Online publication date: Aug-2017
  • (2017) Abstraction-Based Interaction Model for Synthesis. Verification, Model Checking, and Abstract Interpretation, pp. 382-405. DOI: 10.1007/978-3-319-73721-8_18. Online publication date: 29-Dec-2017
