GDist-RIA Crawler: A Greedy Distributed Crawler for Rich Internet Applications

Mirtaheri, Seyed M.; von Bochmann, Gregor; Jourdan, Guy-Vincent; Onut, Iosif Viorel

doi:10.1007/978-3-319-09581-3_14

Part of the book series: Lecture Notes in Computer Science (LNCCN, volume 8593)

Included in the following conference series:

International Conference on Networked Systems

Abstract

Crawling web applications is important for indexing, accessibility and security assessment. Crawling traditional web applications is an old problem, for which good and efficient solutions are known. Crawling Rich Internet Applications (RIAs) quickly and efficiently, however, is an open problem. Technologies such as AJAX and partial Document Object Model (DOM) updates make crawling a RIA more time-consuming for the web crawler. One way to reduce the time needed to crawl a RIA is to crawl it in parallel with multiple computers. The previously published Dist-RIA Crawler presents a distributed breadth-first search algorithm to crawl RIAs. This paper expands the Dist-RIA Crawler in two ways. First, it introduces an adaptive load-balancing algorithm that enables the crawler to learn about the speed of the nodes and adapt to changes, and thus to better utilize the resources. Second, it presents a distributed greedy algorithm to crawl a RIA in parallel, called the GDist-RIA Crawler. The GDist-RIA Crawler uses a server-client architecture in which the server dispatches crawling jobs to the crawling clients. This paper illustrates a prototype implementation of the GDist-RIA Crawler, explains some of the techniques used to implement the prototype, and inspects empirical performance measurements.
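
The server-client dispatch described in the abstract can be pictured with a small sketch. This is not the authors' implementation: the abstract does not spell out the greedy rule, so the sketch assumes that "greedy" means handing each client the nearest state (from the client's current state, over already discovered transitions) that still has an unexecuted event. The names `DispatchServer`, `next_job`, and `report` are hypothetical, and adaptive load balancing is left out.

```python
from collections import deque

PENDING = None  # marks an event whose target state has not been discovered yet


class DispatchServer:
    """Hypothetical server that keeps the application graph and dispatches jobs."""

    def __init__(self, initial_state, events):
        # graph maps state -> {event: target state, or PENDING if unexecuted}
        self.graph = {initial_state: {e: PENDING for e in events}}
        # (state, event) pairs already handed to a client but not yet reported
        self.assigned = set()

    def next_job(self, client_state):
        """Breadth-first search from the client's current state.

        Returns (path, state, event): the events to replay to reach `state`,
        and the unexecuted event to fire there. Returns None if no work is left
        that is reachable and not already assigned.
        """
        queue = deque([(client_state, [])])
        seen = {client_state}
        while queue:
            state, path = queue.popleft()
            for event, target in self.graph[state].items():
                if target is PENDING:
                    if (state, event) not in self.assigned:
                        self.assigned.add((state, event))
                        return path, state, event
                elif target not in seen:
                    seen.add(target)
                    queue.append((target, path + [event]))
        return None

    def report(self, state, event, target, target_events):
        """Record the result of a job: firing `event` in `state` led to `target`."""
        self.graph[state][event] = target
        self.assigned.discard((state, event))
        # Register the target state and its events if it is new.
        self.graph.setdefault(target, {e: PENDING for e in target_events})
```

In this sketch a client would call `next_job` with its current state, replay the returned path, fire the event, and send the outcome back through `report`. Tracking assigned pairs keeps two clients from being given the same event.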

Notes

1. This paper focuses only on JavaScript events and leaves other client-side events, such as Flash events, to future studies.
2. http://phantomjs.org/
3. XMLHttpRequest is the module responsible for asynchronous calls in many popular browsers such as Firefox and Chrome. Microsoft Internet Explorer, however, does not use this module and instead uses ActiveXObject.
4. Due to space limitations, the rest of the code snippets in this section are omitted.
5. http://www.abeautifulsite.net/blog/2008/03/jquery-file-tree/

References

Amalfitano, D., Fasolino, A.R., Tramontana, P.: Reverse engineering finite state machines from rich internet applications. In: Proceedings of the 2008 15th Working Conference on Reverse Engineering, WCRE ’08, pp. 69–73. IEEE Computer Society, Washington, DC (2008)
Amalfitano, D., Fasolino, A.R., Tramontana, P.: Experimenting a reverse engineering technique for modelling the behaviour of rich internet applications. In: IEEE International Conference on Software Maintenance, ICSM 2009, pp. 571–574, September 2009
Amalfitano, D., Fasolino, A.R., Tramontana, P.: Rich internet application testing using execution trace data. In: Proceedings of the 2010 Third International Conference on Software Testing, Verification, and Validation Workshops, ICSTW ’10, pp. 274–283. IEEE Computer Society, Washington, DC (2010)
Benjamin, K., von Bochmann, G., Jourdan, G.-V., Onut, I.-V.: Some modeling challenges when testing rich internet applications for security. In: Proceedings of the 2010 Third International Conference on Software Testing, Verification, and Validation Workshops, ICSTW ’10, pp. 403–409. IEEE Computer Society, Washington, DC (2010)
Benjamin, K., von Bochmann, G., Dincturk, M.E., Jourdan, G.-V., Onut, I.V.: A strategy for efficient crawling of rich internet applications. In: Auer, S., Díaz, O., Papadopoulos, G.A. (eds.) ICWE 2011. LNCS, vol. 6757, pp. 74–89. Springer, Heidelberg (2011)
Boldi, P., Codenotti, B., Santini, M., Vigna, S.: Ubicrawler: a scalable fully distributed web crawler. Proc. Aust. World Wide Web Conf. 34(8), 711–26 (2002)
Boldi, P., Marino, A., Santini, M., Vigna, S.: BUbiNG: massive crawling for the masses. In: WWW (Companion Volume), pp. 227–228 (2014). http://doi.acm.org/10.1145/2567948.2577304
Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search engine. In: Proceedings of the Seventh International Conference on World Wide Web 7, WWW7, pp. 107–117. Elsevier Science Publishers B.V., Amsterdam (1998)
Choudhary, S.: M-crawler: crawling rich internet applications using menu meta-model. Master’s thesis, EECS - University of Ottawa (2012). http://ssrg.site.uottawa.ca/docs/Surya-Thesis.pdf
Choudhary, S., Dincturk, M.E., Mirtaheri, S.M., Jourdan, G.-V., Bochmann, G., Onut, I.V.: Building rich internet applications models: example of a better strategy. In: Daniel, F., Dolog, P., Li, Q. (eds.) ICWE 2013. LNCS, vol. 7977, pp. 291–305. Springer, Heidelberg (2013)
Choudhary, S., Dincturk, M.E., von Bochmann, G., Jourdan, G.-V., Onut, I.-V., Ionescu, P.: Solving some modeling challenges when testing rich internet applications for security. In: ICST, pp. 850–857 (2012)
Choudhary, S., Dincturk, M.E., Mirtaheri, S.M., von Bochmann, G., Jourdan, G.-V., Onut, I.-V.: Crawling rich internet applications: the state of the art. In: Proceedings of the 2012 Conference of the Center for Advanced Studies on Collaborative Research, CASCON ’12, IBM Corp., Riverton (2012)
Dincturk, M.E.: Model-based crawling - an approach to design efficient crawling strategies for rich internet applications. Master’s thesis, EECS - University of Ottawa (2013). http://ssrg.eecs.uottawa.ca/docs/Dincturk_MustafaEmre_2013_thesis.pdf
Dincturk, M.E., Choudhary, S., von Bochmann, G., Jourdan, G.-V., Onut, I.V.: A statistical approach for efficient crawling of rich internet applications. In: Brambilla, M., Tokuda, T., Tolksdorf, R. (eds.) ICWE 2012. LNCS, vol. 7387, pp. 362–9. Springer, Heidelberg (2012)
Duda, C., Frey, G., Kossmann, D., Matter, R., Zhou, C.: Ajax crawl: making ajax applications searchable. In: Proceedings of the 2009 IEEE International Conference on Data Engineering, ICDE ’09, pp. 78–89. IEEE Computer Society, Washington, DC (2009)
Edwards, J., McCurley, K., Tomlin, J.: An adaptive model for optimizing performance of an incremental web crawler (2001)
Exposto, J., Macedo, J., Pina, A., Alves, A., Rufino, J.: Geographical partition for distributed web crawling. In: Proceedings of the 2005 Workshop on Geographic Information Retrieval, GIR ’05, pp. 55–60. ACM, New York (2005)
Frey, G.: Indexing ajax web applications. Master’s thesis, ETH Zurich (2007). http://e-collection.library.ethz.ch/eserv/eth:30111/eth-30111-01.pdf
Heydon, A., Najork, M.: Mercator: a scalable, extensible web crawler. World Wide Web 2, 219–9 (1999)
Li, J., Loo, B., Hellerstein, J., Kaashoek, M., Karger, D., Morris, R.: On the feasibility of peer-to-peer web indexing and search. In: Kaashoek, M.F., Stoica, I. (eds.) IPTPS 2003. LNCS, vol. 2735, pp. 207–15. Springer, Heidelberg (2003)
Lo, J., Wohlstadter, E., Mesbah, A.: Imagen: runtime migration of browser sessions for javascript web applications. In: Proceedings of the International World Wide Web Conference (WWW), pp. 815–825. ACM (2013)
Marchetto, A., Tonella, P., Ricca, F.: State-based testing of ajax web applications. In: Proceedings of the 2008 International Conference on Software Testing, Verification, and Validation, ICST ’08, pp. 121–130. IEEE Computer Society, Washington, DC (2008)
Matter, R.: Ajax crawl: making ajax applications searchable. Master’s thesis, ETH Zurich (2008). http://e-collection.library.ethz.ch/eserv/eth:30709/eth-30709-01.pdf
Mesbah, A., Bozdag, E., van Deursen, A.: Crawling ajax by inferring user interface state changes. In: Proceedings of the 2008 Eighth International Conference on Web Engineering, ICWE ’08, pp. 122–134. IEEE Computer Society, Washington, DC (2008)
Mesbah, A., van Deursen, A., Lenselink, S.: Crawling ajax-based web applications through dynamic analysis of user interface state changes. TWEB 6(1), 3 (2012)
Fard, A.M., Mesbah, A.: Feedback-directed exploration of web applications to derive test models. In: Proceedings of the 24th IEEE International Symposium on Software Reliability Engineering (ISSRE), p. 10. IEEE Computer Society (2013)
Mirtaheri, S.M., Zou, D., Bochmann, G.V., Jourdan, G.-V., Onut, I.V.: Dist-RIA crawler: a distributed crawler for rich internet applications. In: Proceedings of the 8th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (2013)
Peng, Z., He, N., Jiang, C., Li, Z., Xu, L., Li, Y., Ren, Y.: Graph-based ajax crawl: mining data from rich internet applications. In: 2012 International Conference on Computer Science and Electronics Engineering (ICCSEE), vol. 3, pp. 590–594 (2012)
Shkapenyuk, V., Suel, T.: Design and implementation of a high-performance distributed web crawler. In: Proceedings of the International Conference on Data Engineering, pp. 357–368 (2002)
Lee, H.-T., Leonard, D., Wang, X., Loguinov, D.: IRLbot: scaling to 6 billion pages and beyond (2008)

Acknowledgments

This work is largely supported by the IBM® Center for Advanced Studies, the IBM Ottawa Lab and the Natural Sciences and Engineering Research Council of Canada (NSERC). Special thanks to Sara Baghbanzadeh.

Author information

Authors and Affiliations

School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, Ontario, Canada
Seyed M. Mirtaheri, Gregor von Bochmann & Guy-Vincent Jourdan
Security AppScan® Enterprise, IBM, 770 Palladium Dr, Ottawa, Ontario, Canada
Iosif Viorel Onut

Corresponding author

Correspondence to Seyed M. Mirtaheri.

Editor information

Editors and Affiliations

Northeastern University, Boston, Massachusetts, USA
Guevara Noubir
Université de Rennes 1, Rennes Cedex, France
Michel Raynal

Trademarks

IBM, the IBM logo, ibm.com and AppScan are trademarks or registered trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml. Intel, and Intel Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates.

About this paper

Cite this paper

Mirtaheri, S.M., von Bochmann, G., Jourdan, GV., Onut, I.V. (2014). GDist-RIA Crawler: A Greedy Distributed Crawler for Rich Internet Applications. In: Noubir, G., Raynal, M. (eds) Networked Systems. NETYS 2014. Lecture Notes in Computer Science, vol 8593. Springer, Cham. https://doi.org/10.1007/978-3-319-09581-3_14

DOI: https://doi.org/10.1007/978-3-319-09581-3_14
Published: 03 August 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-09580-6
Online ISBN: 978-3-319-09581-3
eBook Packages: Computer Science, Computer Science (R0)
