DOI: 10.1145/2076623.2076646
research-article

Databases on the web: national web domain survey

Published: 21 September 2011

Abstract

The deep Web, the part of the Web consisting of web pages filled with information from a myriad of online databases, is to date relatively unexplored. Even its basic characteristics, such as the number of searchable databases on the Web, are disputed. In this paper, we address the problem of accurately estimating the size of the deep Web by sampling one national web domain. We report some of the results obtained when surveying the Russian Web. The survey findings, namely the size estimates of the deep Web, could be useful for further studies on handling data in the deep Web.

Published In

IDEAS '11: Proceedings of the 15th Symposium on International Database Engineering & Applications
September 2011
274 pages
ISBN:9781450306270
DOI:10.1145/2076623
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. cluster random sampling
  2. deep web
  3. national web
  4. structured data
  5. virtual hosting
  6. web characterization
  7. web databases
  8. web measurement

Qualifiers

  • Research-article

Conference

IDEAS '11

Acceptance Rates

Overall Acceptance Rate 74 of 210 submissions, 35%

Article Metrics

  • Downloads (Last 12 months): 2
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 10 Oct 2024

Cited By

  • (2019) 2 Way Crawling. International Journal of Applied Evolutionary Computation 10:3 (34-39). DOI: 10.4018/IJAEC.2019070105. Online publication date: 1-Jul-2019
  • (2018) The Evolution of the (Hidden) Web and Its Hidden Data. The Dark Web (84-113). DOI: 10.4018/978-1-5225-3163-0.ch006. Online publication date: 2018
  • (2018) Advanced Web Crawler For Deep Web Interface Using Binary Vector & Page Rank. 2018 2nd International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC) (500-503). DOI: 10.1109/I-SMAC.2018.8653765. Online publication date: Aug-2018
  • (2017) Content extraction from deep web interfaces. 2017 International conference of Electronics, Communication and Aerospace Technology (ICECA) (349-353). DOI: 10.1109/ICECA.2017.8203702. Online publication date: Apr-2017
  • (2016) SmartCrawler: A Two-Stage Crawler for Efficiently Harvesting Deep-Web Interfaces. IEEE Transactions on Services Computing 9:4 (608-620). DOI: 10.1109/TSC.2015.2414931. Online publication date: 1-Jul-2016
  • (2016) Smart crawler for hidden web interfaces. 2016 Online International Conference on Green Engineering and Technologies (IC-GET) (1-4). DOI: 10.1109/GET.2016.7916710. Online publication date: Nov-2016
  • (2015) The Evolution of the (Hidden) Web and its Hidden Data. Design Strategies and Innovations in Multimedia Presentations (1-30). DOI: 10.4018/978-1-4666-8696-0.ch001. Online publication date: 2015
