Abstract
In the era of big data, the vast majority of data do not come from the surface web, the part of the web that is interconnected by hyperlinks and indexed by most general-purpose search engines. Instead, much of the valuable data resides in the deep web, the part of the web that is hidden behind query interfaces. Because deep web data are often of high value, a line of research over the past decade has addressed crawling deep web data sources. However, most existing crawling methods assume that all matched documents are returned. In practice, many data sources rank the matched documents and return only the top k matches. When conventional methods are applied to such ranked data sources, popular queries that match more than k documents cause large redundancy. This paper proposes a document frequency (df) based algorithm that exploits the queries whose document frequencies fall within a specified range. The algorithm is extensively tested on a variety of datasets and compared with two existing algorithms. We demonstrate that our method outperforms the two algorithms by 58% and 90% on average, respectively.
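The idea stated above, issuing only queries whose document frequency lies within a given range against a source that returns at most k matches, can be illustrated with a small sketch. This is not the paper's algorithm, whose details are not given here. The toy corpus, the function names (`build_index`, `top_k_search`, `select_by_df`, `crawl`), the ranking rule (lowest document id first), and the assumption that exact document frequencies are known from a local index are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def top_k_search(index, term, k):
    """Simulated ranked source: return at most k matches.

    Assumption: documents with lower ids rank first.
    """
    return index.get(term, [])[:k]


def select_by_df(index, df_min, df_max):
    """Keep terms whose document frequency lies in [df_min, df_max]."""
    return sorted(t for t, ids in index.items() if df_min <= len(ids) <= df_max)


def crawl(index, queries, k):
    """Issue each query; return the unique documents seen and the total returned."""
    seen = set()
    returned = 0
    for query in queries:
        results = top_k_search(index, query, k)
        returned += len(results)
        seen.update(results)
    return seen, returned


# Hypothetical toy corpus; the source returns only the top k = 2 matches.
docs = [
    "deep web crawling",
    "deep web ranking",
    "deep learning models",
    "web search engines",
    "deep query selection",
    "ranking query results",
]
k = 2
index = build_index(docs)

# Terms matching more than k documents (here "deep" and "web") are skipped,
# because the source would truncate their result lists.
queries = select_by_df(index, df_min=2, df_max=k)
seen, returned = crawl(index, queries, k)
print(queries, sorted(seen), returned)
```

Capping the upper bound of the range at k means no selected query has its result list truncated by the source. In a real crawl the document frequencies would have to be estimated, for example from a sample of the source, rather than read from a complete index as in this sketch.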
Notes
1. In this paper, we use the words ‘term’ and ‘query’ interchangeably; the minor difference is that a query is a term that has been issued.
Acknowledgements
This work is supported by NSFC (No. 61440020 and No. 61309029), NSERC, and the Programs for Innovation Research and the 121 Project at the Central University of Finance and Economics.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Wang, Y., Li, Y., Pi, N., Lu, J. (2015). Crawling Ranked Deep Web Data Sources. In: Wang, J., et al. Web Information Systems Engineering – WISE 2015. WISE 2015. Lecture Notes in Computer Science, vol 9418. Springer, Cham. https://doi.org/10.1007/978-3-319-26190-4_26
DOI: https://doi.org/10.1007/978-3-319-26190-4_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-26189-8
Online ISBN: 978-3-319-26190-4
eBook Packages: Computer Science (R0)