Crawling Ranked Deep Web Data Sources

Conference paper
First Online: 25 December 2015

pp 384–398
Web Information Systems Engineering – WISE 2015 (WISE 2015)

Yan Wang, Yaxin Li, Nannan Pi & Jianguo Lu

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9418)

Included in the following conference series:

International Conference on Web Information Systems Engineering

Abstract

In the era of big data, the vast majority of data do not come from the surface web, that is, the web interconnected by hyperlinks and indexed by most general-purpose search engines. Instead, valuable data often reside in the deep web, the web hidden behind query interfaces. Because deep web data are often of high value, crawling deep web data sources has been an active line of research over the past decade. However, most existing crawling methods assume that all matched documents are returned. In practice, many data sources rank the matched documents and return only the top k matches. When conventional methods are applied to such ranked data sources, popular queries that match more than k documents cause large redundancy. This paper proposes a document frequency (df) based algorithm that exploits queries whose document frequencies fall within a specified range. The algorithm is extensively tested on a variety of datasets and compared with two existing algorithms. We demonstrate that our method outperforms the two algorithms by 58% and 90% on average, respectively.
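
The top-k effect described in the abstract can be illustrated with a small toy simulation. This is only a sketch under stated assumptions, not the algorithm from the paper: the corpus, the static ranking by document id, the function names (build_index, ranked_search, crawl, select_by_df) and the df range are all hypothetical, and the document frequencies are read directly from the toy index, whereas a real crawler would have to estimate them.

```python
def build_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = {}
    for doc_id, terms in docs.items():
        for term in terms:
            index.setdefault(term, set()).add(doc_id)
    return index


def ranked_search(index, term, k):
    """Toy ranked source: return only the top-k matches.

    Assumption: ranking is static, by ascending document id.
    """
    return sorted(index.get(term, ()))[:k]


def crawl(index, queries, k):
    """Issue the queries in order.

    Returns (set of unique documents harvested, total documents returned).
    """
    seen, cost = set(), 0
    for query in queries:
        results = ranked_search(index, query, k)
        cost += len(results)
        seen.update(results)
    return seen, cost


def select_by_df(df, low, high):
    """Keep the terms whose document frequency lies in [low, high]."""
    return [term for term, freq in sorted(df.items()) if low <= freq <= high]


if __name__ == "__main__":
    # Hypothetical corpus: 20 documents, four popular terms that occur in
    # every document, and five rarer terms that each occur in 4 documents.
    docs = {i: {"hot0", "hot1", "hot2", "hot3", f"t{i // 4}"} for i in range(20)}
    index = build_index(docs)
    df = {term: len(ids) for term, ids in index.items()}
    k = 5

    popular = [term for term, freq in sorted(df.items()) if freq > k]
    in_range = select_by_df(df, 1, k)

    for name, queries in (("popular", popular), ("df in range", in_range)):
        seen, cost = crawl(index, queries, k)
        print(name, "unique:", len(seen), "returned:", cost)
```

In this toy setting every popular term overflows the top-k limit and keeps returning the same highest-ranked documents, while the terms whose df does not exceed k return all of their matches, so the same number of returned documents covers far more of the collection.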

Notes

1. In this paper, we use the words ‘term’ and ‘query’ interchangeably; the minor difference is that a query is a term that has been issued.

Acknowledgements

This work is supported by NSFC (No. 61440020 and No. 61309029), NSERC, Programs for Innovation Research and 121 Project in Central University of Finance and Economics.

Author information

Authors and Affiliations

School of Information, Central University of Finance and Economics, Beijing, China
Yan Wang, Yaxin Li & Nannan Pi
School of Computer Science, University of Windsor, Windsor, Canada
Jianguo Lu

Corresponding author

Correspondence to Yan Wang.

Editor information

Editors and Affiliations

Tsinghua University, Beijing, China
Jianyong Wang
Poznan University of Economics, Poznan, Poland
Wojciech Cellary
Florida Atlantic University, Boca Raton, Florida, USA
Dingding Wang
Victoria University, Melbourne, Australia
Hua Wang
School of Computing & Information, Florida International University, Miami, Florida, USA
Shu-Ching Chen
Florida International University, Miami, Florida, USA
Tao Li
Victoria University, Melbourne, Victoria, Australia
Yanchun Zhang

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Wang, Y., Li, Y., Pi, N., Lu, J. (2015). Crawling Ranked Deep Web Data Sources. In: Wang, J., et al. Web Information Systems Engineering – WISE 2015. WISE 2015. Lecture Notes in Computer Science, vol 9418. Springer, Cham. https://doi.org/10.1007/978-3-319-26190-4_26

DOI: https://doi.org/10.1007/978-3-319-26190-4_26
Published: 25 December 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-26189-8
Online ISBN: 978-3-319-26190-4
eBook Packages: Computer Science, Computer Science (R0)
