QProber: A system for automatic classification of hidden-Web databases

Authors:

Panagiotis G. Ipeirotis,

Mehran Sahami

ACM Transactions on Information Systems (TOIS), Volume 21, Issue 1

Pages 1 - 41

https://doi.org/10.1145/635484.635485

Published: 01 January 2003

Abstract

The contents of many valuable Web-accessible databases are only available through search interfaces and are hence invisible to traditional Web "crawlers." Recently, commercial Web sites have started to manually organize Web-accessible databases into Yahoo!-like hierarchical classification schemes. Here we introduce QProber, a modular system that automates this classification process by using a small number of query probes, generated by document classifiers. QProber can use a variety of types of classifiers to generate the probes. To classify a database, QProber does not retrieve or inspect any documents or pages from the database, but rather just exploits the number of matches that each query probe generates at the database in question. We have conducted an extensive experimental evaluation of QProber over collections of real documents, experimenting with different types of document classifiers and retrieval models. We have also tested our system with over one hundred Web-accessible databases. Our experiments show that our system has low overhead and achieves high classification accuracy across a variety of databases.
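
To make the probing idea concrete, here is a minimal sketch in Python of classifying a database from match counts alone. It is not the authors' implementation: the probe terms, the `count_matches` callback, the `min_share` threshold, and the fixed table of counts are all hypothetical, and the abstract does not state the actual decision rule, so a simple share-of-matches threshold is assumed here.

```python
from typing import Callable, Dict, List

# Hypothetical probes. In QProber these would be generated from a trained
# document classifier; the terms below are invented for illustration.
PROBES: Dict[str, List[str]] = {
    "Health": ["cancer", "diabetes treatment"],
    "Sports": ["baseball", "soccer league"],
    "Computers": ["linux kernel", "compiler"],
}


def classify_database(
    count_matches: Callable[[str], int],
    probes: Dict[str, List[str]],
    min_share: float = 0.4,
) -> List[str]:
    """Return categories whose probes draw at least `min_share` of all matches.

    Only the number of matches per probe is used; no documents are retrieved.
    The share threshold is an assumed rule, not one stated in the abstract.
    """
    # Sum the match counts of every probe belonging to each category.
    totals = {cat: sum(count_matches(q) for q in qs) for cat, qs in probes.items()}
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []  # no probe matched anything, so no category is assigned
    return [cat for cat, n in totals.items() if n / grand_total >= min_share]


# Stand-in for a real search interface: a fixed table of match counts.
FAKE_COUNTS = {
    "cancer": 900, "diabetes treatment": 300,
    "baseball": 40, "soccer league": 10,
    "linux kernel": 5, "compiler": 20,
}


def fake_count_matches(query: str) -> int:
    return FAKE_COUNTS.get(query, 0)


print(classify_database(fake_count_matches, PROBES))  # prints ['Health']
```

In a real setting `count_matches` would submit the query to the database's search interface and read the reported number of hits, which is the only signal the abstract says QProber relies on.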

Published In

ACM Transactions on Information Systems Volume 21, Issue 1

January 2003

131 pages

ISSN: 1046-8188

EISSN: 1558-2868

DOI: 10.1145/635484

Copyright © 2003 ACM.

Publisher

Association for Computing Machinery

New York, NY, United States
