QProber: A system for automatic classification of hidden-Web databases

Published: 01 January 2003

    Abstract

    The contents of many valuable Web-accessible databases are only available through search interfaces and are hence invisible to traditional Web "crawlers." Recently, commercial Web sites have started to manually organize Web-accessible databases into Yahoo!-like hierarchical classification schemes. Here we introduce QProber, a modular system that automates this classification process by using a small number of query probes, generated by document classifiers. QProber can use a variety of types of classifiers to generate the probes. To classify a database, QProber does not retrieve or inspect any documents or pages from the database, but rather just exploits the number of matches that each query probe generates at the database in question. We have conducted an extensive experimental evaluation of QProber over collections of real documents, experimenting with different types of document classifiers and retrieval models. We have also tested our system with over one hundred Web-accessible databases. Our experiments show that our system has low overhead and achieves high classification accuracy across a variety of databases.
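The count-based idea in the abstract can be illustrated with a minimal sketch. It is not QProber's actual implementation. The probe terms, the `count_matches` callback, the `min_share` threshold, and the toy database are all hypothetical. The sketch only assumes, as the abstract states, that classification relies on the number of matches each query probe returns and never on retrieved documents.

```python
from typing import Callable, Dict, List

# Hypothetical probes. In the described system, probes are generated by
# document classifiers; here they are hand-written for illustration.
PROBES: Dict[str, List[str]] = {
    "Health": ["cancer", "diabetes treatment"],
    "Sports": ["soccer", "basketball playoffs"],
    "Computers": ["linux kernel", "java compiler"],
}


def classify_database(
    count_matches: Callable[[str], int],
    probes: Dict[str, List[str]],
    min_share: float = 0.4,  # assumed threshold, not taken from the source
) -> List[str]:
    """Return categories whose probes account for at least `min_share`
    of all probe matches reported by the database."""
    # Only match counts are used; no documents are fetched or inspected.
    totals = {
        category: sum(count_matches(query) for query in queries)
        for category, queries in probes.items()
    }
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []  # no probe matched, so no evidence for any category
    return sorted(
        category
        for category, total in totals.items()
        if total / grand_total >= min_share
    )


def make_toy_database(counts: Dict[str, int]) -> Callable[[str], int]:
    """Stand-in for a search interface that reports a match count per query."""
    def count_matches(query: str) -> int:
        return counts.get(query, 0)
    return count_matches


toy_db = make_toy_database(
    {"cancer": 900, "diabetes treatment": 400, "soccer": 30, "linux kernel": 20}
)
result = classify_database(toy_db, PROBES)
print(result)
```

In a real deployment, `count_matches` would wrap a database's search interface and parse the reported number of hits; that wrapper is left out here because it depends on each site.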

    Published In

    ACM Transactions on Information Systems, Volume 21, Issue 1
    January 2003
    131 pages
    ISSN: 1046-8188
    EISSN: 1558-2868
    DOI: 10.1145/635484

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Database classification
    2. Web databases
    3. hidden Web

    Qualifiers

    • Article

