Abstract
With the tremendous amount of information that becomes available on the Web on a daily basis, the ability to quickly develop information agents has become a crucial problem. A vital component of any Web-based information agent is a set of wrappers that can extract the relevant data from semistructured information sources. Our novel approach to wrapper induction is based on the idea of hierarchical information extraction, which turns the hard problem of extracting data from an arbitrarily complex document into a series of simpler extraction tasks. We introduce an inductive algorithm, STALKER, that generates high-accuracy extraction rules based on user-labeled training examples. Labeling the training data represents the major bottleneck in using wrapper induction techniques, and our experimental results show that STALKER requires up to two orders of magnitude fewer examples than other algorithms. Furthermore, STALKER can wrap information sources that could not be wrapped by existing inductive techniques.
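To make the idea of hierarchical extraction concrete, the following is a minimal sketch, not the STALKER algorithm itself. It assumes that each extraction rule is a hand-written sequence of literal landmarks consumed in order, and it works on raw characters with `str.find`. The page content, the function names (`skip_to`, `extract`, `extract_all`, `wrap_page`), and the rules are all hypothetical illustrations; the rule induction from labeled examples that the abstract describes is not shown.

```python
from typing import Dict, List, Optional

# Hypothetical semistructured page: one name, then a list of address/phone items.
PAGE = (
    "<h1>Name: Yala</h1>"
    "<ul>"
    "<li>Address: <i>4000 Colfax, Phoenix</i> Phone: <b>(602) 508-1570</b></li>"
    "<li>Address: <i>523 Vernon, Las Vegas</i> Phone: <b>(702) 578-2293</b></li>"
    "</ul>"
)


def skip_to(text: str, landmarks: List[str], start: int = 0) -> Optional[int]:
    """Consume each landmark in order; return the index just past the last one."""
    pos = start
    for landmark in landmarks:
        found = text.find(landmark, pos)
        if found == -1:
            return None  # the rule does not match this text
        pos = found + len(landmark)
    return pos


def extract(text: str, start_rule: List[str], end_landmark: str) -> Optional[str]:
    """Extract the first span that follows start_rule and precedes end_landmark."""
    begin = skip_to(text, start_rule)
    if begin is None:
        return None
    end = text.find(end_landmark, begin)
    return None if end == -1 else text[begin:end]


def extract_all(text: str, start_rule: List[str], end_landmark: str) -> List[str]:
    """Apply the same rule repeatedly to split a list into its items."""
    items: List[str] = []
    pos = 0
    while True:
        begin = skip_to(text, start_rule, pos)
        if begin is None:
            break
        end = text.find(end_landmark, begin)
        if end == -1:
            break
        items.append(text[begin:end])
        pos = end + len(end_landmark)
    return items


def wrap_page(page: str) -> Dict[str, object]:
    """Extract top-down: page -> list -> items -> fields of each item."""
    name = extract(page, ["Name: "], "</h1>")
    list_block = extract(page, ["<ul>"], "</ul>") or ""
    locations = []
    for item in extract_all(list_block, ["<li>"], "</li>"):
        # Each field rule only has to work inside one small item,
        # not inside the whole document.
        locations.append({
            "address": extract(item, ["Address: ", "<i>"], "</i>"),
            "phone": extract(item, ["Phone: ", "<b>"], "</b>"),
        })
    return {"name": name, "locations": locations}


if __name__ == "__main__":
    print(wrap_page(PAGE))
```

Each rule in this sketch sees only the fragment produced by its parent, which is the sense in which one hard extraction problem becomes a series of simpler ones. A learned wrapper would replace the hand-written landmark lists with rules induced from user-labeled examples.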
Cite this article
Muslea, I., Minton, S. & Knoblock, C.A. Hierarchical Wrapper Induction for Semistructured Information Sources. Autonomous Agents and Multi-Agent Systems 4, 93–114 (2001). https://doi.org/10.1023/A:1010022931168
DOI: https://doi.org/10.1023/A:1010022931168