IEPAD: Information extraction based on pattern discovery

CH Chang, SC Lui
Proceedings of the 10th international conference on World Wide Web, 2001 - dl.acm.org
Abstract
Research in information extraction (IE) concerns the generation of wrappers that can extract particular information from semistructured Web documents. Similar to compiler generation, the extractor is actually a driver program, which is accompanied by the generated extraction rule. Previous work in this field aims to learn extraction rules from users' training examples. In this paper, we propose IEPAD, a system that automatically discovers extraction rules from Web pages. The system can automatically identify record boundaries by repeated pattern mining and multiple sequence alignment. The discovery of repeated patterns is realized through a data structure called PAT trees. Additionally, repeated patterns are further extended by pattern alignment to cover all record instances. This new approach to IE involves neither human effort nor content-dependent heuristics. Experimental results show that the constructed extraction rules can achieve 97 percent extraction over fourteen popular search engines.
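
To make the idea of repeated pattern mining concrete, here is a minimal Python sketch. It is not the IEPAD implementation: it assumes the page has already been reduced to a sequence of tag tokens, it swaps the PAT tree for naive n-gram counting, and it omits the alignment step entirely. The function names (`repeated_patterns`, `best_record_pattern`), the sample token list, and the ranking heuristic (most occurrences first, then longest) are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def repeated_patterns(tokens, min_len=2, min_count=2):
    """Map each token n-gram seen at least min_count times to its start positions.

    Naive counting stands in for the PAT tree mentioned in the abstract.
    """
    found = {}
    for n in range(min_len, len(tokens)):
        positions = defaultdict(list)
        for i in range(len(tokens) - n + 1):
            positions[tuple(tokens[i:i + n])].append(i)
        for pattern, starts in positions.items():
            if len(starts) >= min_count:
                found[pattern] = starts
    return found


def best_record_pattern(tokens):
    """Pick one candidate record pattern.

    Assumed heuristic: prefer the pattern with the most occurrences,
    breaking ties in favour of the longer pattern.
    """
    candidates = repeated_patterns(tokens)
    pattern = max(candidates, key=lambda p: (len(candidates[p]), len(p)))
    return pattern, candidates[pattern]


# A toy result page abstracted to tag tokens (text content collapsed to TEXT).
page = ["<ul>"] + ["<li>", "<a>", "TEXT", "</a>", "</li>"] * 3 + ["</ul>"]

pattern, starts = best_record_pattern(page)
records = [page[s:s + len(pattern)] for s in starts]
print(pattern, starts, len(records))
```

On this toy input the selected pattern is the five-token list item, and its start positions mark the record boundaries. Real pages contain variations between records, which is where the alignment step described in the abstract would come in.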
ACM Digital Library