IEPAD: Information extraction based on pattern discovery

CH Chang, SC Lui - Proceedings of the 10th international conference on World Wide Web, 2001 - dl.acm.org

Abstract

Research in information extraction (IE) concerns the generation of wrappers that can extract particular information from semistructured Web documents. Similar to compiler generation, the extractor is actually a driver program, which is accompanied by the generated extraction rule. Previous work in this field aims to learn extraction rules from users’ training examples. In this paper, we propose IEPAD, a system that automatically discovers extraction rules from Web pages. The system can automatically identify record boundaries by repeated pattern mining and multiple sequence alignment. The discovery of repeated patterns is realized through a data structure called PAT trees. Additionally, repeated patterns are further extended by pattern alignment to cover all record instances. This new approach to IE involves neither human effort nor content-dependent heuristics. Experimental results show that the constructed extraction rules can achieve 97 percent extraction over fourteen popular search engines.
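
The repeated-pattern idea in the abstract can be illustrated with a small sketch. It is not the IEPAD implementation: the PAT tree is replaced by a naive n-gram count over a tag-token sequence, the selection rule (most frequent, then longest) is a simplistic stand-in for the paper's own criteria, and the alignment step is omitted. The names `tokenize`, `repeated_patterns`, and `best_pattern` are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(html):
    """Abstract an HTML string into tag tokens, with any non-blank text as TEXT."""
    tokens = []
    for m in re.finditer(r"<(/?)(\w+)[^>]*>|([^<]+)", html):
        if m.group(2):  # matched a tag
            tokens.append(f"<{m.group(1)}{m.group(2).upper()}>")
        elif m.group(3).strip():  # matched non-blank text
            tokens.append("TEXT")
    return tokens


def repeated_patterns(tokens, min_len=2, min_count=2):
    """Map each token n-gram seen at least min_count times to its start positions.

    Naive quadratic enumeration, used here in place of a PAT tree (assumption).
    """
    found = defaultdict(list)
    for n in range(min_len, len(tokens) // min_count + 1):
        for i in range(len(tokens) - n + 1):
            found[tuple(tokens[i:i + n])].append(i)
    return {p: pos for p, pos in found.items() if len(pos) >= min_count}


def best_pattern(patterns):
    """Pick the most frequent pattern, breaking ties by length (illustrative heuristic)."""
    return max(patterns, key=lambda p: (len(patterns[p]), len(p)))


html = ("<ul><li><b>A</b> one</li><li><b>B</b> two</li>"
        "<li><b>C</b> three</li></ul>")
tokens = tokenize(html)
patterns = repeated_patterns(tokens)
pattern = best_pattern(patterns)
starts = patterns[pattern]
print(pattern, starts)
```

On this toy page the selected pattern spans one list item, so its start positions mark candidate record boundaries. Real pages contain optional or missing fields, which is where the alignment step described in the abstract would be needed.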
