Web content extraction is a key technology for enabling an array of applications aimed at underst... more Web content extraction is a key technology for enabling an array of applications aimed at understanding the web. While automated web extraction has been studied extensively, they often focus on extracting structured data that appear multiple times on a single webpage, like product catalogs. This project aims to extract less structured web content, like news articles, that appear only once in noisy webpages. Our approach classifies text blocks using a mixture of visual and language independent features. In addition, a pipeline is devised to automatically label datapoints through clustering where each cluster is scored based on its relevance to the webpage description extracted from the meta tags, and dat-apoints in the best cluster are selected as positive training examples.
Web content extraction is a key technology for enabling an array of applications aimed at underst... more Web content extraction is a key technology for enabling an array of applications aimed at understanding the web. While automated web extraction has been studied extensively, they often focus on extracting structured data that appear multiple times on a single webpage, like product catalogs. This project aims to extract less structured web content, like news articles, that appear only once in noisy webpages. Our approach classifies text blocks using a mixture of visual and language independent features. In addition, a pipeline is devised to automatically label datapoints through clustering where each cluster is scored based on its relevance to the webpage description extracted from the meta tags, and dat-apoints in the best cluster are selected as positive training examples.
Uploads
Papers by Ziyan Zhou
Drafts by Ziyan Zhou