Ziyan Zhou

Rochester Institute of Technology, Computer Science, Alumnus

Papers by Ziyan Zhou

On-Demand Scalable Timer Wheel

Drafts by Ziyan Zhou

Web Content Extraction Through Machine Learning

Web content extraction is a key technology for enabling an array of applications aimed at understanding the web. While automated web extraction has been studied extensively, existing work often focuses on extracting structured data that appears multiple times on a single webpage, like product catalogs. This project aims to extract less structured web content, like news articles, that appears only once in noisy webpages. Our approach classifies text blocks using a mixture of visual and language-independent features. In addition, a pipeline is devised to automatically label data points through clustering: each cluster is scored based on its relevance to the webpage description extracted from the meta tags, and data points in the best cluster are selected as positive training examples.
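
The automatic labeling step described in the abstract can be illustrated with a minimal sketch. It assumes the text blocks have already been clustered, and it scores each cluster by simple word overlap with the meta description. That scoring rule, the function names (`tokens`, `label_blocks`), and the sample data are all hypothetical stand-ins, not the method used in the paper.

```python
import re
from collections import defaultdict


def tokens(text):
    """Lowercase word tokens as a set (a crude, assumed relevance signal)."""
    return set(re.findall(r"\w+", text.lower()))


def label_blocks(blocks, cluster_ids, description):
    """Return one 0/1 label per block: 1 for blocks in the best-scoring cluster.

    blocks       -- text-block strings from one webpage
    cluster_ids  -- cluster assignment per block (clustering is assumed done)
    description  -- text of the page's meta description
    """
    desc = tokens(description)

    # Pool the words of all blocks that share a cluster.
    cluster_tokens = defaultdict(set)
    for block, cid in zip(blocks, cluster_ids):
        cluster_tokens[cid] |= tokens(block)

    def score(cid):
        # Assumed score: fraction of description words found in the cluster.
        return len(cluster_tokens[cid] & desc) / len(desc) if desc else 0.0

    best = max(cluster_tokens, key=score)
    return [1 if cid == best else 0 for cid in cluster_ids]


# Made-up example page: navigation, article title and body, footer.
blocks = [
    "Home | Sports | Weather | Sign in",
    "City council approves new bike lanes downtown",
    "The council voted on Tuesday to add bike lanes to Main Street.",
    "Copyright 2015. All rights reserved.",
]
cluster_ids = [0, 1, 1, 2]
description = "City council votes to add bike lanes downtown"

labels = label_blocks(blocks, cluster_ids, description)
print(labels)
```

In this sketch the blocks labeled 1 would serve as positive training examples for the text-block classifier, while the features the classifier uses (visual and language-independent, per the abstract) are outside its scope.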
