We present a system that gathers and analyzes online discussion as it relates to consumer products. Weblogs and online message boards provide forums that record the voice of the public. Woven into this discussion is a wide range of opinion and commentary about consumer products. Given its volume, format and content, the appropriate approach to understanding this data is large-scale web and text data mining. Using a wide variety of state-of-the-art techniques, including crawling, wrapping, text classification and computational linguistics, the system gathers and annotates online discussion within a framework that supports interactive analysis and yields marketing intelligence for our customers.
We describe a prototype system for assigning table cells to their proper place in the logical structure of the table, based on a simple model of table structure combined with a number of measures of cohesion between cells. A framework is presented for examining the effect of particular variables on the performance of the system, and preliminary results are presented showing the effect of cohesion measures based on the simplest domain-independent analyses, with the aim of allowing future comparison with more knowledge-intensive analyses based on Natural Language Processing. These baseline results suggest that very simple string-based cohesion measures are not sufficient to support the extraction of tuples as we require. Future work will pursue the aim of more adequate approximations to a notional subtype/supertype definition of the relationship between value cells and label cells. 1 Introduction Most representations of tabular data are layout or geometry-based; this is the case whether t...
Retrieving documents by subject matter is the general goal of information retrieval and other content access systems. There are other aspects of textual content, however, which form equally valid selection criteria. One such aspect is that of sentiment or polarity, indicating the user's opinion or emotional relationship with some topic. Recent work in this area has treated polarity effectively as a discrete aspect of text. In this paper we present a lightweight but robust approach to combining topic and polarity, thus enabling content access systems to select content based on a certain opinion about a certain topic.
Using data gathered from blogs, this work seeks to understand the structure and formation of social networks, and the patterns of information propagation through these networks. Blogs have become an important medium of communication and information on the World Wide Web. Due to their
Weblogs and message boards provide online forums for discussion that record the voice of the public. Woven into this mass of discussion is a wide range of opinion and commentary about consumer products. This presents an opportunity for companies to understand and respond to the consumer by analyzing this unsolicited feedback. Given the volume, format and content of the data, the appropriate approach to understanding this data is to use large-scale web and text data mining technologies. This paper argues that applications for mining large volumes of textual data for marketing intelligence should provide two key elements: a suite of powerful mining and visualization technologies and an interactive analysis environment which allows for rapid generation and testing of hypotheses. This paper presents such a system that gathers and annotates online discussion relating to consumer products using a wide variety of state-of-the-art techniques, including crawling, wrapping, search, text classifi...
This paper describes the collection of a corpus of documents that contain one or more tables. Some results are then presented which go some way to characterising the table in terms of its relationship to the content of the document it appears in.
Over the past few years, weblogs have emerged as a new communication and publication medium on the Internet. In this paper, we describe the application of data mining, information extraction and NLP algorithms for discovering trends across our subset of approximately 100,000 weblogs. We publish daily lists of key persons, key phrases, and key paragraphs to a public web site, BlogPulse.com. In addition, we maintain a searchable index of weblog entries. On top of the search index, we have implemented trend search, which graphs the normalized trend line over time for a search query and provides a way to estimate the relative buzz of word of mouth for given topics over time.
Many social media platforms allow the user to provide profile information. This information is generally presented in a semi-structured manner either on a profile page or on the weblog home page itself. This paper describes a novel wrapper induction method that extracts profile data. Our ultimate goal is to estimate the geographic distribution of weblog authors, and to that end we provide an analysis of the location information discovered for each author in a large database of weblog posts.
Usenet is a decentralized discussion community predating blogs by decades. Just as there is a wide range of political blogs, many Usenet sub-communities focus on politics. However, we find these two communities are very different in terms of source content. In this work we compare linking patterns of the political Usenet with a well-known political blog and news aggregator, Memeorandum, with respect to coverage of news and blog sources, attention of stories, and timeliness of links.
This document provides a LaTeX2ε sample (of the formatting) of a paper for the International Conference on Weblogs and Social Media (ICWSM). Authors need to compile it using LaTeX2ε and BibTeX. The formatting is adapted from the ACM's stylesheet sig-alternate.cls. The current document also uses some of the wording/examples/figures from the ACM's Alternate ACM SIG Proceedings Paper in LaTeX Format. The developers of the ACM stylesheet have made an effort to include lots of "goodies", such as a subtitle, footnotes on title, subtitle and authors, as well as in the text, and optional components (e.g., Appendices), not to mention examples of equations, theorems, tables and figures. If the abstract is longer than half the textheight perhaps you are trying to say too much.
Here we report some surprising findings about the blog linking and information propagation structure, after we analyzed one of the largest available datasets, with 45,000 blogs and ~2.2 million blog-postings. Our analysis also sheds light on how rumors, viruses, and ideas propagate over social and computer networks. We also present a simple model that mimics the spread of information on the blogosphere, and produces information cascades very similar to those found in real life.
Papers by Matthew Hurst