Fabio Crestani

Università della Svizzera Italiana, Faculty of Informatics, Faculty Member


Papers by Fabio Crestani

Management of uncertainty and imprecision in multimedia information systems: Introducing this special issue

The technology of information access has gone through a slow but steady process of adapting to the growth of availability of electronically stored information. When libraries were small, access to a piece of information could be achieved by asking the librarian, a "wise sage" who was supposed to have read every book in the library. The librarian could tell you which book contained the information you needed and where the book was located.

Automatic construction of hypertexts for self-referencing: the Hyper-TextBook project

We present the results of the Hyper-TextBook project. The aim of the project was to design, develop and test a methodology and a tool for the fully automatic authoring of hypertexts from full-text documents. The target documents were textbooks because of their specific characteristics and usage, and the project aimed at automatically creating hypertextual versions of textbooks, i.e. hyper-textbooks. In this first phase of the project, hyper-textbooks have been designed and implemented to be used mostly as self-reference sources.

On the generation of rich content metadata from social media

This contribution proposes a framework to generate auxiliary rich TV content metadata by processing social network data. Based on simple criteria to identify authoritative social media sources, we have analysed Twitter short messages relating to TV program content and devised a method to compute their informative value. We have extracted dozens of features and characterized such social data in terms of quality and relevancy.

Retrieval of spoken documents: first experiences

Combination of semantic and phonetic term similarity for spoken document retrieval and spoken query processing

In classical Information Retrieval systems a relevant document will not be retrieved in response to a query if the document and query representations do not share at least one term. This problem is known as “term mismatch”. A similar problem can be found in spoken document retrieval and spoken query processing, where terms misrecognized by the speech recognition process can hinder the retrieval of potentially relevant documents. I will call this problem “term misrecognition”, by analogy to the term mismatch problem.
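
The term mismatch problem described in this abstract can be illustrated with a minimal sketch. It is not taken from the paper: the tokenizer, the two toy documents and the query are hypothetical, and matching is reduced to plain set overlap purely for illustration.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return set(text.lower().split())


def shares_term(query, document):
    """Return True when the query and the document have at least one term in common."""
    return bool(tokenize(query) & tokenize(document))


# Hypothetical toy collection: both documents are arguably relevant to the query.
documents = {
    "d1": "car repair manual",
    "d2": "automobile maintenance guide",
}
query = "car servicing"

# d2 shares no term with the query, so a pure term-matching system misses it.
matches = {doc_id: shares_term(query, text) for doc_id, text in documents.items()}
print(matches)
```

In the spoken setting the abstract describes, the same failure would arise if a recogniser transcribed a spoken term as a different word, leaving no shared term to match on.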

Strathprints Institutional Repository

In large-scale distributed retrieval, challenges of latency, heterogeneity, and dynamicity emphasise the importance of infrastructural support in reducing the development costs of state-of-the-art solutions. We present a service-based infrastructure for distributed retrieval which blends middleware facilities and a design framework to 'lift' the resource sharing approach and the computational services of a European Grid platform into the domain of e-Science applications.

Report on the 24th European colloquium on information retrieval research (ECIR 2002)

Information retrieval (IR) is the science and technology concerned with the effective and efficient retrieval of information for subsequent use by interested parties. The central problem in IR is the quest to find the set of relevant documents, amongst a large collection, containing the information sought, thereby satisfying an information need usually expressed by a user with a query. The documents may be objects bearing information in any medium: text, image, audio, or indeed a mixture of all three.

A Language Modelling approach to linking criminal styles with offender characteristics

The ability to infer the characteristics of offenders from their criminal behaviour ('offender profiling') has only been partially successful since it has relied on subjective judgments based on limited data. Words and structured data used in crime descriptions recorded by the police relate to behavioural features. Thus Language Modelling was applied to an existing police archive to link behavioural features with significant characteristics of offenders. Both multinomial and multiple Bernoulli models were used.

Context representation for web search results

Context has long been considered very useful to help the user assess the actual relevance of a document. In web searching, for example, context can help assess the relevance of a web page by showing how the page is related to other pages in the same web site. Such information is very difficult to convey and visualize in a user-friendly way.

Statistics of online user-generated short documents

User-generated short documents assume an important role in online communication due to the established utilization of social networks and real-time text messaging on the Internet. In this paper we compare the statistics of different online user-generated datasets and traditional TREC collections, investigating their similarities and differences. Our results support the applicability of traditional techniques also to user-generated short documents, albeit with proper preprocessing.

Towards query log based personalization using topic models

We investigate the utility of topic models for the task of personalizing search results based on information present in a large query log. We define generative models that take both the user and the clicked document into account when estimating the probability of query terms. These models can then be used to rank documents by their likelihood given a particular query and user pair.

Editorial message: special track on information access and retrieval

Information Retrieval (IR) aims at modelling, designing and implementing systems able to provide fast and effective content-based access to a large amount of information. Information can be of any kind: textual, visual, or auditory. The aim of such systems is to estimate the relevance of documents to a user information need. This is a very hard and complex task for many different reasons that a large volume of research has attempted to explain and tackle.

Proximity-based opinion retrieval

Blog post opinion retrieval aims at finding blog posts that are relevant and opinionated about a user's query. In this paper we propose a simple probabilistic model for assigning relevant opinion scores to documents. The key problem is how to capture opinion expressions in the document that are related to the query topic. Current solutions enrich general opinion lexicons by finding query-specific opinion lexicons using pseudo-relevance feedback on external corpora or the collection itself.

Domain knowledge acquisition for Information Retrieval using neural networks

This paper presents the results of some experiments investigating the use of Neural Networks in the learning engine of a Connectionist Information Retrieval system called CIRS. CIRS uses the learning and generalisation capabilities of the Back Propagation learning algorithm to acquire and use application domain knowledge in the form of a sub-symbolic knowledge representation. This paper describes the architecture of CIRS and reports on experiments on three different learning strategies.

Resource selection and data fusion in multimedia distributed digital libraries

MIND is an EU-funded project that addresses some of the issues that arise when people have routine access to a large number (possibly thousands) of heterogeneous and distributed multimedia Digital Libraries (DLs) over the Internet and the Web. When so many DLs are available, the first information access task is resource selection. This is predominantly an ineffective manual task as users are unaware of the contents of each individual library in terms of quantity, quality, information type, provenance and likely relevance.

Bayesian latent variable models for collaborative item rating prediction

Collaborative filtering systems based on ratings make it easier for users to find content of interest on the Web and as such they constitute an area of much research. In this paper we first present a Bayesian latent variable model for rating prediction that models ratings over each user's latent interests and also each item's latent topics.

Distributed information retrieval: a multi-objective resource selection approach

Information retrieval is becoming increasingly concerned with resource selection and data fusion ...

Blog distillation using random walks

This paper addresses the blog distillation problem: given a user query, find the blogs most related to the query topic. We model the blogosphere as a single graph that includes extra information besides the content of the posts. By performing a random walk on this graph we extract the most relevant blogs for each query. Our experiments on the TREC'07 data set show a 15% improvement in MAP and an 8% improvement in Precision@10 over the Language Modeling baseline.
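
As a rough illustration of ranking nodes by a random walk on a graph, the sketch below runs a generic random walk with restart by power iteration. It is not the model from the paper, whose graph construction is not given here: the function name, the toy graph of blogs and posts, the restart distribution and the damping value of 0.85 are all assumptions made for the example.

```python
def random_walk_scores(graph, restart, damping=0.85, iterations=50):
    """Approximate the stationary distribution of a random walk with restart.

    graph   : dict mapping each node to a list of its out-neighbours
    restart : dict giving the restart probability of each node (sums to 1)
    """
    nodes = list(graph)
    scores = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # With probability (1 - damping) the walker jumps to a restart node.
        new_scores = {n: (1 - damping) * restart.get(n, 0.0) for n in nodes}
        for n in nodes:
            out_links = graph[n]
            if out_links:
                # Otherwise it follows one outgoing edge chosen uniformly.
                share = damping * scores[n] / len(out_links)
                for m in out_links:
                    new_scores[m] += share
            else:
                # A node without out-links sends its mass back to the restart nodes.
                for m in nodes:
                    new_scores[m] += damping * scores[n] * restart.get(m, 0.0)
        scores = new_scores
    return scores


# Hypothetical toy graph: posts link to their blog and blogs link back to a post.
graph = {
    "blogA": ["post1"],
    "post1": ["blogA"],
    "post2": ["blogA"],
    "blogB": ["post3"],
    "post3": ["blogB"],
}
# Assume post1 is the only post matching the query, so the walk restarts there.
restart = {"post1": 1.0}

scores = random_walk_scores(graph, restart)
blogs_ranked = sorted(["blogA", "blogB"], key=scores.get, reverse=True)
print(blogs_ranked)
```

Restarting at query-matching posts is one simple way to make the walk query-dependent; other choices of restart distribution and edge weights are equally possible.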

PENG: integrated search of distributed news archives

News professionals, such as Radio, TV and Newsprint journalists and editors, now have at their disposal a large and varied collection of digital information resources. News Agencies such as ANSA, Reuters and AP can, for example, provide live feeds of breaking stories directly into a newsroom. Journalists can also search and browse a variety of online news archives, digital libraries and web repositories when researching and compiling a report.

Mobile and Ubiquitous Information Access: Mobile HCI 2003 International Workshop, Udine, Italy, September 8, 2003, Revised and Invited Papers

This book constitutes the thoroughly refereed post-proceedings of the International Workshop on Mobile and Ubiquitous Information Access held in Udine, Italy in September 2003 during Mobile HCI 2003. Besides selected and revised workshop papers, several papers were specially invited to complete coverage of all relevant issues and extend the volume to a more representative survey of the state of the art in the area.
