The beginning of the post-genomic era is characterized by a rising number of publicly available genomes. The evolutionary relationships among these genomes can be captured through comparative sequence analysis, in order to identify both homologous and non-coding functional elements. In this paper we report on the ongoing BIOBITS project. It focuses on bacterial endosymbionts, since they offer an excellent model for investigating important biological events such as organelle evolution, genome reduction, and the transfer of genetic information among host lineages. The goal of BIOBITS is twofold: on the one hand, it pursues a logical data representation of genomic and proteomic components; on the other hand, it aims to develop software modules that allow the user to retrieve and analyze data in a flexible way.
In many domains (e.g., data mining, data management, data warehousing), a hierarchical organization of attribute values can help the data analysis process. Nevertheless, such hierarchical knowledge is not always available, and when it exists it may be inadequate or useless. Starting from this consideration, in this paper we tackle the problem of the automatic definition of data-driven taxonomies. To do this, we combine techniques from information theory and clustering to obtain a structured representation of the attribute values: the Contextual Attribute-Value Taxonomy (CAVT). The two main advantages of our method are that it is fully unsupervised (i.e., it requires no knowledge provided by an expert) and parameter-free. We evaluate the benefit of using CAVTs in the following two tasks: (i) the multilevel multidimensional sequential pattern mining problem, in which hierarchies are used to exploit abstraction over the data, and (ii) the table summarization problem, in which hierarchies are used to aggregate the data and give the user a sketch of the original information. To validate our approach we use real-world datasets, on which we obtain appreciable results in both quantitative and qualitative evaluations.
Microblogging is a modern communication paradigm in which users post bits of information, or “memes” as we call them, that are brief text updates or micromedia such as photos, video or audio clips. Once a user posts a meme, it becomes visible to the user community. When a user finds another user's meme interesting, she may repost it, thus allowing memes to propagate virally through the social network. In this paper we introduce the meme ranking problem: the problem of selecting which k memes (among the ones posted by their contacts) to show to users when they log into the system. The objective is to maximize the overall activity of the network, that is, the total number of reposts that occur. We characterize the problem in depth, showing that not only are exact solutions infeasible, but approximate solutions are also too expensive to be adopted in an on-line setting. Therefore we devise a set of heuristics and compare them through an extensive simulation based on the real-world Yahoo! Meme social graph, using parameters learnt from real logs of meme propagations. Our experiments demonstrate the effectiveness and feasibility of these methods.
Data represented with multiple features coming from heterogeneous domains are becoming increasingly common in real-world applications. Such data represent objects of a certain type, connected to other types of data, the features, so that the overall data schema forms a star structure of inter-relationships. Co-clustering these data involves the specification of many parameters, such as the number of clusters for the object dimension and for each of the feature domains. In this paper we present a novel co-clustering algorithm for heterogeneous star-structured data that is parameter-less: it requires neither the number of row clusters nor the number of column clusters for the given feature spaces. Our approach optimizes Goodman-Kruskal's τ, a measure of cross-association in contingency tables that evaluates the strength of the relationship between two categorical variables. We extend τ to evaluate co-clustering solutions and, in particular, we apply it in a higher-dimensional setting. We propose the algorithm CoStar, which optimizes τ by a local search approach. We assess the performance of CoStar on publicly available datasets from the textual and image domains using objective external criteria. The results show that our approach outperforms state-of-the-art methods for the co-clustering of heterogeneous data, while remaining computationally efficient.
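As an illustration of the measure mentioned in this abstract, the following is a minimal sketch of the standard two-variable Goodman-Kruskal τ computed on a plain contingency table. It is not the CoStar algorithm and does not reproduce the paper's higher-dimensional extension of τ; the function name `goodman_kruskal_tau` and the list-of-lists table layout are assumptions made only for this example.

```python
def goodman_kruskal_tau(table):
    """Goodman-Kruskal tau for predicting the column variable from the row variable.

    `table` is a contingency table given as a list of rows of non-negative counts
    (a hypothetical input format chosen for this sketch).
    """
    total = sum(sum(row) for row in table)
    if total == 0:
        raise ValueError("contingency table is empty")

    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]

    # Expected error when guessing the column category from the column marginals only.
    err_marginal = 1.0 - sum((c / total) ** 2 for c in col_sums)
    if err_marginal == 0.0:
        # Only one column category is ever observed: nothing to predict.
        return 0.0

    # Expected error when the row category is known: 1 - sum_ij p_ij^2 / p_i.
    explained = 0.0
    for row, r in zip(table, row_sums):
        if r == 0:
            continue  # an empty row contributes nothing
        for n in row:
            explained += (n / total) ** 2 / (r / total)
    err_conditional = 1.0 - explained

    # Proportional reduction in prediction error.
    return (err_marginal - err_conditional) / err_marginal


if __name__ == "__main__":
    print(goodman_kruskal_tau([[5, 0], [0, 5]]))  # rows fully determine columns
    print(goodman_kruskal_tau([[2, 2], [2, 2]]))  # rows carry no information
```

τ ranges from 0 (knowing the row category does not reduce the error in predicting the column category) to 1 (the row category determines the column category). Note that the measure is asymmetric: transposing the table gives τ for predicting rows from columns.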
Clustering data is challenging for two main reasons. First, the dimensionality of the data is often very high, which makes cluster interpretation hard; moreover, with high-dimensional data the classic metrics fail to identify the real similarities between objects. The second challenge is the evolving nature of the observed phenomena, which causes datasets to accumulate over time. In this paper we propose solutions to both problems. To tackle the high-dimensionality problem, we apply a co-clustering approach to the dataset that stores the occurrence of features in the observed objects. Co-clustering computes a partition of objects and a partition of features simultaneously. The novelty of our co-clustering solution is that it arranges the clusters hierarchically, producing two hierarchies: one on the objects and one on the features. The two hierarchies are coupled, because the clusters at a given level in one hierarchy are paired with the clusters at the same level of the other hierarchy to form the co-clusters. Each cluster in one of the two hierarchies thus provides insights into the clusters of the other hierarchy. Another novelty of the proposed solution is that the number of clusters is potentially unlimited. Nevertheless, the produced hierarchies remain compact, and therefore more readable, because our method allows a cluster to be split into multiple clusters at the lower level. As regards the second challenge, the accumulating nature of the data makes the datasets intractably large over time. In this case, an incremental solution relieves the issue because it partitions the problem. We therefore introduce an incremental version of our hierarchical co-clustering algorithm. It starts from an intermediate solution computed on the previous version of the data and updates the co-clustering results considering only the added block of data. This solution speeds up the computation with respect to the original approach, which would recompute the result on the whole dataset. In addition, the incremental algorithm guarantees approximately the same result as the original version while saving much of the computational load. We validate the incremental approach on several high-dimensional datasets and perform a careful comparison with both the original version of our algorithm and state-of-the-art competitors. The results open the way to a novel usage of co-clustering algorithms, in which it is advantageous to partition the data into several blocks and process them incrementally, thus “incorporating” data gradually into an ongoing co-clustering solution.