
Rosa Meo

This technical report describes the experimental evaluation settings we adopted in the cInQ project for the deployment of query languages for data mining and inductive databases, and of their system prototypes, in a real-life application case study: Web usage mining through Web logs.
The article deals with the impact of big data on the knowledge assets of public administrations, starting from an experiment on the national public procurement database that directly involved the University of Torino and the Italian National Anti-Corruption Authority (ANAC). The article illustrates the phases that a computer scientist or data scientist follows to put data to use for knowledge purposes: an initial statistical approach, aimed at identifying the descriptive characteristics of the cases under study, is followed by a descriptive approach aimed at finding regularities and correlations in the available database; to these two phases the researcher can add a predictive approach based on machine learning techniques. The article concludes by advocating a prescriptive approach as a means to identify the decisions that should be taken on the basis of the available data, which could suggest good practices for the future.
This work investigates how a robot can learn associations between linguistic elements, such as words or sentences, and its bodily perceptions, which we named "roboceptions". We discuss the possibility of building such an association through interaction with human beings. By interacting with a user, the robot can learn to ascribe a meaning to its roboceptions in order to express them in natural language. Such a process could then be used by the robot in a verbal interaction to detect words recalling previously experienced roboceptions. In this paper, we discuss a Dual-NMT approach to realize such an association. However, it requires an adequate training corpus. For this reason, we consider two different phases towards the realization of the system, and we show the results of the first phase, comparing two approaches: one based on the Latent Semantic Analysis paradigm and one based on the Random Indexing methodology.
Demand forecasting is one of the main challenges for retailers and wholesalers in any industry. Proper demand forecasting gives businesses valuable information about potential profits and helps managers take targeted decisions on business growth strategies. Nowadays almost all organizations use different data sources or databases for nearly every aspect of their operations, so that the knowledge about products on sale is spread across several independent views. The methodology described in this paper addresses the issue of product demand forecasting in the fashion industry, exploiting a multi-view learning approach. In particular, we show how the integration and connection among multiple views improves the accuracy of the results. In real-life applications not all the views are usually available before a product is put on the market, but the utility of proper demand forecasting increases if the prediction is available before the product launch. We show that missing views can be reconstructed by means of common latent factors; in particular, this paper presents a learning procedure that describes the connection between different views. This connection allows data integration from multiple sources and can be extended to the special case of partial data representation. The nearest neighbors in the latent space play a special role in this process and in a general improvement of the forecast quality. We experimented with the proposed methodology on real fashion retail sales, showing that multi-view latent learning is able to satisfactorily reconstruct views that are not yet available and can be used to predict sales volumes well before the goods are put on the market.
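A minimal sketch of the idea behind this kind of reconstruction, assuming one view (product attributes) is always available and another (early sales) is known only for past products; the latent factors here come from a plain truncated SVD and the neighbour step only mirrors the role the abstract assigns to nearest neighbours in the latent space, so all names and model choices are illustrative, not the paper's implementation:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import NearestNeighbors

# Hypothetical setup: view_a is known before launch (e.g. product
# attributes); view_b (e.g. early sales) exists only for past products.
rng = np.random.default_rng(0)
view_a = rng.random((200, 30))
view_b = rng.random((200, 10))
view_a_new = rng.random((5, 30))               # new products: view_b missing

# Common latent factors learned from the always-available view.
svd = TruncatedSVD(n_components=8, random_state=0)
latent_train = svd.fit_transform(view_a)
latent_new = svd.transform(view_a_new)

# Nearest neighbours in the latent space drive the reconstruction:
# the missing view is estimated as the neighbours' average of view_b.
nn = NearestNeighbors(n_neighbors=5).fit(latent_train)
_, idx = nn.kneighbors(latent_new)
view_b_estimate = view_b[idx].mean(axis=1)
print(view_b_estimate.shape)                   # (5, 10)
```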
In this work we analyse data collected from sensors installed on some vehicles of the local public transportation system in a European city. Our analysis is conducted by means of the generation and application of Bayesian networks, to describe the dependence relationships between the variables and to predict the target variable of fuel consumption. We experimented with different algorithms that explore the search space of the possible alternatives guided by heuristics. We compare them with the results obtained with High Performance Computing technology, which allowed us to do an exhaustive search and find the optimal solution from the viewpoint of the likelihood evaluation measure. We solve the model evaluation and selection problem by applying an alternative evaluation measure: Granger causality. In addition, we compared the obtained networks on their ability to predict the target. Finally, we conducted "what-if" analysis in the form of intervention and counterfactual analysis, and show which decisions policy makers and the service owners should take to reduce costs and pollution.
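As a hedged illustration of the heuristic structure search mentioned above, here is a minimal sketch with the pgmpy library on synthetic, already-discretized stand-ins for the sensor variables; the exhaustive HPC search and the Granger-causality selection are outside this snippet, and the variable names are invented:

```python
import numpy as np
import pandas as pd
from pgmpy.estimators import BicScore, HillClimbSearch
from pgmpy.inference import VariableElimination
from pgmpy.models import BayesianNetwork

# Synthetic stand-ins for discretized sensor readings (names invented).
rng = np.random.default_rng(0)
load = rng.integers(0, 2, 1000)
speed = (load + rng.integers(0, 2, 1000)) // 2
fuel = load | rng.integers(0, 2, 1000)
data = pd.DataFrame({"load": load, "speed": speed, "fuel_consumption": fuel})

# Heuristic exploration of the space of structures, guided by a score
# (here BIC); the exhaustive HPC search of the paper is not shown.
dag = HillClimbSearch(data).estimate(scoring_method=BicScore(data))

# Fit the parameters of the selected structure and query the target.
model = BayesianNetwork(dag.edges())
model.add_nodes_from(data.columns)             # keep any isolated variables
model.fit(data)
print(VariableElimination(model).query(["fuel_consumption"],
                                       evidence={"load": 1}))
```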
We present two approaches for digital twinning in the context of the forecast of power production by photovoltaic panels. We employ two digital models that are complementary: the first one is a cyber-physical system, simulating the physical properties of a photovoltaic panel, built by the Open-Source Object-Oriented modeling language Modelica. The second model is data-driven, obtained by the application of Machine Learning techniques on the data collected in an installation of the equipment. Both approaches make use of data from the weather forecast of each day. We compare the results of the two approaches. Finally, we integrate them in more sophisticated hybrid systems that get the benefits of both.
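The abstract does not specify how the hybrid systems are built; one common scheme, shown below purely as an assumption, is residual correction, where the data-driven model learns the error of the physical model and the two are summed at prediction time (all arrays are synthetic placeholders):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic placeholders: `physics` stands for the Modelica model output,
# `weather` for the forecast features, `actual` for the measured power.
rng = np.random.default_rng(0)
weather = rng.random((365, 4))          # e.g. irradiance, temp, cloud, wind
physics = 5.0 * weather[:, 0]
actual = physics + 0.5 * weather[:, 1] + rng.normal(0, 0.1, 365)

# Residual correction: the data-driven model learns the error of the
# physical model; the hybrid adds the learned correction back.
residuals = actual - physics
corrector = GradientBoostingRegressor(random_state=0).fit(weather, residuals)
hybrid = physics + corrector.predict(weather)
print(np.mean((hybrid - actual) ** 2))  # hybrid mean squared error
```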
Emotion analysis in social media is challenging. While most studies focus on positive and negative sentiments, the differentiation between emotions is more difficult. We investigate the problem as a collection of binary classification tasks on the basis of the four opposing emotion pairs provided by Plutchik. We processed the content of the messages by three alternative methods: structural and lexical features, latent factors, and natural language processing. The final prediction is given by classifiers from the state of the art in machine learning. The results convincingly show that the emotion pairs can be distinguished in social media. CCS Concepts: • Computing methodologies → Machine learning approaches; Natural language processing; • Human-centered computing → Collaborative and social computing;
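A minimal sketch of the "one binary task per opposing pair" formulation, using only TF-IDF text features and a toy corpus; the paper's richer structural/lexical and latent-factor features are not reproduced here:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Plutchik's four opposing pairs, each handled as its own binary task.
PAIRS = [("joy", "sadness"), ("anger", "fear"),
         ("trust", "disgust"), ("anticipation", "surprise")]

# Toy labelled corpus: (message, emotion) tuples.
corpus = [("so happy today!", "joy"), ("I miss her badly", "sadness"),
          ("this makes me furious", "anger"), ("I'm terrified", "fear")]

for pos, neg in PAIRS:
    docs = [(t, lab) for t, lab in corpus if lab in (pos, neg)]
    if not docs:
        continue  # no examples for this pair in the toy corpus
    texts, labels = zip(*docs)
    # One independent binary classifier per opposing emotion pair.
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    clf.fit(list(texts), [lab == pos for lab in labels])
```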
Data mining evolved as a collection of application problems and of efficient algorithms for their solution, all focused on the discovery of relevant information hidden in databases of huge dimensions. In particular, one of the most investigated topics is the discovery of association rules.
Clustering data described by categorical attributes is a challenging task in data mining applications. Unlike numerical attributes, it is difficult to define a distance between pairs of values of a categorical attribute, since the values are not ordered. In this article, we propose a framework to learn a context-based distance for categorical attributes. The key intuition of this work is that the distance between two values of a categorical attribute A_i can be determined by the way in which the values of the other attributes A_j are distributed in the dataset objects: if they are similarly distributed in the groups of objects corresponding to the distinct values of A_i, a low distance value is obtained. We also propose a solution to the critical point of the choice of the attributes A_j. We validate our approach by embedding our distance learning framework in a hierarchical clustering algorithm. We applied it on various real-world and synthetic datasets, both low- and high-dimensional...
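The following sketch only illustrates the key intuition stated above, not the paper's exact estimator: for each context attribute A_j it compares the conditional distributions of A_j's values given each value of the target attribute, so that similarly distributed values end up close:

```python
import numpy as np
import pandas as pd

def context_distance(df, target, contexts):
    """Distance between pairs of values of `target`, driven by how the
    `contexts` attributes distribute given each target value (a sketch
    of the intuition above, not the paper's exact method)."""
    values = df[target].unique()
    dist = pd.DataFrame(0.0, index=values, columns=values)
    for ctx in contexts:
        # P(context value | target value), one row per target value.
        cond = pd.crosstab(df[target], df[ctx], normalize="index")
        for a in values:
            for b in values:
                # Similar conditional distributions -> small distance.
                dist.loc[a, b] += np.sqrt(((cond.loc[a] - cond.loc[b]) ** 2).sum())
    return dist / len(contexts)

df = pd.DataFrame({"color": ["red", "red", "blue", "blue", "green"],
                   "size": ["S", "M", "M", "L", "L"]})
print(context_distance(df, "color", ["size"]))
```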
Abstract. In this paper we present the application of the inductive database approach to two practical analytical case studies: Web usage mining in Web logs, and financial data. As far as the Web domain is concerned, we have considered the enriched XML Web logs, which we call conceptual logs, produced by specific Web applications. These have been built using a conceptual model, namely WebML, and its accompanying CASE tool, WebRatio. The Web conceptual logs integrate the usual information about user requests with meta-data concerning the Web site structure. As for the analysis of financial data, we have considered the trade stock exchange index Dow Jones and studied its component stocks from 1997 to 2002 using so-called technical analysis. Technical analysis consists in the identification of the relevant (graphical) patterns that occur in the plot of the evolution of a stock quote as time proceeds, often adopting different time granularities. On the plots, the correlations between distinctive variables of the stock quotes are pointed out, such as the quote trend, the percentage variation and the volume of the stocks exchanged...
Abstract. In recent years, researchers have begun to study inductive databases, a new generation of databases for leveraging decision-support applications. In this context, the user interacts with the DBMS using advanced, constraint-based languages for data mining, where constraints have been specifically introduced to increase the relevance of the results and, at the same time, to reduce their volume. In this paper we study the problem of mining frequent itemsets using an inductive database. We propose a technique for query answering which consists of rewriting the query in terms of unions and intersections of the result sets of other queries, previously executed and materialized. Unfortunately, the exploitation of past queries is not always applicable. We then present sufficient conditions for the optimization to apply and show that these conditions are strictly connected with the presence of functional dependencies between the attributes involved in the queries. We show some experiments...
In this paper we propose and test the use of hierarchical clustering for feature selection in databases. The clustering method is Ward’s, with a distance measure based on the Goodman-Kruskal τ. We motivate the choice of this measure and compare it with other ones. Our hierarchical clustering is applied to over 40 datasets from the UCI archive. The proposed approach is interesting from many viewpoints. First, it produces the dendrogram of feature subsets, which serves as a valuable tool to study relevance relationships among features. Second, the dendrogram is used in a feature selection algorithm to select the best features by a wrapper method. Experiments were run with three different families of classifiers: Naive Bayes, decision trees and k-nearest neighbours. Our method allows all three classifiers to generally outperform their counterparts without feature selection. We compare our feature selection with other state-of-the-art methods, obtaining on average a better classific...
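For reference, a small sketch of the Goodman-Kruskal τ on which the distance is based; note that τ is asymmetric, and the paper's exact construction of a distance from it (e.g., a symmetrized 1 − τ) is not reproduced here:

```python
import pandas as pd

def goodman_kruskal_tau(x, y):
    """Goodman-Kruskal tau of y given x: proportional reduction in the
    error of predicting y when x is known (1 = x determines y, 0 = no help)."""
    p_y = y.value_counts(normalize=True)
    v_y = 1.0 - (p_y ** 2).sum()                  # Gini variation of y
    joint = pd.crosstab(x, y, normalize=True)
    p_x = joint.sum(axis=1)
    cond = joint.div(p_x, axis=0)                 # P(y | x)
    v_y_given_x = (p_x * (1.0 - (cond ** 2).sum(axis=1))).sum()
    return (v_y - v_y_given_x) / v_y

df = pd.DataFrame({"a": ["u", "u", "v", "v"], "b": ["p", "p", "q", "q"]})
print(goodman_kruskal_tau(df["a"], df["b"]))      # 1.0: a determines b
```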
In this paper we solve the problem of classifying chestnut plants according to their place of origin. We compare the results obtained by state-of-the-art classifiers, among which MLP, RBF, SVM, the C4.5 decision tree and random forest. We determine which features are meaningful for the classification, the classification accuracy achievable by these classifier families with the available features, and how robust the classifiers are to noise. Among the obtained classifiers, neural networks show the greatest robustness to noise.
In this paper we describe the implementation of the MuMe dialogue system, a task-based dialogue system for a car-sharing service, and its evaluation through the IDIAL protocol. Finally we report some comments on this novel dialogue-system evaluation method.
This three-volume set LNAI 8724, 8725 and 8726 constitutes the refereed proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases: ECML PKDD 2014, held in Nancy, France, in September 2014. The 115 revised research papers presented together with 13 demo track papers, 10 nectar track papers, 8 PhD track papers, and 9 invited talks were carefully reviewed and selected from 550 submissions. The papers cover the latest high-quality interdisciplinary research results in all areas related to machine learning and knowledge discovery in databases.
As the number of submissions to the IEEE Transactions on Knowledge and Data Engineering (TKDE) and the diversity in topics keep increasing, TKDE needs fresh blood and strong hands. I am pleased to officially welcome the 13 associate editors who just joined the editorial board: Drs. Leman Akoglu, Hongrae Lee, Justin Levandoski, Xuelong Li, Rosa Meo, Carlos Ordonez, Jeff Philips, Barbara Poblete, K. Selçuk Candan, Meng Wang, Jirong Wen, Li Xiong, and Wenjie Zhang. This group of newly appointed associate editors are established and active working experts in the wonderful wide spectrum of knowledge and data engineering. Moreover, they are very committed and dedicated to serving the community and handling the review processes, as testified by their rich experience. Their biographies and photos are provided below. At the same time, I want to sincerely thank Drs. Shivnath Babu, Sanjay Chawla, Xiaofei He, Daxin Jiang, Ruoming Jin, and Evaggelia Pitoura, who just retired from the editorial board...
We propose DepMiner, a software prototype implementing a simple but effective model for the evaluation of itemsets and, in general, of the dependencies between variables on a domain of finite values. The method is based on ∆, the departure of the observed probability of a set of valued variables in a database from a referential probability estimated under the condition of maximum entropy. This model is able to distinguish between dependencies intrinsic to the itemset and dependencies "inherited" from its subsets: thus it is suitable for directly comparing the utility of an itemset with that of its subsets and for reducing the volume of non-significant itemsets in the result of a frequent-itemset mining request. The method is powerful because it is able at the same time to detect significant positive dependencies as well as negative ones, which occur when the association among the variables is rarer than expected. The system returns itemsets ranked by a normalized version of ∆ and t...
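A toy illustration of ∆ under the assumption, valid for 2-itemsets, that the maximum-entropy referential probability reduces to independence; larger itemsets require an iterative-scaling estimate over all subset constraints, which this sketch omits:

```python
import pandas as pd

# Toy transaction database as boolean columns (one per item).
df = pd.DataFrame({"bread": [1, 1, 1, 0, 0, 1],
                   "butter": [1, 1, 0, 0, 1, 1]}).astype(bool)

# For a 2-itemset the maximum-entropy referential probability reduces
# to independence: P(A)P(B).
p_obs = (df["bread"] & df["butter"]).mean()
p_ref = df["bread"].mean() * df["butter"].mean()
delta = p_obs - p_ref
print(f"observed={p_obs:.3f} referential={p_ref:.3f} delta={delta:+.3f}")
# delta > 0: positive dependence; delta < 0: rarer than expected.
```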
The influence of training, posture, nutrition or psychological attitudes on an athlete’s career is well described in the literature. An additional factor of success that is widely recognized as crucial is the network of matches that an athlete plays during a season. The hypothesis is that the quality of a player’s opponents affects her long-term ranking and performance. Even though the relevance of these factors is widely recognized, a quantitative characterization is missing. In this paper, we try to fill this gap by combining network analysis and machine learning to estimate the contribution of the network of matches in predicting an athlete’s success. We consider all the official games played by Italian table tennis players between 2011 and 2016. We observe that the matches network shows scale-free behavior, typical of several real-world systems, and that different structural properties are positively correlated with the athletes’ performance (Spearman , p-value ). Using these findings, we implement three different tasks: talent identification, performance prediction and ranking prediction. Results show consistently that machine learning approaches are able to predict players’ success and that the topological features play an effective role in increasing their predictive power.
Francesca Cordero, Stefano Ghignone, Luisa Lanfranco, Giorgio Leonardi, Rosa Meo, Stefania Montani and Luca Roversi. "BIOBITS: A Study on Candidatus ..." (book chapter, 2010).
I. RATIONALE. For the first year, ICDM hosts a Forum dedicated to PhD students. The aim of the ICDM PhD Forum is to provide an international environment in which PhD students can meet and exchange their ideas and experiences, both with peers and with senior researchers from the Data Mining community. Here, PhD students have a unique opportunity to present their ideas and discuss the work-in-progress towards the PhD dissertation and the major interests in the Data Mining field. The PhD Forum ...
We designed this track on "Data Mining" with an emphasis on declarative data mining, intelligent querying and associated issues such as optimization, indexing, query processing, languages and constraints, as in the previous two years of SAC. This year, attention is also placed on data preprocessing problems, such as data cleaning, discretization and sampling. We also encouraged and received submissions of papers on new applications of data mining systems, such as in biology and science, in Web analysis and XML documents ...
In this paper we present the application of the inductive database approach to a practical analytical case study: the analysis of financial data. Inductive databases provide advanced support for Data Mining applications, through the integration of DBMS technology and powerful mining languages. In this case study, we have considered the trade stock exchange index Dow Jones 30 and studied its component stocks from 1997 to 2002 using the so-called technical analysis. Technical analysis consists in the identification of the relevant (graphical) patterns that occur in the plot of evolution of a stock quote as time proceeds, often adopting different time granularities. On the plots the correlations between distinctive variables of the stock quotes are pointed out, such as the quote trend, the percentage variation and the volume of the stocks exchanged. In particular we adopted candle-sticks, a figurative pattern representing in a condensed diagram the evolution of the stock quotes in a daily sto...


In today's scenario, data have acquired a central position in many fields, technological and otherwise. The ability to manage large amounts of data, to analyze them statistically in order to gain knowledge of the distribution of their characteristics, and to extract information useful for improving and simplifying business processes (both management and production) would bring many advantages in problem solving and in streamlining bureaucracy, and would represent a source of economic and technological advantage.
The Digital Agenda is one of the pillars of the "Europe 2020" strategy, which sets the EU's growth objectives up to 2020 [Agenda Digitale]. Following the COVID-19 pandemic, with the establishment of Next Generation EU, the European Commission allocated further funding, placing greater emphasis on the digital revolution, listed among the six priorities of the European Commission for 2019-2024. Its purpose is to leverage the potential of Information and Communication Technologies (ICT) to foster innovation, progress and economic growth in the countries of the European community. The main objective is the development of the digital single market, based on three aspects:
1. Improving access to online products and services by removing barriers to e-commerce;
2. Developing telecommunication networks and digital services;
3. Fostering the sustainable growth of the European digital economy.
The Digital Agenda leverages digital technologies for the transformation of the information, production and information-consumption processes to which every EU member state commits within its own national scope. The digitalization, innovation and security of the information systems of the Public Administration (PA) are one of the components of this mission. Employees and citizens, who use the services offered, are also partly involved, called to take part in this innovation process and to increase their competitive capacity and digital culture. Digitalizing public administrations also means standardizing their information flows so as to make their information systems interoperable. This would make it possible to have a general, single database, shared with respect to the information that can and must be made common and public. Thanks to this sharing, management processes would become more efficient (because better informed), verifiable and transparent. On the shared data it would be possible to carry out a great many analyses that would increase knowledge of the procedures and consequently make them more efficient and less expensive: on tender procurements and on the procedures carried out by the administrations, with benefits in terms of rationalization and containment of public spending.
Data are a form of wealth and have been called the oil of our era [Berners-Lee and Shadbolt, 2011]. Data underpin the implementation of automated processing procedures, the simulation of scenarios and phenomena of interest, the computation of quality and performance indicators to be optimized, and the prediction of the future trends of variables to be monitored. Many of these procedures are based on the principles of Machine Learning [ML], a branch of Artificial Intelligence [IA] that makes it possible to automate processes and to estimate future evolution.
At present, not many data are openly accessible, and this severely penalizes the quality of predictive models, i.e., the models that make predictions starting from accumulated data.
In this chapter we want to show how it would be possible (and, all things considered, simple and within the reach of many) to extract useful knowledge from data concerning public administrations. This would be possible if management data and information flows were made available in open format, so that methods and software could be developed for their processing and subsequently released for reuse by other administrations on the data within their competence. This would produce a "leverage effect", i.e., the ability to act according to an economy of scale and to reuse the extracted knowledge or the techniques employed in other domains as well. A further possible use of the software solutions already developed would be, for example, to employ them as a tool for suggesting and comparing the cost or revenue levels of goods or services within the same economic sector but in different geographical areas, or for comparing the same good or service employed in different sectors.
In this work we therefore want to apply Machine Learning techniques to the (open) data concerning the payment flows of public administrations. The revenues and expenditures of the PA are today published openly through the SIOPE system (Sistema Informativo sulle Operazioni degli Enti Pubblici) [SIOPE], which records payments digitally.
SIOPE was born from the collaboration between the Ragioneria Generale dello Stato, the Banca d'Italia and ISTAT, implementing Article 28 of Law no. 289/2002, as regulated by Article 14 of Law no. 196 of 2009. The system has made it possible to improve the monitoring of the cash flows of the PA, because it promptly records a large amount of available information that allows the trend of public accounts to be tracked. A key aspect of SIOPE has been its ability to overcome, through a coding scheme that is uniform by type of entity, the differences between the accounting systems of the individual public administrations, without affecting the structure of their budgets. At the moment SIOPE is the main information source for the preparation of the quarterly reports on the consolidated cash account. SIOPE is therefore a fundamental tool for monitoring public accounts in order to verify the rules laid down by EU law (the excessive deficit procedure and the Stability and Growth Pact). Gradually, SIOPE is destined to be extended to all public administrations.
When we speak of estimates and predictions, we mean Machine Learning algorithms that use experience to improve their knowledge of a phenomenon (or of a variable to be estimated) and their estimation performance (correctness and precision). The experience is given by the examples that the learning algorithm takes as input to train a model. The model consists of a mapping mechanism that is able to associate the variable to be predicted with other input variables that describe the examples. Applying the model to data not included in the training set allows estimates and predictions of the output variable that are more accurate than could be made without the model, from the input variables alone. This is the increase in knowledge obtained when the model is applied to new data or examples. The better the quality of the data (representative of the situation of interest), the better the performance and, consequently, the higher the accuracy of the prediction. Predictive models therefore comprise all those techniques that try to "interpret" the data, uncovering regularities in the values of the variables and in their trends (patterns).
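As a minimal illustration of the train-then-apply scheme just described (synthetic data and an arbitrary model choice, not tied to the case study):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Illustrative examples: input variables X describe the cases, y is the
# variable to be predicted.
X, y = make_regression(n_samples=500, n_features=6, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Training: the algorithm builds the input -> output mapping from examples.
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Applying the model to data not seen during training gives estimates that
# are evaluated against the true values (correctness of the prediction).
print(mean_absolute_error(y_test, model.predict(X_test)))
```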
One of the many predictive tasks of Machine Learning, of which we will see a case study, concerns the analysis of time series. Time series give information about the evolution of certain variables over time [Serie storiche]. For example, information on how the value of a meteorological variable (such as temperature, humidity, irradiance, etc.) varies as the hours pass, or on the value over time of shares quoted on the stock exchange. This kind of information is very often used to learn the behaviour of the data from their history, so as to predict them in the future. The models usually used to learn time series are regression models [Regressione], in all their variants. While standard Machine Learning models treat the values of the variables of interest at past time instants in the same way as those at later instants, models based on time series (in which the observations vary over time) add an explicit dependence on the order of the observations: the temporal dimension. In the study we will see, the models were used to learn and predict the revenues and expenditures of the public universities in the Italian territory classified as large and mega (by number of students). To obtain the dataset, the time series of the universities containing the monthly cash amounts (incoming and outgoing) from 2008 to today were downloaded through the SIOPE portal. The downloaded data are organized as follows:
- entity identification code
- spending-chapter code
- month
- year
- cash-flow amount
The spending-chapter code is not a useful or interpretable datum, because it does not contain a description of the type of expenditure, and therefore it will not be used. Instead, as a consolidation operation on the monthly cash flows, all the outflows (and inflows) of the same entity but for different spending chapters were aggregated by summation, yielding a single monthly outflow (or inflow) amount, as in the sketch below.
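A sketch of this consolidation and of a simple regressive model over the resulting monthly series; the file and column names are hypothetical renderings of the fields listed above:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical CSV export of the SIOPE series, one row per entity,
# spending chapter and month (file and column names are illustrative).
flows = pd.read_csv("siope_universities.csv")  # ente, capitolo, anno, mese, importo

# Consolidation: ignore the spending-chapter code and sum the flows of
# the same entity over all chapters, one monthly amount per entity.
monthly = (flows.groupby(["ente", "anno", "mese"], as_index=False)["importo"]
                .sum()
                .sort_values(["ente", "anno", "mese"]))

# A simple regressive model on one entity's series: predict each month
# from the previous 12, making the temporal ordering explicit.
series = monthly.loc[monthly["ente"] == monthly["ente"].iloc[0], "importo"]
lags = pd.concat({f"lag_{k}": series.shift(k) for k in range(1, 13)}, axis=1)
data = pd.concat([lags, series.rename("y")], axis=1).dropna()
model = LinearRegression().fit(data.drop(columns="y"), data["y"])

# Forecast the next month from the 12 most recent observations
# (lag_1 = last month, hence the reversal).
print(model.predict(series.iloc[-12:][::-1].to_numpy().reshape(1, -1)))
```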