This document introduces machine learning strategies for time series prediction. It begins with an introduction to the speaker and his background and research interests. It then provides an outline of the topics to be covered, including notions of time series, machine learning approaches for prediction, local learning methods, forecasting techniques, and applications and future directions. The document discusses what the audience should know coming into the course and what they will learn.
The use of AI and machine learning models is likely to become more commonplace as larger swaths of the economy embrace automation and data-driven decision-making. While these predictive systems can be quite accurate, they have often been treated as inscrutable black boxes that produce only numeric predictions with no accompanying explanations. Unfortunately, recent studies and events have drawn attention to mathematical and sociological flaws in prominent weak AI and ML systems, and practitioners usually don’t have the right tools to pry open machine learning black boxes and debug them.
This presentation introduces several new approaches that increase transparency, accountability, and trustworthiness in machine learning models. If you are a data scientist or analyst who wants to explain a machine learning model to customers or managers (or if you have concerns about documentation, validation, or regulatory requirements), then this presentation is for you!
Robust Feature Learning with Deep Neural Networks
http://snu-primo.hosted.exlibrisgroup.com/primo_library/libweb/action/display.do?tabs=viewOnlineTab&doc=82SNU_INST21557911060002591
The document discusses hyperparameters and hyperparameter tuning in deep learning models. It defines hyperparameters as parameters that govern how the model parameters (weights and biases) are determined during training, in contrast to model parameters which are learned from the training data. Important hyperparameters include the learning rate, number of layers and units, and activation functions. The goal of training is for the model to perform optimally on unseen test data. Model selection, such as through cross-validation, is used to select the optimal hyperparameters. Training, validation, and test sets are also discussed, with the validation set used for model selection and the test set providing an unbiased evaluation of the fully trained model.
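The train/validation/test workflow described above can be sketched end to end. Everything below is invented for illustration: the synthetic 1-D data, the nearest-neighbour model, and the candidate values of the hyperparameter k.

```python
import random

def knn_predict(train, x, k):
    """Predict the label of x by majority vote among its k nearest training points."""
    neighbors = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = [label for _, label in neighbors]
    return max(set(votes), key=votes.count)

def accuracy(k, train, data):
    return sum(knn_predict(train, x, k) == y for x, y in data) / len(data)

random.seed(0)

def sample(n):
    """Synthetic 1-D data: class is 1 above x = 0.5, with 10% label noise."""
    out = []
    for _ in range(n):
        x = random.random()
        y = int(x > 0.5) if random.random() > 0.1 else int(x <= 0.5)
        out.append((x, y))
    return out

train, validation, test = sample(200), sample(100), sample(100)

# Model selection: pick the hyperparameter k that does best on the validation set...
best_k = max([1, 3, 5, 9], key=lambda k: accuracy(k, train, validation))
# ...then report an unbiased score on the untouched test set.
print("chosen k:", best_k, "test accuracy:", accuracy(best_k, train, test))
```

The key point the sketch illustrates is that the test set is never consulted while choosing k; only the validation score drives model selection.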
Introduction to Interpretable Machine Learning (Nguyen Giang)
This document discusses interpretable machine learning and explainable AI. It begins with definitions of key terms and an overview of interpretable methods. Deep learning models are often treated as "black boxes" that are difficult to interpret. Interpretability can be achieved by using inherently interpretable models like linear models or decision trees, adding attention mechanisms, or interpreting models before, during or after building them. Later sections discuss specific interpretable techniques like understanding data through examples, MMD-Critic for learning prototypes and criticisms, and visualizing convolutional neural networks to understand predictions. The document emphasizes the importance of interpretability and explains several approaches to make machine learning models more transparent to humans.
Federated Learning makes it possible to build machine learning systems without direct access to training data. The data remains in its original location, which helps to ensure privacy, reduces network communication costs, and taps edge-device computing resources. The data-minimization principles established by the GDPR and the growing prevalence of smart sensors make the advantages of federated learning all the more compelling. Federated learning is a great fit for smartphones, industrial and consumer IoT, industrial sensor applications, healthcare, and other privacy-sensitive use cases.
We’ll present the Fast Forward Labs team’s research on this topic and the accompanying prototype application, “Turbofan Tycoon”: a simplified working example of federated learning applied to a predictive maintenance problem. In this demo scenario, customers of an industrial turbofan manufacturer are not willing to share the details of how their components failed with the manufacturer, but want the manufacturer to provide them with a strategy to maintain the part. Federated learning allows us to satisfy the customer's privacy concerns while providing them with a model that leads to fewer costly failures and less maintenance downtime.
We’ll discuss the advantages and tradeoffs of taking the federated approach. We’ll assess the state of tooling for federated learning, circumstances in which you might want to consider applying it, and the challenges you’d face along the way.
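The aggregation step at the heart of this setup, federated averaging, can be sketched in a few lines. The single-slope "model" and all client data below are invented for illustration; real systems exchange neural network weight vectors rather than one scalar.

```python
# Federated averaging: each client fits a model on its private data, and only
# the model parameters (never the raw data) are sent to the server, which
# averages them weighted by client dataset size.

def local_fit(data):
    """Toy client model: the slope of y = w * x fitted by least squares."""
    num = sum(x * y for x, y in data)
    den = sum(x * x for x, _ in data)
    return num / den

def federated_average(clients):
    """Aggregate client models into a global model, weighting by data size."""
    total = sum(len(data) for data in clients)
    return sum(local_fit(data) * len(data) for data in clients) / total

# Three clients whose private data all roughly follow y = 2x.
clients = [
    [(1, 2.1), (2, 3.9), (3, 6.2)],
    [(1, 1.8), (2, 4.1)],
    [(2, 4.0), (4, 8.1), (5, 9.8), (6, 12.2)],
]
global_w = federated_average(clients)
print(round(global_w, 2))  # close to the true slope of 2
```

The server learns a model that reflects every client's data without any client revealing its failure records, which is exactly the property the turbofan scenario relies on.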
Speaker
Chris Wallace
Data Scientist
Cloudera
This document discusses machine learning techniques for time series analysis and financial applications. It introduces common machine learning models like neural networks, convolutional neural networks, and LSTM networks. It also covers important concepts like feature engineering, labeling financial data, model evaluation techniques like cross validation, and analyzing feature importance. The document provides an overview of machine learning fundamentals and challenges in applying these techniques to time series and trading applications.
The document presents a machine learning model to predict disease severity and outcomes in COVID-19 patients. It discusses using various machine learning algorithms like random forest, support vector machine, classification and regression trees, genetic algorithm, and artificial neural networks on COVID-19 patient data. The objectives are to build a model that can predict the spread of the virus in the next 7 days and analyze current conditions in India to see if it will face a crisis like other countries. Literature on existing COVID-19 machine learning studies is reviewed and research gaps are identified like improving data collection. The Italy dataset with 14 attributes is used and the different algorithms' accuracies will be calculated to identify the best approach.
A sharing talk in Hsinchu Coders.
The materials (i.e. images) are from their respective owners:
https://research.googleblog.com/2017/04/federated-learning-collaborative.html
Deep learning is a type of machine learning that uses neural networks inspired by the human brain. It has been successfully applied to problems like image recognition, speech recognition, and natural language processing. Deep learning requires large datasets, clear goals, computing power, and neural network architectures. Popular deep learning models include convolutional neural networks and recurrent neural networks. Researchers like Geoffrey Hinton and companies like Google have advanced the field through innovations that have won image recognition challenges. Deep learning will continue solving harder artificial intelligence problems by learning from massive amounts of data.
Time Series Forecasting Using Recurrent Neural Network and Vector Autoregress... (Databricks)
Given the resurgence of neural network-based techniques in recent years, it is important for data science practitioners to understand how to apply these techniques and the tradeoffs between neural network-based and traditional statistical methods.
This lecture discusses two specific techniques: Vector Autoregressive (VAR) models and Recurrent Neural Networks (RNN). The former is one of the most important classes of multivariate time series statistical models applied in finance, while the latter is a neural network architecture suitable for time series forecasting. I’ll demonstrate how they are implemented in practice and compare their advantages and disadvantages. Real-world applications, demonstrated using Python and Spark, are used to illustrate these techniques. While not the focus of this lecture, exploratory time series data analysis using time-series plots, plots of autocorrelation (i.e. correlograms), plots of partial autocorrelation, plots of cross-correlations, histograms, and kernel density plots will also be included in the demo.
The attendees will learn – the formulation of a time series forecasting problem statement in context of VAR and RNN – the application of Recurrent Neural Network-based techniques in time series forecasting – the application of Vector Autoregressive Models in multivariate time series forecasting – the pros and cons of using VAR and RNN-based techniques in the context of financial time series forecasting – When to use VAR and when to use RNN-based techniques
Index
History of Machine Learning.
What is Machine Learning.
Why ML.
Learning System Model.
Training and Testing.
Performance.
Algorithms.
Machine Learning Structure.
Application.
Conclusion.
----------------------------------------------
THANK YOU
Machine learning can be distributed across multiple machines to allow for processing of large datasets and complex models. There are three main approaches to distributed machine learning: data parallel, where the data is partitioned across machines and models are replicated; model parallel, where different parts of large models are distributed; and graph parallel, where graphs and algorithms are partitioned. Distributed frameworks use these approaches to efficiently and scalably train machine learning models on big data in parallel.
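The data-parallel approach can be sketched as follows; the single-weight model, the learning rate, and the shard contents are all invented for illustration, and real frameworks replace the averaging line with an all-reduce over many machines.

```python
# Data-parallel training in miniature: the dataset is partitioned across
# "machines", each computes a gradient on its shard with a replica of the
# model, and the gradients are averaged before a shared parameter update.

def shard_gradient(w, shard):
    """Gradient of mean squared error for the toy model y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, shards, lr=0.01):
    grads = [shard_gradient(w, s) for s in shards]  # run in parallel in practice
    avg_grad = sum(grads) / len(grads)              # the all-reduce step
    return w - lr * avg_grad

# Data drawn from y = 3x, split across two workers.
shards = [[(1, 3), (2, 6)], [(3, 9), (4, 12)]]
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards)
print(round(w, 2))  # converges toward 3
```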
Active learning is a machine learning technique where the learner is able to interactively query the oracle (e.g. a human) to obtain labels for new data points in an effort to learn more accurately from fewer labeled examples. The learner selects the most informative samples to be labeled by the oracle, such as samples closest to the decision boundary or where models disagree most. This allows the learner to minimize the number of labeled samples needed, thus reducing the cost of training an accurate model. Suggested improvements include querying batches of samples instead of single samples and accounting for varying labeling costs.
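The "closest to the decision boundary" query strategy, usually called uncertainty sampling, can be sketched in a few lines. The logistic model parameters and the pool of unlabeled points below are invented for illustration.

```python
import math

def predict_proba(w, b, x):
    """Probability of class 1 under a toy 1-D logistic model."""
    return 1 / (1 + math.exp(-(w * x + b)))

def most_informative(w, b, pool):
    """Query the pool point the current model is least sure about,
    i.e. the one whose predicted probability is closest to 0.5."""
    return min(pool, key=lambda x: abs(predict_proba(w, b, x) - 0.5))

pool = [-4.0, -1.0, 0.2, 3.0, 5.0]
query = most_informative(1.0, 0.0, pool)  # decision boundary sits at x = 0
print(query)  # 0.2 lies nearest the boundary
```

Only the selected point is sent to the oracle for labeling, which is how the learner keeps its labeling budget small.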
GANs are the hottest new topic in the ML arena; however, they present a challenge for researchers and engineers alike. Their design and, most importantly, their code implementation have been causing headaches for ML practitioners, especially when moving to production.
Starting from the very basics of what a GAN is, passing through a TensorFlow implementation using the most cutting-edge APIs available in the framework, and finally arriving at production-ready serving at scale with Google Cloud ML Engine.
Slides for the talk: https://www.pycon.it/conference/talks/deep-diving-into-gans-form-theory-to-production
Github repo: https://github.com/zurutech/gans-from-theory-to-production
Automated machine learning (AutoML) systems can find the optimal machine learning algorithm and hyperparameters for a given dataset without human intervention. AutoML addresses the skills gap in data science by allowing data scientists to build more models in less time. On average, tuning hyperparameters results in a 5-10% improvement in accuracy over default parameters. However, the best parameters vary across problems. AutoML tools like Auto-sklearn use techniques like Bayesian optimization and meta-learning to efficiently search the hyperparameter space. Auto-sklearn has won several AutoML challenges due to its ability to effectively optimize over 100 hyperparameters.
This describes the supervised machine learning, supervised learning categorisation( regression and classification) and their types, applications of supervised machine learning, etc.
Machine learning works by processing data to discover patterns that can be used to analyze new data. Popular programming languages for machine learning include Python, R, and SQL. There are several types of machine learning including supervised learning, unsupervised learning, semi-supervised learning, reinforcement learning, and deep learning. Common machine learning tasks involve classification, regression, clustering, dimensionality reduction, and model selection. Machine learning is widely used for applications such as spam filtering, recommendations, speech recognition, and machine translation.
A Monte Carlo strategy for structured multiple-step-ahead time series prediction (Gianluca Bontempi)
The document proposes a Monte Carlo approach called SMC (Structured Monte Carlo) for multiple-step-ahead time series forecasting that takes into account the structural dependencies between predictions. It generates samples using a direct forecasting approach and weights them based on how well they satisfy dependencies identified by an iterated approach. Experiments on three benchmark datasets show the SMC approach achieves more accurate forecasts as measured by SMAPE than iterated, direct, or other comparison methods for most prediction horizons tested.
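The SMAPE score used to measure forecast accuracy above can be computed directly from its definition; note that conventions differ slightly across papers (some omit the factor of 2 in the denominator). A minimal sketch with invented numbers:

```python
def smape(actual, forecast):
    """Symmetric mean absolute percentage error, in percent.
    One common convention: |error| scaled by the mean magnitude of the pair."""
    terms = [
        abs(f - a) / ((abs(a) + abs(f)) / 2)
        for a, f in zip(actual, forecast)
    ]
    return 100 * sum(terms) / len(terms)

print(smape([100, 200, 300], [110, 190, 310]))  # a few percent error
print(smape([100, 200, 300], [100, 200, 300]))  # perfect forecast scores 0
```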
This document presents a general framework for enhancing time series prediction performance. It discusses using multiple predictions from a base method like neural networks, ARIMA or Holt-Winters to improve accuracy. Short-term enhancement uses support vector regression on statistic and reliability features of the multiple predictions to enhance 1-step ahead predictions. Long-term enhancement trains additional models on the short-term predictions to enhance longer-horizon predictions. The framework is evaluated on traffic flow data with prediction horizons of 1 week and 13 weeks.
Computational Intelligence for Time Series Prediction (Gianluca Bontempi)
This document provides an overview of computational intelligence methods for time series prediction. It begins with introductions to time series analysis and machine learning approaches for prediction. Specific models discussed include autoregressive (AR), moving average (MA), and autoregressive moving average (ARMA) processes. Parameter estimation techniques for AR models are also covered. The document outlines applications in areas like forecasting, wireless sensors, and biomedicine and concludes with perspectives on future directions.
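One of the simplest parameter estimation techniques for AR models is least squares. A sketch for an AR(1) process on simulated data; the true coefficient of 0.7 and the series length are chosen arbitrarily for illustration:

```python
import random

def fit_ar1(series):
    """Estimate phi in x_t = phi * x_{t-1} + noise by regressing
    x_t on x_{t-1} (ordinary least squares, no intercept)."""
    pairs = list(zip(series[:-1], series[1:]))
    num = sum(prev * curr for prev, curr in pairs)
    den = sum(prev * prev for prev, _ in pairs)
    return num / den

# Simulate an AR(1) series with true phi = 0.7 and unit-variance noise.
random.seed(1)
x, series = 0.0, []
for _ in range(5000):
    x = 0.7 * x + random.gauss(0, 1)
    series.append(x)

phi_hat = fit_ar1(series)
print(round(phi_hat, 2))  # close to 0.7
```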
Time series analysis examines patterns in data over time. It relies on identifying trends, measuring past patterns to forecast the future, and decomposing time series into four main components: secular trends, cyclical movements, seasonal variations, and irregular variations. Secular trends represent long-term direction, while cyclical and seasonal variations have recurring patterns over different time scales. Various techniques can depict trends and identify variations, including freehand drawing, semi-averages, moving averages, least squares, and exponential smoothing.
The document discusses steps for identifying and building ARIMA models for time series data. It describes ARIMA models as consisting of three components - identification, estimation, and diagnostic checking. For identification, it explains how to determine the p, d, and q values by examining the autocorrelation and partial autocorrelation functions of stationary differenced time series data. It then discusses using the method of moments to estimate ARIMA model parameters by equating sample statistics to population parameters.
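The sample autocorrelation function used in the identification step can be computed straight from its definition. A sketch, with a white-noise series (invented here) as the sanity check: for pure noise every autocorrelation should be near zero, whereas slowly decaying values would signal the need for differencing.

```python
import random

def acf(series, max_lag):
    """Sample autocorrelations r_1 ... r_max_lag of a stationary series."""
    n = len(series)
    mean = sum(series) / n
    c0 = sum((x - mean) ** 2 for x in series) / n
    out = []
    for k in range(1, max_lag + 1):
        ck = sum((series[t] - mean) * (series[t + k] - mean)
                 for t in range(n - k)) / n
        out.append(ck / c0)
    return out

random.seed(0)
noise = [random.gauss(0, 1) for _ in range(2000)]
print([round(r, 2) for r in acf(noise, 3)])  # all near zero for white noise
```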
This document provides an overview of time series analysis and forecasting techniques. It discusses key concepts such as stationary and non-stationary time series, additive and multiplicative models, smoothing methods like moving averages and exponential smoothing, autoregressive (AR), moving average (MA) and autoregressive integrated moving average (ARIMA) models. The document uses examples to illustrate how to identify patterns in time series data and select appropriate models for description, explanation and forecasting of time series.
Extending and integrating a hybrid knowledge representation system into the c... (Valentina Rho)
This document summarizes Valentina Rho's thesis on extending a hybrid knowledge representation system called Dual-PECCS into the cognitive architecture ACT-R. Dual-PECCS represents concepts using both classical and typical information based on dual-process theory. The objectives were (a) extending Dual-PECCS and (b) integrating it into ACT-R. Dual-PECCS was translated into ACT-R chunks and a new action allows accessing its subsystems. Experiments using riddles showed the integrated system had similar accuracy to humans in conceptual categorization and representation proxyfication. Future work includes improving generalization in the typical system and integrating Dual-PECCS into other architectures.
Electricity price forecasting with Recurrent Neural Networks (Taegyun Jeon)
This document discusses using recurrent neural networks (RNNs) for electricity price forecasting with TensorFlow. It begins with an introduction to the speaker, Taegyun Jeon from GIST. The document then provides an overview of RNNs and their implementation in TensorFlow. It describes two case studies - using an RNN to predict a sine function and using one to forecast electricity prices. The document concludes with information on running and evaluating the RNN graph and a question and answer section.
This document provides an introduction to exploring and visualizing data using the R programming language. It discusses the history and development of R, introduces key R packages like tidyverse and ggplot2 for data analysis and visualization, and provides examples of reading data, examining data structures, and creating basic plots and histograms. It also demonstrates more advanced ggplot2 concepts like faceting, mapping variables to aesthetics, using different geoms, and combining multiple geoms in a single plot.
How to win data science competitions with Deep Learning (Sri Ambati)
This document summarizes a presentation about how to win data science competitions using deep learning with H2O. It discusses H2O's architecture and capabilities for deep learning. It then demonstrates live modeling on Kaggle competitions, providing step-by-step explanations of building and evaluating deep learning models on three different datasets - an African soil properties prediction challenge, a display advertising challenge, and a Higgs boson machine learning challenge. It concludes with tips and tricks for deep learning with H2O and an invitation to the H2O World conference.
This document discusses applying data mining techniques to analyze active users on Reddit. It defines active users as those who posted or commented in at least 5 subreddits and have at least 5 posts/comments in each subreddit. The preprocessing steps extract over 25,000 active users and their posts from the raw Reddit data. K-means clustering is then used to cluster the active users into 10 groups based on their activities to gain insights into different types of active users on Reddit.
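A bare-bones k-means sketch of the clustering step described above, on one-dimensional toy activity counts (real Reddit features would be higher-dimensional, and all numbers here are invented):

```python
def kmeans(points, k, iters=20):
    """Lloyd's algorithm: alternate assignment and centroid-update steps.
    Initial centers are spread across the (sorted) input for stability."""
    centers = [points[i * (len(points) - 1) // (k - 1)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assignment: nearest center wins
            i = min(range(k), key=lambda c: abs(p - centers[c]))
            clusters[i].append(p)
        centers = [sum(c) / len(c) if c else centers[i]  # update step
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Three obvious groups of "activity counts".
points = [5, 6, 7, 50, 52, 55, 100, 104]
centers, clusters = kmeans(points, k=3)
print(sorted(round(c) for c in centers))  # one center per group
```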
Transform your Business with AI, Deep Learning and Machine Learning (Sri Ambati)
Video: https://www.youtube.com/watch?v=R3IXd1iwqjc
Meetup: http://www.meetup.com/SF-Bay-ACM/events/231709894/
In this talk, Arno Candel presents a brief history of AI and how Deep Learning and Machine Learning techniques are transforming our everyday lives. Arno will introduce H2O, a scalable open-source machine learning platform, and show live demos on how to train sophisticated machine learning models on large distributed datasets. He will show how data scientists and application developers can use the Flow GUI, R, Python, Java, Scala, JavaScript and JSON to build smarter applications, and how to take them to production. He will present customer use cases from verticals including insurance, fraud, churn, fintech, and marketing.
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Some resources how to navigate in the hardware space in order to build your own workstation for training deep learning models.
Alternative download link: https://www.dropbox.com/s/o7cwla30xtf9r74/deepLearning_buildComputer.pdf?dl=0
Deep learning - Conceptual understanding and applications (Buhwan Jeong)
This document provides an overview of deep learning, including conceptual understanding and applications. It defines deep learning as a deep and wide artificial neural network. It describes key concepts in artificial neural networks like signal transmission between neurons, graphical models, linear/logistic regression, weights/biases/activation, and backpropagation. It also discusses popular deep learning applications and techniques like speech recognition, natural language processing, computer vision, representation learning using restricted Boltzmann machines and autoencoders, and deep network architectures.
This document provides an introduction and outline for a machine learning summer school session on machine learning strategies for time series prediction. It introduces the speaker, Gianluca Bontempi, and his background in machine learning. It then discusses what attendees should know and expect to learn, including foundations of statistical machine learning and strategies for forecasting. The outline presents topics that will be covered, including notions of time series, machine learning for prediction, local learning approaches, forecasting techniques, applications, and future directions.
The document describes a research team that studies emerging methodologies in economics and finance, including complex networks and machine learning. The team is based at Democritus University of Thrace in Greece and led by Drs. Periklis Gogas and Theophilos Papadimitriou. Their research focuses on using techniques like minimum dominating sets and support vector machines to analyze economic and financial networks and time series data. They have found these emerging methodologies can more accurately represent network structures and forecast macroeconomic and financial variables compared to traditional econometric methods.
The document discusses the importance of mathematics in petroleum engineering. It outlines how mathematical modeling allows for virtual experimentation and analysis of complex phenomena. Areas like differential equations, numerical analysis, statistics, and harmonic analysis play key roles in modeling processes for applications such as image processing, telecommunications, data transmission, and oil prospecting. Developing strong links between mathematics and petroleum engineering is important for addressing challenges through interdisciplinary collaboration.
Time Series Weather Forecasting Techniques: Literature Survey (IRJET Journal)
This document summarizes various time series forecasting techniques discussed in literature, including ARIMA, Prophet, and LSTM models. It reviews their applications in weather forecasting, analyzing COVID-19 data, real estate prices, bitcoin values, and more. The key techniques are compared based on their forecasting accuracy on different datasets. ARIMA is generally good at capturing trends but requires stationary data, while Prophet and LSTM can handle non-stationary data and seasonal effects better. Prophet achieved 91% accuracy on a COVID dataset, outperforming ARIMA. LSTM achieved 76% accuracy for rainfall forecasting. The document concludes different approaches are still being developed to address the unique challenges of weather data forecasting.
THM1: Formalizing a problem as a prediction problem is often the most important contribution of a data scientist.
THM2: A predictor is an estimator, i.e. an algorithm which takes data and returns a prediction. Reality is stochastic, so data and predictions are stochastic.
THM3: Learning is challenging since data must be used both to create prediction models and to assess them. Bias and variance must be balanced to achieve good generalization.
This document discusses mathematical modeling and computer simulation. It begins by defining key terms such as mathematical model, computer modeling, and computer simulation. It then explains the process of creating a mathematical model to represent a real-world phenomenon and using that model to perform computer simulations. The success of a mathematical model depends on how accurately it predicts and explains the phenomenon. Examples of modeling a falling rock are provided to illustrate the modeling process.
Efficient commodity price forecasting using long short-term memory model (IAES IJAI)
Predicting commodity prices, particularly food prices, is a significant concern for various stakeholders, especially in regions that are highly sensitive to commodity price volatility. Historically, many machine learning models like autoregressive integrated moving average (ARIMA) and support vector machine (SVM) have been suggested for the forecasting task, but these models struggle to capture the multifaceted and dynamic factors influencing prices. Recently, deep learning approaches have demonstrated considerable promise in handling complex forecasting tasks. This paper presents a novel long short-term memory (LSTM) network-based model for commodity price forecasting. The model uses five essential commodities, namely bread, meat, milk, oil, and petrol, and focuses on advanced feature engineering involving moving averages, price volatility, and past prices. The results reveal that the model outperforms traditional methods, achieving 0.14, 3.04%, and 98.2% for root mean square error (RMSE), mean absolute percentage error (MAPE), and R-squared (R²), respectively. The model is also simple: a single-cell LSTM architecture that cuts training time from hours to a few minutes. This paper contributes to the economic literature on price prediction using advanced deep learning techniques and provides practical implications for managing commodity price instability globally.
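The three reported metrics can be written out directly from their standard definitions. The actual/predicted values below are invented and unrelated to the paper's data:

```python
import math

def rmse(y, yhat):
    """Root mean square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(y, yhat)) / len(y))

def mape(y, yhat):
    """Mean absolute percentage error, in percent."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(y, yhat)) / len(y)

def r_squared(y, yhat):
    """Coefficient of determination: 1 minus residual over total variance."""
    mean = sum(y) / len(y)
    ss_res = sum((a - p) ** 2 for a, p in zip(y, yhat))
    ss_tot = sum((a - mean) ** 2 for a in y)
    return 1 - ss_res / ss_tot

y, yhat = [10.0, 12.0, 14.0, 16.0], [10.5, 11.5, 14.5, 15.5]
print(rmse(y, yhat), mape(y, yhat), r_squared(y, yhat))
```

Lower is better for RMSE and MAPE, while R² approaches 1 for a good fit, which is how the 0.14 / 3.04% / 98.2% figures above should be read.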
This document discusses limitations and applications of statistics. It begins by covering limitations of statistics, such as it only dealing with quantitative data and groups/aggregates, and possible errors in statistical analysis. It then covers many fields that statistics can be applied to, such as actuarial science, biostatistics, econometrics, environmental statistics, epidemiology, and others. It concludes with sample multiple choice questions related to limitations and applications of statistics.
1. The document discusses simulation as a technique used to study and analyze the behavior of systems over time. Simulation involves creating a computer-based model of a real-world system to draw conclusions about how it operates.
2. Simulation can be used for task training, decision-making, scientific research, and predicting the behavior of natural systems. It allows testing alternatives without committing resources.
3. The document provides examples of how simulation can be used to model the operations of cooperative societies and banks to help students better understand commercial mathematics topics.
1. The document discusses simulation as a technique used to study and analyze the behavior of actual or theoretical systems by creating computer-based models. It is used when directly studying real systems is not possible or practical.
2. Simulation models can be static or dynamic, discrete or continuous, and deterministic or stochastic. They are composed of mathematical and logical relationships that are analyzed using numerical rather than analytical methods.
3. Simulation has many applications including manufacturing and materials handling systems. It allows testing designs and systems virtually before implementing them in the real world. It provides insights into how systems work and which variables most impact performance.
This document provides an introduction to machine learning, including definitions, applications, and types of problems. It discusses how machine learning can be applied to data from sensors in industrial settings for tasks like risk prevention in real-time, searching historical data for patterns related to failures, estimating machine lifespan and product quality, and enabling more automated intelligent systems. Large amounts of data are generated every day from various sources, making it impossible for humans to analyze, but machine learning can find hidden patterns in this data.
Operational research (OR) is a discipline that deals with applying advanced analytical methods to help make better decisions. OR uses scientific methods and especially mathematical modeling to study complex problems. It is considered a subfield of applied mathematics. Some key applications of OR include scheduling, facility planning, planning and forecasting, credit scoring, marketing, and defense planning. OR takes a systems approach, uses interdisciplinary teams, and aims to optimize objectives subject to constraints through quantitative modeling and analysis.
A heuristic approach for optimizing travel planning using genetics algorithm (eSAT Publishing House)
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academicians, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
A heuristic approach for optimizing travel planning using genetics algorithm (eSAT Journals)
Abstract: In today’s fast-paced society, everyone is caught up in the hustle and bustle of life, which has resulted in ineffective planning of their very important vacation tours. Either they spend much time deciding what to do next, or they take many unnecessary, unfocused and inefficient steps. The main purpose of our project is to develop a Travel Planner that allows the customer to plan the entire tour so that he visits many places in less time. The concept is implemented using the Genetics Algorithm of Artificial Intelligence as a search algorithm to find the near-optimal travel path. Moreover, in order to reduce the running time of the GA, parallelization of the Genetics Algorithm is demonstrated using the Hadoop Framework. Key Words: Genetics Algorithm, TSP, Hadoop, and MapReduce
Realtime anomaly detection in surveillance data.pptx (KingrockPeter)
The complexity of both natural and technological systems has reduced the ability of humans to monitor such systems and to detect and fix anomalies in real time, before they occur. In this talk, I examine types of anomalies and the different machine learning methods that can be applied to detect anomalies in time series signals. Further, I present results in detecting anomalies in epidemiological data with noise and uncertainties. I then extend the discussion of the methods and results to performance and security anomaly detection in cloud computing environments. I conclude that regardless of the source of the signal being analysed for anomaly detection, the concepts of outliers, weak signals and noisy signal processing can be combined with different model ensembles to achieve timely and robust detection of abnormal signal changes just before they occur.
Do we measure functional size or do we count (Thomas Fehlmann, IWSM Mensura)
This document contains the slides from a presentation given by Dr. Thomas Fehlmann on October 7, 2015 in Kraków, Poland. The presentation discusses using transfer functions and the solution profile to combine multiple functional size measurement methods into a single measurement in accordance with international standards for measurement. This allows defining a functional size unit specific to the transfer function and measurement goals. The solution profile shows the contribution of each size count to measurement precision and goals.
The document discusses various methods for demand forecasting, including qualitative and quantitative approaches. Qualitative methods involve expert judgment through individual specialists, group consensus, or the Delphi method. Quantitative methods include causal models using regression analysis and time series analysis. Simple time series models discussed are the projection method, simple moving average, weighted moving average, and basic exponential smoothing. Accuracy of forecasts is also addressed through measures like average error, bias, and mean absolute deviation. The document provides an example application of these time series models to a sample demand data set.
Computational optimization, modelling and simulation: Recent advances and ove...Xin-She Yang
This document summarizes recent advances in computational optimization, modeling, and simulation. It discusses how optimization is important for engineering design and industrial applications to maximize profits and minimize costs. Metaheuristic algorithms and surrogate-based optimization techniques are becoming widely used for complex optimization problems. The workshop accepted papers that applied optimization, modeling, and simulation to diverse areas like production planning, mixed-integer programming, electromagnetics, and reliability analysis. Overall computational optimization and modeling have broad applications and continued research is needed in areas like metaheuristic convergence and surrogate modeling methods.
The document discusses using machine learning algorithms and supervised learning methods to develop an automated system for detecting nanoparticles and estimating their size and spatial distribution from scanning electron microscope images. The goal is to enable industrial-scale manufacturing of nanomaterials by applying quality control tools. Specifically, the research uses support vector machines and scale-invariant feature transform to extract features from images and classify pixels as nanorods or background in order to predict locations and dimensions of nanorods.
Similar to Machine Learning Strategies for Time Series Prediction (20)
A statistical criterion for reducing indeterminacy in linear causal modelingGianluca Bontempi
This document proposes a new statistical criterion called C to help distinguish between causal patterns in completely connected triplets when inferring causal relationships from observational data. The criterion is based on differences in values of the term S, which is derived from the covariance matrix, between different causal hypotheses. This criterion informs an algorithm called RC that incorporates both relevance and causal measures to iteratively select variables. Experiments on linear and nonlinear networks show RC has higher accuracy than other algorithms at inferring network structure. The criterion C and RC algorithm help address challenges of causal inference from complex data where dependencies are frequent.
Adaptive model selection in Wireless Sensor NetworksGianluca Bontempi
This document discusses challenges in using wireless sensor networks for environmental monitoring applications. It notes that sensor nodes have limited energy, which poses challenges for applications that need to run for months or years. The document describes the hardware capabilities of wireless sensor nodes and their energy consumption during different operating modes. It also provides an overview of using machine learning models to model sensor measurements over time and across sensors, with the goal of reducing energy usage through adaptive model selection.
Combining Lazy Learning, Racing and Subsampling for Effective Feature SelectionGianluca Bontempi
This document discusses approaches to feature selection for machine learning models, specifically comparing global versus local modeling techniques. It proposes combining lazy learning, racing, and subsampling for effective feature selection. Lazy learning uses local linear models for prediction rather than global nonlinear models, improving computational efficiency when many predictions are needed. Racing and subsampling allow efficient evaluation of feature subsets during wrapper-based feature selection by discarding poor-performing subsets early based on statistical tests of performance on subsets of the data. Experimental results are said to validate this combined approach for feature selection.
A model-based relevance estimation approach for feature selection in microarr...Gianluca Bontempi
This document presents a model-based approach for estimating feature relevance for feature selection in microarray datasets. It aims to provide an unbiased relevance estimation between filter and wrapper methods. The approach combines a low-bias k-nearest neighbor cross-validation error estimator with either a direct probability model estimator or a mutual information filter estimator to reduce variance. Experimental results on 20 public microarray datasets compare the proposed combined estimators to a support vector machine wrapper approach.
This document discusses using feature selection techniques to address the curse of dimensionality in microarray data analysis. It presents the problem of having many more features than samples in bioinformatics tasks like cancer classification and network inference. It describes filter, wrapper and embedded feature selection approaches and proposes a blocking strategy that uses multiple learning algorithms to evaluate feature subsets in order to improve selection robustness when samples are limited. Finally, it lists several microarray gene expression datasets that are commonly used to evaluate feature selection methods.
FP7 evaluation & selection: the point of view of an evaluatorGianluca Bontempi
The document discusses the process of evaluating proposals for EU funding as an EU evaluator. It begins by introducing the author's expertise and background evaluating FP6 and FP7 proposals. It then outlines the evaluation process, which involves individual evaluation of assigned proposals followed by consensus building and panel evaluation. Key aspects covered include managing conflicts of interest, maintaining confidentiality, and adhering to a code of conduct. The evaluation criteria for integrated projects focus on relevance to program objectives, potential impact, scientific and technological excellence, quality of consortium, and quality of management.
Local modeling in regression and time series predictionGianluca Bontempi
The document discusses global modeling versus local modeling approaches for regression and time series prediction problems. Global modeling fits a single analytical function to all input data, while local modeling performs separate fits to subsets of nearby data points. The document outlines the local modeling approach using lazy learning, which stores all training data and performs local fits when making predictions for new query points. It then applies lazy learning techniques to problems in regression, time series prediction, and feature selection.
This document discusses feature selection methods for causal inference in bioinformatics. It describes how relevance and causality differ, with relevant features not always being causal. Information theory concepts like mutual information, conditional mutual information, and interaction information are introduced to quantify dependence and independence between variables. The min-Interaction Max-Relevance (mIMR) filter method is proposed to select features based on both relevance to the target and minimal interaction, approximating causal relationships. Experimental results on breast cancer gene expression datasets show mIMR outperforms conventional ranking in predictive performance, identifying a potential causal signature for survival.
Why You Need Real-Time Data to Compete in E-CommercePromptCloud
In the fast-paced world of e-commerce, real-time data is crucial for staying competitive. By accessing up-to-date information on market trends, competitor pricing, and customer preferences, businesses can make informed decisions quickly. Real-time data enables dynamic pricing strategies, effective inventory management, and personalized marketing efforts, all of which are essential for meeting customer demands and outperforming competitors. Embrace real-time data to stay agile, optimize your operations, and drive growth in the ever-evolving e-commerce landscape. Get in touch for custom web scraping services: https://bit.ly/3WkqYVm
Data analytics is a powerful tool that can transform business decision-making across industries. Contact District 11 Solutions, which specializes in data analytics, to make informed decisions and achieve your business goals.
Introduction to Data Science
1.1 What is Data Science, importance of data science,
1.2 Big data and data Science, the current Scenario,
1.3 Industry Perspective Types of Data: Structured vs. Unstructured Data,
1.4 Quantitative vs. Categorical Data,
1.5 Big Data vs. Little Data, Data science process
1.6 Role of Data Scientist
Databricks Vs Snowflake off Page PDF submission.pptxdewsharon760
Discover the key differences between Databricks and Snowflake. Learn about their features, use cases, and how to choose the right data platform for your business needs.
Hadoop Vs Snowflake Blog PDF Submission.pptxdewsharon760
Explore the key differences between Hadoop and Snowflake. Understand their unique features, use cases, and how to choose the right data platform for your needs.
Toward a National Research Platform to Enable Data-Intensive Open-Source Sci...
Machine Learning Strategies for Time Series Prediction
1. Machine Learning Strategies for Time
Series Prediction
Machine Learning Summer School
(Hammamet, 2013)
Gianluca Bontempi
Machine Learning Group, Computer Science Department
Boulevard de Triomphe - CP 212
http://www.ulb.ac.be/di
MLSS 2013, Hammamet - Machine Learning Strategies for Prediction – p. 1/128
2. Introducing myself
• 1992: Computer science engineer (Politecnico di Milano, Italy),
• 1994: Researcher in robotics in IRST, Trento, Italy,
• 1995: Researcher in IRIDIA, ULB Artificial Intelligence Lab, Brussels,
• 1996-97: Researcher in IDSIA, Artificial Intelligence Lab, Lugano,
Switzerland,
• 1998-2000: Marie Curie fellowship in IRIDIA, ULB Artificial Intelligence
Lab, Brussels,
• 2000-2001: Scientist in Philips Research, Eindhoven, The Netherlands,
• 2001-2002: Scientist in IMEC, Microelectronics Institute, Leuven,
Belgium,
• since 2002: professor in Machine Learning, Modeling and Simulation,
Bioinformatics in ULB Computer Science Dept., Brussels,
• since 2004: head of the ULB Machine Learning Group (MLG).
• since 2013: director of the Interuniversity Institute of Bioinformatics in
Brussels (IB)², ibsquare.be.
3. ... and in terms of distances
According to MathSciNet
• Distance from Erdős = 5
• Distance from Zaiane = 6
• Distance from Deisenroth = 8
4. ULB Machine Learning Group (MLG)
• 3 professors, 10 PhD students, 5 postdocs.
• Research topics: Knowledge discovery from data, Classification, Computational
statistics, Data mining, Regression, Time series prediction, Sensor networks,
Bioinformatics, Network inference.
• Computing facilities: high-performing cluster for analysis of massive datasets,
Wireless Sensor Lab.
• Website: mlg.ulb.ac.be.
• Scientific collaborations in ULB: Hopital Jules Bordet, Laboratoire de Médecine
experimentale, Laboratoire d’Anatomie, Biomécanique et Organogénèse (LABO),
Service d’Anesthesie (ERASME).
• Scientific collaborations outside ULB: Harvard Dana Farber (US), UCL Machine
Learning Group (B), Politecnico di Milano (I), Universitá del Sannio (I), Inst Rech
Cliniques Montreal (CAN).
5. ULB-MLG: recent projects
1. Machine Learning for Question Answering (2013-2014).
2. Adaptive real-time machine learning for credit card fraud detection (2012-2013).
3. Epigenomic and Transcriptomic Analysis of Breast Cancer (2012-2015).
4. Discovery of the molecular pathways regulating pancreatic beta cell dysfunction
and apoptosis in diabetes using functional genomics and bioinformatics: ARC
(2010-2015)
5. ICT4REHAB - Advanced ICT Platform for Rehabilitation (2011-2013)
6. Integrating experimental and theoretical approaches to decipher the molecular
networks of nitrogen utilisation in yeast: ARC (2006-2010).
7. TANIA - Système d’aide à la conduite de l’anesthésie. WALEO II project funded
by the Région Wallonne (2006-2010)
8. "COMP²SYS" (COMPutational intelligence methods for COMPlex SYStems),
MARIE CURIE Early Stage Research Training funded by the EU (2004-2008).
6. What you are supposed to know
• Basic notions of probability and statistics
• Random variable
• Expectation, variance, covariance
• Least-squares
What you are expected to get acquainted with
• Foundations of statistical machine learning
• How to build a predictive model from data
• Strategies for forecasting
What will remain
• Interest, curiosity for machine learning
• taste for prediction
• Contacts
• Companion webpage
http://www.ulb.ac.be/di/map/gbonte/mlss.html
7. Outline
• Notions of time series (30 mins)
• conditional probability
• Machine learning for prediction (45 mins)
• bias/variance
• parametric and structural identification
• validation
• model selection
• feature selection
• COFFEE BREAK
• Local learning (15 mins)
• Forecasting: one-step and multi-step-ahead (30 mins)
• Some applications (15 mins)
• time series competitions
• wireless sensor
• biomedical
• Future directions and perspectives (15 mins)
8. What is machine learning?
Machine learning is that domain of computational intelligence which is
concerned with the question of how to construct computer programs that
automatically improve with experience. [16]
Reductionist attitude: ML is just a buzzword which equates to statistics plus
marketing
Positive attitude: ML paved the way to the treatment of real problems related
to data analysis, sometimes overlooked by statisticians (nonlinearity,
classification, pattern recognition, missing variables, adaptivity,
optimization, massive datasets, data management, causality,
representation of knowledge, parallelisation)
Interdisciplinary attitude: ML should have its roots in statistics and
complement it by focusing on: algorithmic issues, computational
efficiency, data engineering.
9. Why study machine learning?
• Machine learning is cool.
• Practical way to understand: All models are wrong but some are useful...
• The fastest way to become a data scientist ... the sexiest job in the 21st
century
• Someone who knows statistics better than a computer scientist and
programs better than a statistician...
10. Notion of time series
11. Time series
Definition A time series is a sequence of observations st ∈ R, usually
ordered in time.
Examples of time series in every scientific and applied domain:
• Meteorology: weather variables, like temperature, pressure, wind.
• Economy and finance: economic factors (GNP), financial indexes,
exchange rate, spread.
• Marketing: activity of business, sales.
• Industry: electric load, power consumption, voltage, sensors.
• Biomedicine: physiological signals (EEG), heart-rate, patient
temperature.
• Web: clicks, logs.
• Genomics: time series of gene expression during cell cycle.
12. Why study time series?
There are various reasons:
Prediction of the future based on the past.
Control of the process producing the series.
Understanding of the mechanism generating the series.
Description of the salient features of the series.
13. Univariate discrete time series
• Quantities, like temperature and voltage, change in a continuous way.
• In practice, however, the digital recording is made discretely in time.
• We shall confine ourselves to discrete time series (which however take
continuous values).
• Moreover we will consider univariate time series, where one type of
measurement is made repeatedly on the same object or individual.
• Multivariate time series are out of the scope of this presentation but
represent an important topic in the domain.
14. A general model
Let an observed discrete univariate time series be s1, . . . , sT . This means
that we have T numbers which are observations on some variable made at T
equally distant time points, which for convenience we label 1, 2, . . . , T.
A fairly general model for the time series can be written
st = g(t) + ϕt t = 1, . . . , T
The observed series is made of two components
Systematic part: g(t), also called signal or trend, which is a deterministic
function of time
Stochastic sequence: a residual term ϕt, also called noise, which follows a
probability law.
15. Types of variation
Traditional methods of time-series analysis are mainly concerned with
decomposing the variation of a series st into:
Trend : this is a long-term change in the mean level, e.g. an increasing trend.
Seasonal effect : many time series (sale figures, temperature readings) exhibit
variation which is seasonal (e.g. annual) in period. Measuring and
removing such variation yields deseasonalized data.
Irregular fluctuations : after trend and cyclic variations have been removed
from a set of data, we are left with a series of residuals, which may or
may not be completely random.
We will assume here that once we have detrended and deseasonalized the
series, we can still extract information about the dependency between the
past and the future. Henceforth ϕt will denote the detrended and
deseasonalized series.
17. Probability and dependency
• Forecasting a time series is possible since the future depends on the past or,
analogously, because there is a relationship between the future and the
past. However this relation is not deterministic and can hardly be written
in an analytical form.
• An effective way to describe a nondeterministic relation between two
variables is provided by the probability formalism.
• Consider two continuous random variables ϕ1 and ϕ2 representing for
instance the temperature today (time t1) and tomorrow (t2). We tend to
believe that ϕ1 could be used as a predictor of ϕ2 with some degree of
uncertainty.
• The stochastic dependency between ϕ1 and ϕ2 is summarized by the joint
density p(ϕ1, ϕ2) or equivalently by the conditional probability
p(ϕ2|ϕ1) = p(ϕ1, ϕ2) / p(ϕ1)
• If p(ϕ2|ϕ1) ≠ p(ϕ2) then ϕ1 and ϕ2 are not independent or, equivalently,
the knowledge of the value of ϕ1 reduces the uncertainty about ϕ2.
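The uncertainty-reduction idea above can be checked numerically. The following Python sketch (not from the slides; NumPy assumed, and the correlation ρ = 0.8 is an arbitrary choice) draws a jointly Gaussian pair (ϕ1, ϕ2) and compares the marginal variance of ϕ2 with its variance once ϕ1 is approximately known:

```python
import numpy as np

# Two jointly Gaussian variables with correlation rho; for Gaussians,
# conditioning on phi1 shrinks Var[phi2] from sigma^2 to sigma^2*(1 - rho^2).
rng = np.random.default_rng(0)
rho, sigma, N = 0.8, 1.0, 200_000

phi1 = rng.normal(0.0, sigma, N)
phi2 = rho * phi1 + rng.normal(0.0, sigma * np.sqrt(1 - rho**2), N)

var_marginal = phi2.var()             # close to sigma^2 = 1.0
mask = np.abs(phi1 - 1.0) < 0.05      # condition on phi1 being near 1.0
var_conditional = phi2[mask].var()    # close to 1 - rho^2 = 0.36

print(var_marginal, var_conditional)
```

Knowing ϕ1 shrinks the variance of ϕ2 from about 1.0 to about 0.36 here, which is exactly the sense in which ϕ1 "reduces the uncertainty" about ϕ2.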
18. Stochastic processes
• The stochastic approach to time series makes the assumption that a
time series is a realization of a stochastic process (like tossing an
unbiased coin is the realization of a discrete random variable with equal
head/tail probability).
• A discrete-time stochastic process is a collection of random variables ϕt,
t = 1, . . . , T defined by a joint density
p(ϕ1, . . . , ϕT )
• Statistical time-series analysis is concerned with evaluating the
properties of the probability model which generated the observed time
series.
• Statistical time-series modeling is concerned with inferring the properties
of the probability model which generated the observed time series from a
limited set of observations.
19. Strictly stationary processes
• Predicting a time series is possible if and only if the dependence
between values existing in the past is preserved also in the future.
• In other terms, though measures change, the stochastic rule underlying
their realization does not. This aspect is formalized by the notion of
stationarity.
• Definition A stochastic process is said to be strictly stationary if the joint
distribution of ϕt1, ϕt2, . . . , ϕtn is the same as the joint distribution of
ϕt1+k, ϕt2+k, . . . , ϕtn+k for all n, t1, . . . , tn and k.
• In other words shifting the time origin by an amount k has no effect on
the joint distribution which depends only on the intervals between
t1, . . . , tn.
• This implies that the (marginal) distribution of ϕt is the same for all t.
• The definition holds for any value of n.
• Let us see what it means in practice for n = 1 and n = 2.
20. Properties
n=1 : If ϕt is strictly stationary and its first two moments are finite, we have
E[ϕt] = µt = µ,    Var[ϕt] = σt² = σ²
n=2 : Furthermore the autocovariance function γ(t1, t2) depends only on the
lag k = t2 − t1 and may be written as
γ(k) = Cov[ϕt, ϕt+k] = E[(ϕt − µ)(ϕt+k − µ)]
• In order to avoid scaling effects, it is useful to introduce the
autocorrelation function
ρ(k) = γ(k)/σ² = γ(k)/γ(0)
• Another relevant function is the partial autocorrelation function π(k),
where π(k), k > 1, measures the degree of association between ϕt and
ϕt−k when the effects of the intermediary lags 1, . . . , k − 1 are removed.
21. Weak stationarity
• A less restricted definition of stationarity concerns only the first two
moments of ϕt
Definition A process is called second-order stationary or weakly stationary
if its mean is constant and its autocovariance function depends only on
the lag.
• No assumptions are made about higher moments than those of second
order.
• Strict stationarity implies weak stationarity but not vice versa in general.
Definition A process is called normal if the joint distribution of
ϕt1, ϕt2, . . . , ϕtn is multivariate normal for all t1, . . . , tn.
• In the special case of normal processes, weak stationarity implies strict
stationarity. This is due to the fact that a normal process is completely
specified by the mean and the autocovariance function.
22. Estimators of first moments
Here are some common estimators of the first two moments of a
time series:
• The empirical mean is given by
ˆµ = (Σ_{t=1}^{T} ϕt) / T
• The empirical autocovariance function is given by
ˆγ(k) = Σ_{t=1}^{T−k} (ϕt − ˆµ)(ϕt+k − ˆµ) / (T − k − 1),    k < T − 2
• The empirical autocorrelation function is given by
ˆρ(k) = ˆγ(k) / ˆγ(0)
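These estimators translate directly into code. A minimal Python sketch (NumPy assumed; the helper names `mu_hat`, `gamma_hat` and `rho_hat` are mine, not from the slides):

```python
import numpy as np

# Plain transcription of the empirical mean, autocovariance and
# autocorrelation estimators above.
def mu_hat(phi):
    return phi.sum() / len(phi)

def gamma_hat(phi, k):
    T, m = len(phi), mu_hat(phi)
    return ((phi[:T - k] - m) * (phi[k:] - m)).sum() / (T - k - 1)

def rho_hat(phi, k):
    return gamma_hat(phi, k) / gamma_hat(phi, 0)

# Sanity check on white noise: rho(0) = 1 and rho(k) is near 0 for k > 0.
y = np.random.default_rng(1).normal(size=5000)
print(rho_hat(y, 0), rho_hat(y, 1))
```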
23. Some examples of stochastic
processes
24. Purely random processes
• It consists of a sequence of random variables ϕt which are mutually
independent and identically distributed. For each t and k
p(ϕt+k|ϕt) = p(ϕt+k)
• It follows that this process has constant mean and variance. Also
γ(k) = Cov[ϕt, ϕt+k] = 0
for k = ±1, ±2, . . . .
• A purely random process is strictly stationary.
• A purely random process is sometimes called white noise by engineers.
• An example of purely random process is the series of numbers drawn by
a roulette wheel in a casino.
25. Example: Gaussian purely random
[Figure: realization of a Gaussian white-noise series, y vs. t]
26. Example: autocorrelation function
[Figure: sample autocorrelation function (ACF) of the white-noise series vs. lag]
27. Random walk
• Suppose that wt is a discrete, purely random process with mean µ and
variance σw².
• A process ϕt is said to be a random walk if
ϕt = ϕt−1 + wt
• The next value of a random walk is obtained by summing a random
shock to the latest value.
• If ϕ0 = 0 then
ϕt = Σ_{i=1}^{t} wi
• E[ϕt] = tµ and Var[ϕt] = tσw².
• As the mean and variance change with t the process is non-stationary.
28. Random walk (II)
• The first differences of a random walk given by
∇ϕt = ϕt − ϕt−1
form a purely random process, which is stationary.
• Examples of time series which behave like random walks are
• stock prices on successive days.
• the path traced by a molecule as it travels in a liquid or a gas,
• the search path of a foraging animal
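Both claims, non-stationarity of the walk and stationarity of its first differences, are easy to verify by simulation. A Python sketch (NumPy assumed; path length and count are arbitrary choices):

```python
import numpy as np

# Simulate many random walks: across realizations, Var[phi_t] grows like
# t * sigma_w^2, while the first differences recover the purely random w_t.
rng = np.random.default_rng(2)
t_max, n_paths = 400, 2000
w = rng.normal(0.0, 1.0, (n_paths, t_max))
phi = w.cumsum(axis=1)                 # phi_t = w_1 + ... + w_t (phi_0 = 0)

var_final = phi[:, -1].var()           # across paths, roughly t_max * 1.0
diff = np.diff(phi, axis=1)            # nabla phi_t = phi_t - phi_{t-1}
print(var_final / t_max, diff.var())
```

The ratio `var_final / t_max` and the variance of the differences both come out close to 1, matching Var[ϕt] = tσw² and the stationarity of ∇ϕt.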
29. Ten random walks
Let w ∼ N(0, 1).
[Figure: ten realizations of a random walk over 500 steps]
30. Autoregressive processes
• Suppose that wt is a purely random process with mean zero and
variance σw².
• A process ϕt is said to be an autoregressive process of order n (also an
AR(n) process) if
ϕt = α1ϕt−1 + · · · + αnϕt−n + wt
• This means that the next value is a linear weighted sum of the past n
values plus a random shock.
• Finite memory filter.
• If w is a normal variable, ϕt will be normal too.
• Note that this is like a linear regression model where ϕ is regressed not
on independent variables but on its past values (hence the prefix “auto”).
• The stationarity properties depend on the values αi, i = 1, . . . , n.
31. First order AR(1) process
If n = 1, we have the so-called Markov process AR(1)
ϕt = αϕt−1 + wt
By substitution it can be shown that
ϕt = α(αϕt−2 + wt−1) + wt = α²(αϕt−3 + wt−2) + αwt−1 + wt = . . .
= wt + αwt−1 + α²wt−2 + . . .
Then
E[ϕt] = 0    Var[ϕt] = σw²(1 + α² + α⁴ + . . . )
Then if |α| < 1 the variance is finite and equals
Var[ϕt] = σϕ² = σw²/(1 − α²)
and the autocorrelation is
ρ(k) = α^k,    k = 0, 1, 2, . . .
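The closed-form variance and autocorrelation can be checked by simulation. In the Python sketch below (NumPy assumed; α = 0.6 is an arbitrary choice), Var[ϕt] should approach σw²/(1 − α²) = 1.5625 and ρ(k) should approach α^k:

```python
import numpy as np

# Simulate a long AR(1) path with |alpha| < 1 and compare against the
# stationary variance and autocorrelation formulas above.
rng = np.random.default_rng(3)
alpha, sigma_w, T = 0.6, 1.0, 200_000

phi = np.zeros(T)
w = rng.normal(0.0, sigma_w, T)
for t in range(1, T):
    phi[t] = alpha * phi[t - 1] + w[t]
phi = phi[1000:]                      # discard the initial transient

def rho_hat(k):
    c = phi - phi.mean()
    return (c[:-k] * c[k:]).mean() / c.var()

print(phi.var(), rho_hat(1), rho_hat(2))
```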
32. General order AR(n) process
• It can be shown that a necessary and sufficient condition for
stationarity is that the complex roots of the equation
φ(z) = 1 − α1z − · · · − αnzⁿ = 0
lie outside the unit circle.
• The autocorrelation function of an AR(n) attenuates slowly with the lag k
(exponential decay or damped sine wave pattern).
• On the contrary the partial autocorrelation function cuts off at k > n, i.e.
it is not significantly different from zero beyond the lag n.
33. Example: AR(2)
[Figure: realization of an AR(2) process, y vs. t]
34. Example: AR(2)
[Figure: sample ACF of the AR(2) series vs. lag]
35. Partial autocorrelation in AR(2)
[Figure: sample partial autocorrelation function (PACF) of the AR(2) series vs. lag]
36. Fitting an autoregressive process
Fitting an autoregressive process to a set of data
DT = {ϕ1, . . . , ϕT } requires solving two problems:
1. The estimation of the order n of the process. This is typically supported
by the analysis of the partial autocorrelation function.
2. The estimation of the set of parameters {α1, . . . , αn}.
37. Estimation of AR(n) parameters
Suppose we have an AR(n) process of order n
ϕt = α1ϕt−1 + · · · + αnϕt−n + wt
Given T observations, the parameters may be estimated by least squares by
minimizing
ˆα = arg min_α Σ_{t=n+1}^{T} [ϕt − α1ϕt−1 − · · · − αnϕt−n]²
In matrix form this amounts to solving the multiple least-squares problem
Y = Xα where

X = | ϕT−1  ϕT−2  . . .  ϕT−n   |        Y = | ϕT   |
    | ϕT−2  ϕT−3  . . .  ϕT−n−1 |            | ϕT−1 |
    | . . .                     |            | . . .|
    | ϕn    ϕn−1  . . .  ϕ1     |            | ϕn+1 |        (1)
38. Least-squares estimation of AR(n) parms
• Let N be the number of rows of X. In order to estimate the AR(n)
parameters we compute the least-squares estimator
ˆα = arg min_a Σ_{i=1}^{N} (yi − xiᵀa)² = arg min_a (Y − Xa)ᵀ(Y − Xa)
• It can be shown that
ˆα = (XᵀX)⁻¹XᵀY
where XᵀX is a symmetric [n × n] matrix which plays an
important role in multiple linear regression.
• Conventional linear regression theory also provides confidence intervals
and significance tests for the AR(n) coefficients.
• A recursive version of least squares, i.e. where time samples arrive
sequentially, is provided by the RLS algorithm.
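The whole estimation procedure fits in a few lines. A Python sketch (NumPy assumed; `fit_ar` is a hypothetical helper of mine, and `np.linalg.lstsq` is used instead of forming (XᵀX)⁻¹XᵀY explicitly, which is numerically safer but solves the same least-squares problem):

```python
import numpy as np

# Build the lagged design matrix X and target Y as on the slide, then solve
# the least-squares problem Y = X alpha.
def fit_ar(phi, n):
    T = len(phi)
    # column j holds the lag-(j+1) values: row for target phi_t is
    # [phi_{t-1}, ..., phi_{t-n}]
    X = np.column_stack([phi[n - j - 1 : T - j - 1] for j in range(n)])
    Y = phi[n:]
    alpha_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return alpha_hat

# Check: recover the coefficients of a simulated stationary AR(2) process.
rng = np.random.default_rng(4)
alpha_true = np.array([0.5, -0.3])
phi = np.zeros(50_000)
w = rng.normal(size=50_000)
for t in range(2, 50_000):
    phi[t] = alpha_true @ phi[t - 2 : t][::-1] + w[t]
print(fit_ar(phi, 2))
```

On a long simulated series the recovered coefficients come out close to the true (0.5, −0.3).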
39. From linear to nonlinear setting
40. The NAR representation
• AR models assume that the relation between past and future is linear
• Once we assume that the linear assumption does not hold, we may
extend the AR formulation to a Nonlinear Auto Regressive (NAR)
formulation
ϕt = f (ϕt−1, ϕt−2, . . . , ϕt−n) + w(t)
where the missing information is lumped into a noise term w.
• In what follows we will consider this relationship as a particular instance
of a dependence
y = f(x) + w
between a multidimensional input x ∈ X ⊂ Rⁿ and a scalar output y ∈ R.
• NOTA BENE. In what follows y will denote the next value ϕt to be
predicted and
x = [ϕt−1, ϕt−2, . . . , ϕt−n]
will denote the n-dimensional input vector also known as embedding
vector.
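The mapping from a series to supervised input/output pairs is mechanical. A Python sketch (NumPy assumed; `make_embedding` is a name of my choosing):

```python
import numpy as np

# Turn a series into (x, y) pairs where x = [phi_{t-1}, ..., phi_{t-n}] is
# the embedding vector and y = phi_t is the value to predict.
def make_embedding(series, n):
    X = np.array([series[t - n : t][::-1] for t in range(n, len(series))])
    y = series[n:]
    return X, y

s = np.arange(10.0)          # toy series 0, 1, ..., 9
X, y = make_embedding(s, n=3)
print(X[0], y[0])            # x = [2., 1., 0.], y = 3.0
```

Any regression learner can then be trained on (X, y), which is exactly the reformulation the slide describes.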
41. Nonlinear vs. linear time series
The advantages of linear models are numerous:
• the least-squares estimate ˆα can be expressed in an analytical form,
• the least-squares estimate ˆα can be easily calculated through matrix
computation,
• statistical properties of the estimator can be easily defined,
• recursive formulations for sequential updating are available,
• the relation between empirical and generalization error is known,
BUT...
42. Nonlinear vs. linear time series
• linear methods interpret all the structure in a time series through linear
correlation
• deterministic linear dynamics can only lead to simple exponential or
periodically oscillating behavior, so all irregular behavior is attributed to
external noise while deterministic nonlinear equations could produce
very irregular data,
• in real problems it is extremely unlikely that the variables are linked by a
linear relation.
In practice, the form of the relation is often unknown and only a limited
amount of samples is available.
43. Machine learning for prediction
44. Supervised learning
[Diagram: supervised learning setting. An unknown dependency links input and
output; a prediction model is fitted on the training dataset and its
prediction is compared with the observed output to yield the error.]
From now on we consider the prediction problem as a supervised learning
problem, where we have to infer from historical data the possibly
nonlinear dependency between the input (past embedding vector) and the
output (future value).
Statistical machine learning is the discipline concerned with this problem.
45. The regression plus noise form
• A typical way of representing the unknown input/output relation is the
regression plus noise form
y = f(x) + w
where f(·) is a deterministic function and the term w represents the
noise or random error. It is typically assumed that w is independent of x
and E[w] = 0.
• Suppose we have available a training set {⟨xi, yi⟩ : i = 1, . . . , N}, with xi = (xi1, . . . , xin), generated according to the previous model.
• The goal of a learning procedure is to estimate a model ˆf(x) which is
able to give a good approximation of the unknown function f(x).
• But how to choose ˆf, if we do not know the probability distribution
underlying the data and we have only a limited training set?
46. A simple example with n = 1
[Scatter plot of the training pairs (x, y)]
NOTA BENE: this is NOT a time series! Here y = ϕt and x = ϕt−1. The horizontal axis does not represent time but the past value of the series.
47. Model degree 1
[Plot of the data with the fitted model; training error = 2, degree = 1]
ˆf(x) = α0 + α1x
48. Model degree 3
[Plot of the data with the fitted model; training error = 0.92, degree = 3]
ˆf(x) = α0 + α1x + · · · + α3x^3
49. Model degree 18
[Plot of the data with the fitted model; training error = 0.4, degree = 18]
ˆf(x) = α0 + α1x + · · · + α18x^18
50. Generalization and overfitting
• How to estimate the quality of a model? Is the training error a good
measure of the quality?
• The goal of learning is to find a model which is able to generalize, i.e.
able to return good predictions for input values independent of the
training set.
• In a nonlinear setting, it is possible to find models with such a complicated structure that they have null training error. Are these models good?
• Typically NOT, since doing very well on the training set could mean doing badly on new data.
• This is the phenomenon of overfitting.
• Using the same data for training a model and assessing it is typically a wrong procedure, since it returns an over-optimistic assessment of the model's generalization capability.
51. Bias and variance of a model
• A fundamental result of estimation theory shows that the mean integrated squared error (MISE), a measure of the generalization quality of an estimator, can be decomposed into three terms:

MISE = σ²w + squared bias + variance

where the intrinsic noise term σ²w reflects the target alone, the bias reflects the target's relation with the learning algorithm, and the variance term reflects the learning algorithm alone.
• This result is purely theoretical, since these quantities cannot be measured on the basis of a finite amount of data.
• However, it provides insight into what makes a learning process accurate.
52. The bias/variance trade-off
• The first term is the variance of y around its true mean f(x) and cannot be avoided no matter how well we estimate f(x), unless σ²w = 0.
• The bias measures the difference, in x, between the average of the outputs of the hypothesis functions ˆf over the set of possible training sets DN and the regression function value f(x).
• The variance reflects the variability of the guessed ˆf(x, αN) as one varies over training sets of fixed size N. This quantity measures how sensitive the algorithm is to changes in the data set, regardless of the target.
53. The bias/variance dilemma
• The designer of a learning machine has no access to the term MISE but can only estimate it on the basis of the training set. Hence, the bias/variance decomposition is relevant in practical learning since it provides a useful hint about the features to control in order to keep the error MISE small.
• The bias term measures the lack of representational power of the class
of hypotheses. To reduce the bias term we should consider complex
hypotheses which can approximate a large number of input/output
mappings.
• The variance term warns us against an excessive complexity of the
approximator. This means that a class of too powerful hypotheses runs
the risk of being excessively sensitive to the noise affecting the training
set; therefore, our class could contain the target but it could be
practically impossible to find it out on the basis of the available dataset.
54. • In other terms, it is commonly said that a hypothesis with large bias but low variance underfits the data, while a hypothesis with low bias but large variance overfits the data.
• In both cases, the hypothesis gives a poor representation of the target
and a reasonable trade-off needs to be found.
• The task of the model designer is to search for the optimal trade-off
between the variance and the bias term, on the basis of the available
training set.
56. The learning procedure
A learning procedure aims at two main goals:
1. to choose a parametric family of hypotheses ˆf(x, α) which contains, or gives a good approximation of, the unknown function f (structural identification);
2. within the family ˆf(x, α), to estimate on the basis of the training set DN the parameter αN which best approximates f (parametric identification).
In order to accomplish that, a learning procedure is made of two nested
loops:
1. an external structural identification loop which goes through different
model structures
2. an inner parametric identification loop which searches for the best
parameter vector within the family structure.
57. Parametric identification
The parametric identification of the hypothesis is done according to the ERM (Empirical Risk Minimization) principle, where

αN = α(DN) = arg min_{α∈Λ} MISEemp(α)

minimizes the training error

MISEemp(α) = (1/N) Σ_{i=1}^{N} (yi − ˆf(xi, α))²

constructed on the basis of the training data set DN.
58. Parametric identification (II)
• The computation of αN requires a procedure of multivariate optimization
in the space of parameters.
• The complexity of the optimization depends on the form of ˆf(·).
• In some cases the parametric identification problem may be an NP-hard
problem.
• Thus, we must resort to some form of heuristic search.
• Examples of parametric identification procedures are linear least squares for linear models and backpropagated gradient descent for feedforward neural networks.
59. Model assessment
• We have seen before that the training error is not a good estimator (i.e. it
is too optimistic) of the generalization capability of the learned model.
• Two alternatives exist:
1. Complexity-based penalty criteria
2. Data-driven validation techniques
60. Complexity-based penalization
• In conventional statistics, various criteria have been developed, often in
the context of linear models, for assessing the generalization
performance of the learned hypothesis without the use of further
validation data.
• Such criteria take the form of a sum of two terms
ˆGPE = MISEemp + complexity term
where the complexity term represents a penalty which grows as the
number of free parameters in the model grows.
• This expression quantifies the qualitative consideration that simple
models return high training error with a reduced complexity term while
complex models have a low training error thanks to the high number of
parameters.
• The minimum for the criterion represents a trade-off between
performance on the training set and complexity.
61. Complexity-based penalty criteria
If the input/output relation is linear, well-known examples of complexity based
criteria are:
• the Final Prediction Error (FPE)

FPE = MISEemp(αN) · (1 + p/N)/(1 − p/N)

with p = n + 1,
• the Generalized Cross-Validation (GCV)

GCV = MISEemp(αN) / (1 − p/N)²

• the Akaike Information Criterion (AIC)

AIC = p/N − (1/N) L(αN)

where L(·) is the log-likelihood function,
62. • the Cp criterion proposed by Mallows

Cp = MISEemp(αN)/ˆσ²w + 2p − N

where ˆσ²w is an estimate of the variance of the noise,
• the Predicted Squared Error (PSE)

PSE = MISEemp(αN) + 2 ˆσ²w p/N

where ˆσ²w is an estimate of the variance of the noise.
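A minimal sketch of how such a penalized criterion is used for selection; the data-generating model, the polynomial candidates and the choice of FPE below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100
x = rng.uniform(-1, 1, N)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, N)    # the true dependency is linear

def fpe(mise_emp, p, N):
    # Final Prediction Error: training error times the (1 + p/N)/(1 - p/N) penalty.
    return mise_emp * (1 + p / N) / (1 - p / N)

scores = {}
for degree in (1, 2, 8):
    X = np.vander(x, degree + 1)             # design matrix with p = degree + 1 parameters
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    mise_emp = np.mean((y - X @ coef)**2)
    scores[degree] = fpe(mise_emp, degree + 1, N)

print(scores, "selected degree:", min(scores, key=scores.get))
```

The degree-8 model has the lowest training error, but the penalty grows with p, so the criterion tends to favour the simpler structure.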
63. Data-driven validation techniques
If no (e.g. linearity) assumptions are made, how can MISE be measured in a reliable way on a finite dataset? The most common techniques to return an estimate of MISE are
Testing: a testing sequence independent of DN and distributed according to the same probability distribution is used to assess the quality. In practice, unfortunately, an additional set of input/output observations is rarely available.
Holdout: the holdout method, sometimes called test sample estimation, partitions the data DN into two mutually exclusive subsets, the training set Dtr and the holdout or test set Dts.
k-fold Cross-validation: the set DN is randomly divided into k mutually
exclusive test partitions of approximately equal size. The cases not
found in each test partition are independently used for selecting the
hypothesis which will be tested on the partition itself. The average error
over all the k partitions is the cross-validated error rate.
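The k-fold procedure can be sketched as follows; the polynomial model family and the synthetic dataset are illustrative choices:

```python
import numpy as np

def kfold_mise(x, y, degree, K=5, seed=0):
    # K-fold cross-validation estimate of MISE for a polynomial model.
    N = len(x)
    idx = np.random.default_rng(seed).permutation(N)
    sq_err = np.empty(N)
    for test_idx in np.array_split(idx, K):              # K mutually exclusive partitions
        train_idx = np.setdiff1d(idx, test_idx)          # the other K - 1 parts
        coef = np.polyfit(x[train_idx], y[train_idx], degree)
        sq_err[test_idx] = (y[test_idx] - np.polyval(coef, x[test_idx]))**2
    return sq_err.mean()       # average error over all held-out predictions

rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, 60)
y = x**2 + rng.normal(0, 0.3, 60)
print({d: round(kfold_mise(x, y, d), 3) for d in (1, 2, 6)})
```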
64. The K-fold cross-validation
This is the algorithm in detail:
1. split the dataset DN into K roughly equal-sized parts;
2. for the kth part, k = 1, . . . , K, fit the model to the other K − 1 parts of the data, and calculate the prediction error of the fitted model when predicting the kth part;
3. do the above for k = 1, . . . , K and average the K estimates of the prediction error.
Let k(i) be the part of DN containing the ith sample. Then the cross-validation estimate of the MISE prediction error is

MISECV = (1/N) Σ_{i=1}^{N} (yi − ˆy_i^{−k(i)})² = (1/N) Σ_{i=1}^{N} (yi − ˆf(xi, α^{−k(i)}))²

where ˆy_i^{−k(i)} denotes the fitted value for the ith observation returned by the model estimated with the k(i)th part of the data removed.
65. 10-fold cross-validation
K = 10: at each iteration, 90% of the data are used for training and the remaining 10% for testing.
66. Leave-one-out cross validation
• The cross-validation algorithm with K = N is also called the leave-one-out algorithm.
• This means that for each ith sample, i = 1, . . . , N,
1. we carry out the parametric identification, leaving that observation out of the training set,
2. we compute the predicted value for the ith observation, denoted by ˆy_i^{−i}.
The corresponding estimate of the MISE prediction error is

MISELOO = (1/N) Σ_{i=1}^{N} (yi − ˆy_i^{−i})² = (1/N) Σ_{i=1}^{N} (yi − ˆf(xi, α^{−i}))²

where α^{−i} is the set of parameters returned by the parametric identification performed on the training set with the ith sample set aside.
67. Model selection
• Model selection concerns the final choice of the model structure
• By structure we mean:
• the family of the approximator (e.g. linear, nonlinear) and, if nonlinear, which kind of learner (e.g. neural networks, support vector machines, nearest neighbours, regression trees)
• the value of the hyperparameters (e.g. number of hidden layers and hidden nodes in NN, number of neighbours in KNN, number of levels in trees)
• number and set of input variables
• this choice is typically the result of a compromise between different
factors, like the quantitative measures, the personal experience of the
designer and the effort required to implement a particular model in
practice.
• Here we will consider only quantitative criteria. There are two possible approaches:
1. the winner-takes-all approach
2. the combination of estimators approach.
68. Model selection
[Diagram: a realization of the stochastic process provides the training set of size N; the structural identification loop ranges over the classes of hypotheses Λ1, Λ2, . . . , ΛS; for each class, parametric identification returns the parameters α¹N, . . . , αSN, and the validation scores G¹N, . . . , GSN drive the model selection]
69. Winner-takes-all
The best hypothesis is selected in the set {α^s_N}, s = 1, . . . , S, according to

˜s = arg min_{s=1,...,S} MISE^s

A model with complexity ˜s is trained on the whole dataset DN and used for future predictions.
70. Winner-takes-all pseudo-code
1. for s = 1, . . . , S (structural loop):
   • for j = 1, . . . , N
     (a) inner parametric identification (for l-o-o):
         α^s_{N−1} = arg min_{α∈Λs} Σ_{i=1:N, i≠j} (yi − ˆf(xi, α))²
     (b) ej = yj − ˆf(xj, α^s_{N−1})
   • MISELOO(s) = (1/N) Σ_{j=1}^{N} e²j
2. Model selection: ˜s = arg min_{s=1,...,S} MISELOO(s)
3. Final parametric identification: α^˜s_N = arg min_{α∈Λ˜s} Σ_{i=1}^{N} (yi − ˆf(xi, α))²
4. The output prediction model is ˆf(·, α^˜s_N)
71. Model combination
• The winner-takes-all approach is intuitively the approach which should
work the best.
• However, recent results in machine learning show that the performance
of the final model can be improved not by choosing the model structure
which is expected to predict the best but by creating a model whose
output is the combination of the output of models having different
structures.
• The reason is that in reality any chosen hypothesis ˆf(·, αN ) is only an
estimate of the real target and, like any estimate, is affected by a bias
and a variance term.
• Theoretical results on the combination of estimators show that the combination of unbiased estimators leads to an unbiased estimator with reduced variance.
• This principle is at the basis of approaches like bagging or boosting.
72. Feature selection problem
• Machine learning algorithms are known to degrade in performance
(prediction accuracy) when faced with many inputs (aka features) that
are not necessary for predicting the desired output.
• In the feature selection problem, a learning algorithm is faced with the
problem of selecting some subset of features upon which to focus its
attention, while ignoring the rest.
• Using all available features may negatively affect generalization
performance, especially in the presence of irrelevant or redundant
features.
• Feature selection can be seen as an instance of model selection
problem.
73. Benefits and drawbacks of feature selection
There are many potential benefits of feature selection:
• facilitating data visualization and data understanding,
• reducing the measurement and storage requirements,
• reducing training and utilization times of the final model,
• defying the curse of dimensionality to improve prediction performance.
Drawbacks are
• the search for a subset of relevant features introduces an additional layer
of complexity in the modelling task. The search in the model hypothesis
space is augmented by another dimension: the one of finding the
optimal subset of relevant features.
• additional time for learning.
74. Curse of dimensionality
• The error of the best model decreases with n, but the mean integrated squared error of models increases faster than linearly in n.
• In high dimensions, all data sets are sparse.
• In high dimensions, the number of possible models to consider increases super-exponentially in n.
• In high dimensions, all datasets show multicollinearity.
• As n increases, the amount of local data goes to zero.
• For a uniform distribution around a query point xq, the fraction of the data contained in a ball of radius r < 1 centered at xq grows like r^n.
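The r^n behavior is easy to verify empirically. The sketch below uses a sup-norm ball on [−1, 1]^n, an assumption made for simplicity, so that the expected local fraction is exactly r^n:

```python
import numpy as np

def local_fraction(n, r, N=20000, seed=0):
    # Fraction of N points, uniform on [-1, 1]^n, inside the sup-norm ball of
    # radius r around the origin; its expectation is exactly r**n for r < 1.
    X = np.random.default_rng(seed).uniform(-1, 1, (N, n))
    return np.mean(np.max(np.abs(X), axis=1) <= r)

for n in (1, 2, 10):
    print(n, local_fraction(n, 0.9), 0.9**n)   # the local data vanish as n grows
```

Even with a generous radius r = 0.9, only about 35% of the data remain local at n = 10.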
75. Local density for large n
[Plot: fraction of local points within radius r, for n = 1, 2, 3, 100]
The size of the neighborhood on which we can estimate local features of the
output (e.g. E[y|x]) increases with dimension n, making the estimation
coarser and coarser.
76. Methods of feature selection
The main approaches to feature selection are:
Filter methods: they are preprocessing methods. They attempt to assess the
merits of features from the data, ignoring the effects of the selected
feature subset on the performance of the learning algorithm. Examples
are methods that select variables by ranking them through compression
techniques (like PCA or clustering) or by computing correlation with the
output.
Wrapper methods: these methods assess subsets of variables according to
their usefulness to a given predictor. The method conducts a search for
a good subset using the learning algorithm itself as part of the evaluation
function. The problem boils down to a problem of stochastic state space
search. Examples are the stepwise methods proposed in linear regression analysis.
Embedded methods: they perform variable selection as part of the learning
procedure and are usually specific to given learning machines.
Examples are classification trees, random forests, and methods based on regularization techniques (e.g. the lasso).
78. Local modeling procedure
The learning of a local model in xq ∈ R^n can be summarized in these steps:
1. Compute the distance between the query xq and the training samples
according to a predefined metric.
2. Rank the neighbors on the basis of their distance to the query.
3. Select a subset of the k nearest neighbors according to the bandwidth
which measures the size of the neighborhood.
4. Fit a local model (e.g. constant, linear,...).
Each of the local approaches has one or more structural (or smoothing)
parameters that control the amount of smoothing performed.
Let us focus on the bandwidth selection.
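The four steps can be sketched with a constant local model; the Euclidean metric and the synthetic dataset are illustrative choices:

```python
import numpy as np

def local_constant_predict(X, y, xq, k):
    d = np.linalg.norm(X - xq, axis=1)   # 1. distances to the query
    order = np.argsort(d)                # 2. rank the neighbours
    nn = order[:k]                       # 3. bandwidth: keep the k nearest
    return y[nn].mean()                  # 4. fit a constant local model

rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, (200, 2))
y = X[:, 0] + X[:, 1]                    # noiseless additive target
xq = np.array([0.5, 0.5])
print(local_constant_predict(X, y, xq, k=10))   # close to 0.5 + 0.5 = 1
```

Here k plays the role of the bandwidth: the smoothing parameter whose selection is discussed next.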
79. The bandwidth trade-off: overfit
[Two plots: a local fit with too narrow a bandwidth around the query q, and the resulting prediction error e on the (x, y) data]
Too narrow bandwidth ⇒ overfitting ⇒ large prediction error e.
In terms of bias/variance trade-off, this is typically a situation of high variance.
80. The bandwidth trade-off: underfit
[Two plots: a local fit with too large a bandwidth around the query q, and the resulting prediction error e on the (x, y) data]
Too large a bandwidth ⇒ underfitting ⇒ large prediction error e.
In terms of bias/variance trade-off, this is typically a situation of high bias.
81. Bandwidth and bias/variance trade-off
[Plot: mean squared error vs. 1/bandwidth; the bias term dominates with many neighbours (underfitting) and the variance term dominates with few neighbours (overfitting)]
82. The PRESS statistic
• Cross-validation can provide a reliable estimate of the algorithm
generalization error but it requires the training process to be repeated K
times, which sometimes means a large computational effort.
• In the case of linear models there exists a powerful statistical procedure to compute the leave-one-out cross-validation measure at a reduced computational cost.
• It is the PRESS (Prediction Sum of Squares) statistic, a simple formula
which returns the leave-one-out (l-o-o) as a by-product of the
least-squares.
83. Leave-one-out for linear models
[Diagram: from the training set, either repeat N times "put the j-th sample aside, perform parametric identification on N−1 samples, test on the j-th sample" (leave-one-out), or perform a single parametric identification on N samples followed by the PRESS statistic]
The leave-one-out error can be computed in two equivalent ways: the slowest
way (on the right) which repeats N times the training and the test procedure;
the fastest way (on the left) which performs only once the parametric
identification and the computation of the PRESS statistic.
84. The PRESS statistic
• This allows a fast cross-validation without repeating N times the leave-one-out procedure. The PRESS procedure can be described as follows:
1. we use the whole training set to estimate the linear regression coefficients

ˆα = (X^T X)^{−1} X^T Y

2. this procedure is performed only once on the N samples and returns as a by-product the Hat matrix

H = X(X^T X)^{−1} X^T

3. we compute the residual vector e, whose jth term is ej = yj − x_j^T ˆα,
4. we use the PRESS statistic to compute e^loo_j as

e^loo_j = ej / (1 − Hjj)

where Hjj is the jth diagonal term of the matrix H.
85. The PRESS statistic
Thus, the leave-one-out estimate of the mean integrated squared error is:

MISELOO = (1/N) Σ_{i=1}^{N} ((yi − ˆyi) / (1 − Hii))²

Note that PRESS is not an approximation of the l-o-o error but simply a faster way of computing it.
86. Selection of the number of neighbours
• For a given query point xq, we can compute a set of predictions ˆyq(k) = x_q^T ˆα(k), together with a set of associated leave-one-out errors MISELOO(k), for a number of neighbours ranging in [kmin, kmax].
• If the selection paradigm, frequently called winner-takes-all, is adopted, the most natural way to extract a final prediction ˆyq consists in comparing the predictions obtained for each value of k on the basis of the classical mean square error criterion:

ˆyq = x_q^T ˆα(ˆk), with ˆk = arg min_k MISELOO(k)
87. Local Model combination
• As an alternative to the winner-takes-all paradigm, we can use a
combination of estimates.
• The final prediction of the value yq is obtained as a weighted average of
the best b models, where b is a parameter of the algorithm.
• Suppose the predictions ˆyq(k) and the l-o-o errors MISELOO(k) have been ordered, creating a sequence of integers {ki} so that MISELOO(ki) ≤ MISELOO(kj), ∀i < j. The prediction ˆyq is given by

ˆyq = ( Σ_{i=1}^{b} ζi ˆyq(ki) ) / ( Σ_{i=1}^{b} ζi )

where the weights are the inverse of the mean square errors: ζi = 1/MISELOO(ki).
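A minimal sketch of this weighted combination; the candidate predictions and l-o-o errors below are made-up numbers:

```python
import numpy as np

def combine(preds, loo_mises, b):
    # Weighted average of the b best local models, weights zeta_i = 1 / MISE_loo(k_i).
    order = np.argsort(loo_mises)[:b]            # the b best values of k
    zeta = 1.0 / np.asarray(loo_mises)[order]
    return float(np.sum(zeta * np.asarray(preds)[order]) / np.sum(zeta))

preds = [1.0, 1.2, 0.8, 2.0]                     # yq_hat(k) for candidate k values
mises = [0.10, 0.20, 0.40, 1.00]                 # associated MISE_loo(k)
print(combine(preds, mises, b=2))                # dominated by the two best models
```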
89. One step-ahead and iterated prediction
• Once a model of the embedding mapping is available, it can be used for
two objectives: one-step-ahead prediction and iterated prediction.
• In one-step-ahead prediction, the n previous values of the series are available and the forecasting problem can be cast in the form of a generic regression problem.
• In the literature, a number of supervised learning approaches have been used with success to perform one-step-ahead forecasting on the basis of historical data.
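Casting one-step-ahead forecasting as regression amounts to building the embedding pairs; a minimal sketch (the helper name `embed` is ours, not from the slides):

```python
import numpy as np

def embed(series, n):
    # Build supervised pairs from a series:
    # x_t = [phi_{t-1}, ..., phi_{t-n}],  y_t = phi_t.
    s = np.asarray(series)
    X = np.column_stack([s[n - 1 - i: len(s) - 1 - i] for i in range(n)])
    y = s[n:]
    return X, y

X, y = embed([1, 2, 3, 4, 5, 6], n=3)
print(X)   # rows: [3 2 1], [4 3 2], [5 4 3]
print(y)   # [4 5 6]
```

Any regression learner can then be trained on (X, y).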
90. One step-ahead prediction
[Diagram: the approximator ˆf receives the delayed values ϕt−1, ϕt−2, . . . , ϕt−n and outputs ϕt]
The approximator ˆf returns the prediction of the value of the time series at time t + 1 as a function of the n previous values (the rectangular box containing z^{−1} represents a unit delay operator, i.e., ϕt−1 = z^{−1} ϕt).
91. Nearest-neighbor one-step-ahead forecasts
[Plot: the series y, with the query pattern ending at ¯t − 1 and the most similar past pattern ending at ¯t − 11]
We want to predict at time ¯t − 1 the next value of the series y of order n = 6. The pattern {y¯t−16, y¯t−15, . . . , y¯t−11} is the most similar to the pattern {y¯t−6, y¯t−5, . . . , y¯t−1}. Then the prediction ˆy¯t = y¯t−10 is returned.
92. Multi-step ahead prediction
• The prediction of the value of a time series H > 1 steps ahead is called
H-step-ahead prediction.
• We classify the methods for H-step-ahead prediction according to two
features: the horizon of the training criterion and the single-output or
multi-output nature of the predictor.
93. Multi-step ahead prediction strategies
The most common strategies are
1. Iterated: the model predicts H steps ahead by iterating a
one-step-ahead predictor whose parameters are optimized to minimize
the training error on one-step-ahead forecast (one-step-ahead training
criterion).
2. Iterated strategy where parameters are optimized to minimize the
training error on the iterated htr-step-ahead forecast (htr-step-ahead
training criterion) where 1 < htr ≤ H.
3. Direct: the model makes a direct forecast at time t + h − 1, h = 1, . . . , H
by modeling the time series in a multi-input single-output form
4. DirRec: direct forecast, but the input vector is extended at each step with predicted values.
5. MIMO: the model returns a vectorial forecast by modeling the time
series in a multi-input multi-output form
94. Iterated (or recursive) prediction
• In the case of iterated prediction, the predicted output is fed back as
input for the next prediction.
• Here, the inputs consist of predicted values as opposed to actual
observations of the original time series.
• As the feedback values are typically distorted by the errors made by the
predictor in previous steps, the iterative procedure may produce
undesired effects of accumulation of the error.
• Low performance is expected on long-horizon tasks, since these are essentially models tuned with a one-step-ahead criterion, which is not capable of taking the long-term temporal behavior into account.
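The recursive feedback loop can be sketched in a few lines; the toy trend model is an illustrative assumption:

```python
import numpy as np

def recursive_forecast(series, model, n, H):
    # Iterate a one-step-ahead predictor H times, feeding predictions back.
    window = list(series[-n:])           # last n values, oldest first
    preds = []
    for _ in range(H):
        x = np.array(window[::-1])       # [phi_{t-1}, ..., phi_{t-n}]
        p = float(model(x))
        preds.append(p)
        window = window[1:] + [p]        # the prediction replaces the oldest input
    return preds

# Toy one-step model for a pure trend: phi_t = phi_{t-1} + 1.
model = lambda x: x[0] + 1
print(recursive_forecast([1.0, 2.0, 3.0, 4.0], model, n=2, H=3))   # [5.0, 6.0, 7.0]
```

With a less perfect one-step model, the errors fed back through `window` accumulate over the horizon, which is exactly the weakness noted above.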
95. Iterated prediction
[Diagram: the approximator ˆf feeds its own prediction ˆϕt back through the chain of unit delays]
The approximator ˆf returns the prediction of the value of the time series at time t + 1 by iterating the predictions obtained in the previous steps (the rectangular box containing z^{−1} represents a unit delay operator, i.e., ˆϕt−1 = z^{−1} ˆϕt).
96. Iterated with h-step training criterion
• This strategy adopts one-step-ahead predictors but adapts the model
selection criterion in order to take into account the multi-step-ahead
objective.
• Methods like Recurrent Neural Networks belong to such class. Their
recurrent architecture and the associated training algorithm (temporal
backpropagation) are suitable to handle the time-dependent nature of
the data.
• In [4] we proposed an adaptation of the Lazy Learning algorithm where the number of neighbours is optimized in order to minimize the leave-one-out error over a horizon larger than one. This technique ranked second in the 1998 KULeuven Time Series competition.
• A similar technique has been proposed by [14] who won the competition.
97. Conventional and iterated leave-one-out
[Diagram: a) conventional leave-one-out, where sample 3 is set aside and the one-step error e_cv(3) is computed; b) iterated leave-one-out, where the error e_it(3) is computed on predictions iterated up to sample 3]
98. Santa Fe time series A
[Plot: Santa Fe time series A, 1000 values; axes t and y]
The A chaotic time series has a training set of 1000 values: the task is to
predict the continuation for 100 steps, starting from different points.
101. Direct strategy
• The Direct strategy [22, 17, 7] learns independently H models fh
ϕt+h−1 = fh(ϕt−1, . . . , ϕt−n) + wt+h−1
with h ∈ {1, . . . , H} and returns a multi-step forecast by concatenating
the H predictions.
• Several machine learning models have been used to implement the
Direct strategy for multi-step forecasting tasks, for instance neural
networks [10], nearest neighbors [17] and decision trees [21].
• Since the Direct strategy does not use any approximated values to compute the forecasts, it is not prone to the accumulation of errors: each model is tailored to the horizon it is supposed to predict.
• Notwithstanding, it has some weaknesses.
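A sketch of the Direct strategy with H independently learned models; the linear least-squares learner and the trend series are illustrative assumptions, not the models cited above:

```python
import numpy as np

def direct_forecast(series, n, H, fit, predict):
    # Learn one model per horizon h and concatenate the H forecasts.
    s = np.asarray(series, float)
    x_last = s[-n:][::-1]                               # [phi_{t-1}, ..., phi_{t-n}]
    preds = []
    for h in range(1, H + 1):
        # Training pairs: embedding ending at t-1, target at t+h-1.
        X = np.array([s[t - n:t][::-1] for t in range(n, len(s) - h + 1)])
        y = s[n + h - 1:]
        preds.append(predict(fit(X, y), x_last))
    return preds

# Illustrative learner: linear least squares with an intercept.
fit = lambda X, y: np.linalg.lstsq(np.column_stack([X, np.ones(len(X))]), y, rcond=None)[0]
predict = lambda m, x: float(np.append(x, 1.0) @ m)
print(direct_forecast(np.arange(20.0), n=3, H=4, fit=fit, predict=predict))
```

Note the cost mentioned below: H separate fits, one per horizon.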
102. Direct strategy limitations
• Since the H models are learned independently, no statistical dependency between the predictions ˆϕt+h−1, h = 1, . . . , H [3, 5, 10] is guaranteed.
• Direct methods often require higher functional complexity [20] than
iterated ones in order to model the stochastic dependency between two
series values at two distant instants [9].
• This strategy demands a large computational time since the number of
models to learn is equal to the size of the horizon.
103. DirRec strategy
• The DirRec strategy [18] combines the architectures and the principles
underlying the Direct and the Recursive strategies.
• DirRec computes the forecasts with different models for every
horizon (like the Direct strategy) and, at each time step, it enlarges the
set of inputs by adding variables corresponding to the forecasts of the
previous step (like the Recursive strategy).
• Unlike the previous strategies, the embedding size n is not the same for all the horizons. In other terms, the DirRec strategy learns H models fh from the time series, where

ϕt+h−1 = fh(ϕt+h−2, . . . , ϕt−n) + wt+h−1

with h ∈ {1, . . . , H}.
• The technique is prone to the curse of dimensionality. The use of feature
selection is recommended for large h.
104. MIMO strategy
• This strategy [3, 5] (also known as Joint strategy [10]) avoids the
simplistic assumption of conditional independence between future
values made by the Direct strategy [3, 5] by learning a single
multiple-output model
[ϕt+H−1, . . . , ϕt] = F(ϕt−1, . . . , ϕt−n) + w

where F : R^n → R^H is a vector-valued function [15], and w ∈ R^H is a noise vector with a covariance that is not necessarily diagonal [13].
• The forecasts are returned in one step by a multiple-input
multiple-output regression model.
• In [5] we proposed a multi-output extension of the local learning
algorithm.
• Other multi-output regression models could be taken into consideration,
like multi-output neural networks or partial least squares.
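As a concrete illustration (not from the slides), a linear multiple-output model can be fit in a few lines; `mimo_forecast` is a hypothetical name, and least squares stands in for the local learning or neural approaches mentioned above:

```python
import numpy as np

def mimo_forecast(series, n, H):
    """MIMO strategy: a single multiple-output model maps the last n
    observations to all H future values at once, so the joint structure
    of the horizon is learned by one model."""
    series = np.asarray(series, dtype=float)
    X, Y = [], []
    for t in range(n, len(series) - H + 1):
        X.append(series[t - n:t])   # inputs: embedding vector
        Y.append(series[t:t + H])   # targets: the whole H-step window
    Xb = np.hstack([np.array(X), np.ones((len(X), 1))])
    # lstsq solves all H outputs jointly; coef has shape (n + 1, H)
    coef, *_ = np.linalg.lstsq(Xb, np.array(Y), rcond=None)
    return (np.append(series[-n:], 1.0) @ coef).tolist()
```

A single call returns the whole H-step forecast, in contrast with the H separate models of the Direct strategy.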
105. Time series dependencies
[Figure: dependency graph over ϕt−1, ϕt, ϕt+1, ϕt+2, ϕt+3]
n = 2 NAR dependency: ϕt = f(ϕt−1, ϕt−2) + w(t).
106. Iterated modeling of dependencies
[Figure: dependencies among ϕt−1, ϕt, ϕt+1, ϕt+2, ϕt+3 as modeled by the iterated strategy]
107. Direct modeling of dependencies
[Figure: dependencies among ϕt−1, ϕt, ϕt+1, ϕt+2, ϕt+3 as modeled by the Direct strategy]
108. MIMO strategy
• The rationale of the MIMO strategy is to model, between the predicted
values, the stochastic dependency characterizing the time series. This
strategy avoids the conditional independence assumption made by the
Direct strategy as well as the accumulation of errors which plagues the
Recursive strategy.
• So far, this strategy has been successfully applied to several real-world
multi-step time series forecasting tasks [3, 5, 19, 2].
• However, the wish to preserve the stochastic dependencies constrains
all the horizons to be forecast with the same model structure. Since
this constraint could reduce the flexibility of the forecasting
approach [19], a variant of the MIMO strategy (called DIRMO) has been
proposed in [19, 2].
• Extensive validation on the 111 time series of the NN5 competition
showed that MIMO strategies are consistently better than single-output
approaches.
109. Validation of time series methods
• The huge variety of strategies and algorithms that can be used to infer a
predictor from observed data calls for a rigorous procedure of
comparison and assessment.
• Assessment demands benchmarks and a benchmarking procedure.
• Benchmarks can be defined by using
• Simulated data obtained by simulating AR, NAR and other stochastic
processes. This is particularly useful for validating theoretical
properties in terms of bias/variance.
• Public domain benchmarks, like the one provided by Time Series
Competitions.
• Real measured data
110. Competitions
• Santa Fe Time Series Prediction and Analysis Competition (1994) [22]
• International Workshop on Advanced Black-box Techniques for Nonlinear
Modeling Competition (Leuven, Belgium; 1998)
• NN3 competition [8]: 111 monthly time series drawn from a homogeneous
population of empirical business time series.
• NN5 competition [1]: 111 time series of the daily withdrawal amounts
from independent cash machines at different, randomly selected
locations across England.
• Kaggle competition.
111. Accuracy measures
Let
et+h = ϕt+h − ˆϕt+h
represent the error of the forecast ˆϕt+h at the horizon h = 1, . . . , H. A
conventional measure of accuracy is the Normalized Mean Squared Error
NMSE = ( Σ_{h=1}^{H} (ϕt+h − ˆϕt+h)² ) / ( Σ_{h=1}^{H} (ϕt+h − ˆµ)² )
This quantity is smaller than one if the predictor performs better than the
naivest predictor, i.e. the average ˆµ.
Other measures rely on the relative or percentage error
pet+h = 100 · (ϕt+h − ˆϕt+h) / ϕt+h
like
MAPE = ( Σ_{h=1}^{H} |pet+h| ) / H
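These measures translate directly into code. The sketch below is illustrative, not from the slides, and assumes ˆµ is the mean of the actual values over the evaluated horizon (the slide leaves this implicit):

```python
def nmse(actual, forecast):
    """Normalized MSE: < 1 means the forecast beats the naive mean
    predictor mu_hat (assumed here to be the mean of `actual`)."""
    mu = sum(actual) / len(actual)
    num = sum((a - f) ** 2 for a, f in zip(actual, forecast))
    den = sum((a - mu) ** 2 for a in actual)
    return num / den

def mape(actual, forecast):
    """Mean absolute percentage error over the horizon
    (undefined when an actual value is zero)."""
    return sum(abs(100.0 * (a - f) / a)
               for a, f in zip(actual, forecast)) / len(actual)
```

For example, forecasting [2, 3, 4, 5] against actual values [1, 2, 3, 4] gives an NMSE of 0.8 (slightly better than the mean predictor) but a large MAPE, since the small actual values inflate the percentage errors.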
112. Applications in my lab
113. MLG projects on forecasting
• Low-energy streaming of wireless sensor data
• Decision support in anesthesia
• Side-channel attack
114. Low-energy streaming of wireless sensor data
• In monitoring applications, an approximation of the sensor readings is
often sufficient (e.g. ±0.5°C, ±2% humidity). In this context a Dual Prediction
Scheme is effective.
• A sensor node is provided with a time series prediction model
ϕt = f(ϕt−1, α) (e.g. autoregressive models) and a learning method for
identifying the best set of parameters α (e.g. recursive least squares).
• The sensor node then sends the parameters of the model instead of the
actual data to the recipient. The recipient node then runs the models to
reconstruct an approximation of the data stream collected on the distant
node.
• The sensor node also runs the prediction model. When its predictions
diverge by more than ±ε from the actual reading, a new model is sent
to the recipient.
• This reduces the communication effort if an appropriate model is
run by the sensor node.
• Since the cost of transmission is much larger than the cost of
computation, in realistic situations this scheme saves power.
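The scheme can be sketched with the simplest possible model, a constant predictor (an illustrative assumption made here for brevity; the slides mention autoregressive models fitted by recursive least squares):

```python
def dual_prediction_transmissions(readings, eps):
    """Dual Prediction Scheme with a constant model: the node transmits a
    new 'model' (here just the latest reading) only when the prediction
    shared with the recipient drifts by more than +/- eps from the actual
    measurement. Returns the list of transmitted values."""
    sent = []
    prediction = None
    for r in readings:
        if prediction is None or abs(r - prediction) > eps:
            prediction = r      # send new model; both sides now predict r
            sent.append(r)
        # otherwise: the recipient's running model is still within eps,
        # so nothing needs to be transmitted
    return sent
```

With readings [20.0, 20.4, 21.1, 25.0, 25.3] and eps = 1.0, only three of the five samples trigger a transmission, which is the communication saving the slide describes.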
115. Low-energy streaming of wireless sensor data
[Figure: temperature (°C, 20–45) vs. time (hour, 0–20); piecewise-constant model with accuracy 2°C, dots marking the actual readings]
116. Adaptive model selection
Tradeoff: more complex models predict the measurements better, but have a
higher number of parameters to transmit.
[Figure: communication costs and model error as a function of model complexity]
AR(p) : ˆsi[t] = Σ_{j=1}^{p} θj si[t − j]
117. Adaptive Model Selection
We proposed an Adaptive Model Selection strategy [11] that
• takes into account the cost of sending model parameters,
• allows sensor nodes to determine autonomously the model which best
fits their measurements,
• provides a statistically sound selection mechanism to discard poorly
performing models,
• achieved about 45% communication savings on average in experimental
results,
• was implemented in TinyOS, the reference operating system for wireless
sensor networks.
118. Predictive modeling in anesthesiology
• During surgery, the anesthesiologist controls the depth of anesthesia by
means of various types of drugs.
• Anesthesiologists observe the patient state through an unconsciousness
signal collected in real time by monitors connected via electrodes to the
patient's forehead.
• The bispectral index (BIS) monitor (by Aspect Medical Systems) returns a
single dimensionless number between 0 and 100, where 0 corresponds to EEG
silence, 100 is the expected value for a fully awake adult, and values
between 40 and 60 indicate the recommended level.
• It remains difficult for the anesthesiologist, especially if inexperienced, to
predict how the BIS signal could vary after a change in the administered
anesthetic agents. This is generally due to the inter-individual variability
problem.
• We designed a ML system [6] to predict multi-step-ahead the evolution
of the BIS on the basis of historical data.
119. Predictive modeling in anesthesiology
121. Side channel attack
• In cryptography, a side channel attack is any attack based on the
analysis of measurements related to the physical implementation of a
cryptosystem.
• Side channel attacks take advantage of the fact that instantaneous
power consumption, encryption time or/and electromagnetic leaks of a
cryptographic device depend on the processed data and the performed
operations.
• Power analysis attacks are an instance of side-channel attacks which
assume that different encryption keys imply different power
consumptions.
• Nowadays, the possibility of collecting a large amount of power traces
(i.e. time series) paves the way to the adoption of machine learning
techniques.
• Effective side channel attacks enable the design of effective countermeasures.
122. SCA and time series classification
• The power consumption of a crypto device using a secret key
Q ∈ {0, 1}^k (size k = 8) can be modeled as a time series T^(Q) of order n
T^(Q)_{t+1} = f(T^(Q)_t, T^(Q)_{t−1}, . . . , T^(Q)_{t−n}) + ε
• For each key Qj we infer a predictive model ˆfQj [12] such that
T^(Qj)_{t+1} = ˆfQj(T^(Qj)_t, T^(Qj)_{t−1}, . . . , T^(Qj)_{t−n}) + ε
• These models can be used to classify an unlabeled time series T and
predict the associated key by computing, for each Qj, the distance
D(Qj, T) = (1 / (N − n + 1)) Σ_{t=n}^{N} ( ˆfQj(T_{t−1}, T_{t−2}, . . . , T_{t−n−1}) − T_t )²
and choosing the key minimizing it
ˆQ = arg min_{j ∈ [0, 2^k − 1]} D(Qj, T)
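The classification rule above can be sketched as follows (an illustrative Python sketch, not from the paper; the fitted per-key models ˆfQj are passed in as plain callables taking a window of past samples):

```python
def classify_key(trace, models, n):
    """Side-channel key recovery as time series classification: choose the
    key whose one-step-ahead model gives the smallest mean squared
    prediction error D(Qj, T) on the unlabeled power trace."""
    distances = []
    for f in models:  # one fitted predictor per candidate key Qj
        errs = [(f(trace[t - n:t]) - trace[t]) ** 2
                for t in range(n, len(trace))]
        distances.append(sum(errs) / len(errs))
    # arg min over the candidate keys
    return min(range(len(models)), key=lambda j: distances[j])
```

For instance, with two toy models, one predicting "same as previous sample" and one predicting "previous plus one", a strictly increasing trace is assigned to the second model, which mirrors how a trace is matched to the key whose consumption model explains it best.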
124. Open-source software
Many commercial solutions exist but only open-source software can cope with
• fast integration of new algorithms
• portability over several platforms
• new paradigms of data storage (e.g. Hadoop)
• integration with different data formats and architectures
A de-facto standard in computational statistics, machine learning,
bioinformatics, geostatistics and more general analytics is nowadays
the R environment. Highly recommended!
125. All that we didn’t have time to discuss
• ARIMA models
• GARCH models
• Frequency-domain representation
• Nonstationarity
• Vectorial time series
• Spatio-temporal time series
• Time series classification
126. Suggestions for PhD topics
• Time series and big data
• streaming data (environmental data)
• large vectorial time series (weather data)
• Spatio-temporal time series and graphical models
• Beyond cross-validation for model/input selection
• Long term forecasting (effective integration of iterated and direct
models)
• Causality and time-series
• Scalable machine learning
Suggestion: use methods and models to solve problems... not problems to
sanctify methods or models...
127. Conclusion
• Popper claimed that a theory is scientific if it is falsifiable (i.e. it can
be contradicted by an observation or the outcome of a physical
experiment). Since prediction is the most falsifiable aspect of science,
it is also the most scientific one.
• Effective machine learning is an extension of statistics, in no way an
alternative.
• Simplest (i.e. linear) model first.
• Local learning techniques represent an effective trade-off between
linearity and nonlinearity.
• Modelling is more an art than an automatic process... hence experienced
data analysts are more valuable than expensive tools.
• Expert knowledge matters..., data too
• Understanding what is predictable is as important as trying to predict it.
128. Forecasting is difficult, especially the future
• "Computers in the future may weigh no more than 1.5 tons." –Popular
Mechanics, forecasting the relentless march of science, 1949
• "I think there is a world market for maybe five computers." –Chairman of
IBM, 1943
• "Stocks have reached what looks like a permanently high plateau."
–Professor of Economics, Yale University, 1929.
• "Airplanes are interesting toys but of no military value." –Professor of
Strategy, Ecole Superieure de Guerre.
• "Everything that can be invented has been invented." –Commissioner,
U.S. Office of Patents, 1899.
• "Louis Pasteur’s theory of germs is ridiculous fiction". –Professor of
Physiology at Toulouse, 1872
• "640K ought to be enough for anybody." – Bill Gates, 1981
129. References
[1] R. R. Andrawis, A. F. Atiya, and H. El-Shishiny. Forecast combinations of computational intelligence and linear models for the NN5 time series forecasting competition. International Journal of Forecasting, January 2011.
[2] S. Ben Taieb, A. Sorjamaa, and G. Bontempi. Multiple-output modelling for multi-step-ahead forecasting. Neurocomputing, 73:1950–1957, 2010.
[3] G. Bontempi. Long term time series prediction with multi-input multi-output local learning. In Proceedings of the 2nd European Symposium on Time Series Prediction (TSP), ESTSP08, pages 145–154, Helsinki, Finland, February 2008.
[4] G. Bontempi, M. Birattari, and H. Bersini. Local learning for iterated time-series prediction. In I. Bratko and S. Dzeroski, editors, Machine Learning: Proceedings of the Sixteenth International Conference, pages 32–38, San Francisco, CA, 1999. Morgan Kaufmann Publishers.
[5] G. Bontempi and S. Ben Taieb. Conditionally dependent strategies for multiple-step-ahead prediction in local learning. International Journal of Forecasting, 27(3):689–699, 2011.
[6] O. Caelen, G. Bontempi, and L. Barvais. Machine learning techniques for decision support in anesthesia. In Artificial Intelligence in Medicine, pages 165–169. Springer Berlin Heidelberg, 2007.
[7] H. Cheng, P.-N. Tan, J. Gao, and J. Scripps. Multistep-ahead time series prediction. In PAKDD, pages 765–774, 2006.
[8] S. F. Crone, M. Hibon, and K. Nikolopoulos. Advances in forecasting with neural networks? Empirical evidence from the NN3 competition on time series prediction. International Journal of Forecasting, 27, 2011.
[9] M. Guo, Z. Bai, and H. Z. An. Multi-step prediction for nonlinear autoregressive models based on empirical distributions. Statistica Sinica, pages 559–570, 1999.
[10] D. M. Kline. Methods for multi-step time series forecasting with neural networks. In G. Peter Zhang, editor, Neural Networks in Business Forecasting, pages 226–250. Information Science Publishing, 2004.
[11] Y.-A. Le Borgne, S. Santini, and G. Bontempi. Adaptive model selection for time series prediction in wireless sensor networks. Signal Processing, 87(12):3010–3020, 2007.
[12] L. Lerman, G. Bontempi, S. Ben Taieb, and O. Markowitch. A time series approach for profiling attack. In SPACE, pages 75–94, 2013.
[13] J. M. Matías. Multi-output nonparametric regression. In C. Bento, A. Cardoso, and G. Dias, editors, EPIA, volume 3808 of Lecture Notes in Computer Science, pages 288–292. Springer, 2005.
[14] J. McNames. A nearest trajectory strategy for time series prediction. In Proceedings of the International Workshop on Advanced Black-Box Techniques for Nonlinear Modeling, pages 112–128, Belgium, 1998. K.U. Leuven.
[15] C. A. Micchelli and M. A. Pontil. On learning vector-valued functions. Neural Computation, 17(1):177–204, 2005.
[16] T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
[17] A. Sorjamaa, J. Hao, N. Reyhani, Y. Ji, and A. Lendasse. Methodology for long-term prediction of time series. Neurocomputing, 70(16-18):2861–2869, October 2007.
[18] A. Sorjamaa and A. Lendasse. Time series prediction using DirRec strategy. In M. Verleysen, editor, ESANN06, European Symposium on Artificial Neural Networks, pages 143–148, Bruges, Belgium, April 26-28 2006.
[19] S. Ben Taieb, G. Bontempi, A. Sorjamaa, and A. Lendasse. Long-term prediction of time series by combining direct and MIMO strategies. International Joint Conference on Neural Networks, 2009.
[20] H. Tong. Threshold Models in Nonlinear Time Series Analysis. Springer Verlag, Berlin, 1983.
[21] V. T. Tran, B.-S. Yang, and A. C. C. Tan. Multi-step ahead direct prediction for the machine condition prognosis using regression trees and neuro-fuzzy systems. Expert Systems with Applications, 36(5):9378–9387, 2009.
[22] A. S. Weigend and N. A. Gershenfeld. Time Series Prediction: Forecasting the Future and Understanding the Past. Addison Wesley, Harlow, UK, 1994.