Collective Traffic Forecasting
1 Introduction
Markov logic [6] integrates first-order logic with probabilistic graphical models,
providing a formalism which allows us to describe a domain in terms of logic
predicates and probabilistic formulae. While a first-order knowledge base can
be seen as a set of hard constraints over possible worlds (or Herbrand interpre-
tations), where a world violating even a single formula has zero probability, in
Markov logic such a world would be less probable, but not impossible. Formally,
a Markov logic network (MLN) is defined by a set of first-order logic formulae
F = {F1 , . . . , Fn } and a set of constants C = {C1 , . . . , Ck }. A Markov random
field is then created by introducing a binary node for each possible ground atom
and an edge between two nodes if the corresponding atoms appear together in a
ground formula. Uncertainty is handled by attaching a real-valued weight wj to
each formula Fj : the higher the weight, the lower the probability of a world violat-
ing that formula, other things being equal. In the discriminative setting, MLNs
essentially define a template for arbitrary (non linear-chain) conditional random
fields that would be hard to specify and maintain if hand-coded. The language
of first-order logic, in fact, allows us to describe relations and inter-dependencies
between the different domain objects in a straightforward way. In this paper,
we are interested in the supervised learning setting. In Markov logic, the usual
distinction between the input and output portions of the data is reflected in the
distinction between evidence and query atoms. In this setting, an MLN defines
a conditional probability distribution of query atoms Y given evidence atoms
X, expressed as a log-linear model in a feature space described by all possible
groundings of each formula:
\[
P(Y = y \mid X = x) = \frac{\exp\left(\sum_{F_i \in F_Y} w_i\, n_i(x, y)\right)}{Z_x} \qquad (1)
\]
where FY is the set of clauses involving query atoms and ni (x, y) is the number
of groundings of formula Fi satisfied in world (x, y). Note that the feature space
jointly involves X and Y as in other approaches to structured output learning.
MAP inference in this setting allows us to collectively predict the truth value
of all query ground atoms: f (x) = y ∗ = arg maxy P (Y = y|X = x). Solving the
MAP inference problem is known to be intractable but even if we could solve
it exactly, the prediction function f is still linear in the feature space induced
by the logic formulae. Hence, a crucial ingredient for obtaining an expressive
model (which often means an accurate model) is the ability to tailor the feature
space to the problem at hand. For some problems, this space needs to be
high-dimensional. For example, it is well known that linear chain conditional
random fields (which we can see as a special case of discriminative MLNs), often
work better in practice when using high-dimensional feature spaces. However,
the logic language behind MLNs only offers a limited ability for controlling the
size of the feature space. We will explain this using the following example. Sup-
pose we have a certain query predicate of interest, Query(t, s) (where, e.g., the
variables t and s represent time and space) that we know to be predictable from
a certain set of attributes, one for each (t, s) pair, represented by the evidence
predicate Attributes(t, s, a1 , a2 , . . . , an ). Also, suppose that performance for
this hypothetical problem crucially depends, for each t and s, on our ability
to define a nonlinear mapping between the attributes and the query. To fix
our ideas, imagine that an SVM with RBF kernel taking a1 , a2 , . . . , an as inputs
(treating each (s, t) pair as an independent example) already produces a good
classifier, while a linear classifier fails. Finally, suppose we have some available
background knowledge, which might help us to write formulae introducing sta-
tistical interdependencies between different query ground atoms (at different t
and s), thus giving us a potential advantage in using a non-iid classifier for this
problem. An MLN would be a good candidate for solving such a problem, but
emulating the already good feature space induced by the RBF kernel may be
tricky. One possibility for producing a very high dimensional feature space is to
define a feature for each possible configuration of the attributes. This can be
achieved by writing several ground formulae with different associated weights.
For this purpose, in the Alchemy system, one might write an expression like

  Attributes(t, s, +a1, +a2, ..., +an) => Query(t, s)

where the + symbol preceding some of the variables expands the expression into
separate formulae resulting from the possible combination of constants from
those variables. Different weights are attached to each formula in the resulting
expansion. Yet, this solution presents two main limitations: first, the number
of parameters of the MLN grows exponentially with the number of variables in
the formula; second, if some of the attributes ai are continuous, they need to be
discretized in order to be used within the model.
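To make the log-linear form of Eq. (1) and MAP inference concrete, here is a toy Python sketch that scores every joint assignment of a few query ground atoms by brute force; the formulas, weights, and counting functions are invented for the example and are not part of the paper.

```python
import itertools
import math

# Toy MLN ground model: 3 query ground atoms and 2 weighted formulas.
# Each formula is a pair (w_i, n_i), where n_i(y) returns the number of
# satisfied groundings for a truth assignment y (evidence x is folded in).
query_atoms = ["Q1", "Q2", "Q3"]

formulas = [
    (1.5, lambda y: 1 if y["Q1"] == y["Q2"] else 0),  # "Q1 and Q2 tend to agree"
    (0.7, lambda y: 1 if y["Q3"] else 0),             # "Q3 tends to be true"
]

def score(y):
    """Unnormalized log-probability: sum_i w_i * n_i(x, y), as in Eq. (1)."""
    return sum(w * n(y) for w, n in formulas)

# Brute-force MAP inference: argmax_y P(y | x) over all joint assignments.
assignments = [dict(zip(query_atoms, vals))
               for vals in itertools.product([False, True], repeat=len(query_atoms))]
log_z = math.log(sum(math.exp(score(y)) for y in assignments))
y_map = max(assignments, key=score)
print("MAP state:", y_map, "with P(y*|x) =", math.exp(score(y_map) - log_z))
```

With realistic numbers of ground atoms this enumeration is of course infeasible, which is why approximate MAP solvers such as MaxWalkSat are used in practice.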
GS-MLNs [12] allow us to use weights that depend on the specific grounding
of a formula, even if the number of possible groundings can in principle grow
exponentially or can be unbounded in the case of real-valued constants. Under this
model, we can write formulae of the kind:

  Attributes(t, s, $v) => Query(t, s)

where v has the type of an n-dimensional real vector, and the $ symbol indi-
cates that the weight of the formula is a parameterized function of the specific
constant substituted for the variable v. In our approach, the function is realized
by a discriminative classifier, such as a neural network with adjustable parame-
ters θ. The idea of integrating non-linear classifiers like neural networks within
conditional random fields has also been proposed recently in conditional neural
fields [14].
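As a minimal numpy sketch of such a grounding-specific weight function (not the authors' implementation), a one-hidden-layer network can map the real-valued constant vector v of a ground formula to its weight w(v; θ):

```python
import numpy as np

class GroundingSpecificWeight:
    """w(v; theta) = tanh(v @ W1 + b1) @ W2 + b2 -- a tiny MLP that turns the
    real-valued constant of a ground formula into that grounding's weight."""

    def __init__(self, n_inputs, n_hidden=5, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.1, size=(n_inputs, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(scale=0.1, size=n_hidden)
        self.b2 = 0.0

    def __call__(self, v):
        h = np.tanh(v @ self.W1 + self.b1)
        return float(h @ self.W2 + self.b2)

# One network per grounding-specific formula; e.g. 12 speed measurements
# (3 hours at 15-minute resolution) could form the constant vector v.
w_fn = GroundingSpecificWeight(n_inputs=12)
v = np.random.rand(12)  # hypothetical feature vector for one grounding
print("weight of this grounding:", w_fn(v))
```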
In MLNs with grounding-specific weights, the conditional probability of query
atoms given evidence can therefore be rewritten as follows:
\[
P(Y = y \mid X = x) = \frac{\exp\left(\sum_{F_i \in F_Y} \sum_j w_i(c_{ij}, \theta_i)\, n_{ij}(x, y)\right)}{Z_x} \qquad (2)
\]
where cij is the constant substituted in the j-th grounding of formula Fi and nij(x, y) indicates whether that grounding is satisfied in world (x, y). Weights are learned discriminatively by maximizing the conditional log-likelihood, whose derivatives with respect to the weights are

\[
\frac{\partial \log P(y \mid x)}{\partial w_i} = n_i(x, y) - \sum_{y'} P(y' \mid x)\, n_i(x, y') = n_i(x, y) - \mathbb{E}_w\left[n_i(x, y)\right]
\]

which are usually approximated with the counts in the MAP state y*:

\[
\frac{\partial \log P(y \mid x)}{\partial w_i} \simeq n_i(x, y) - n_i(x, y^*)
\]
From the above equation, we see that if all the groundings of a formula Fi are
correctly assigned their truth values in the MAP state y*, then that formula gives
a zero contribution to the gradient, because ni(x, y) = ni(x, y*). For grounding-
specific formulae, each grounding corresponds to a different example for the
neural network: therefore, there will be no backpropagation term for a given
example if the truth value of the corresponding atom has been correctly assigned
by the collective inference.
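A minimal sketch of this selective update, with hypothetical count arrays (in the real system the true counts come from the observed interpretation and the MAP counts from MaxWalkSat's assignment):

```python
import numpy as np

def mln_weight_gradient(n_true, n_map):
    """Approximate gradient of the conditional log-likelihood w.r.t. the weight
    of each ordinary formula: n_i(x, y) - n_i(x, y*)."""
    return np.asarray(n_true) - np.asarray(n_map)

def groundings_needing_backprop(y_true, y_map):
    """For a grounding-specific formula, return the groundings whose query atom
    was misassigned in the MAP state: only these yield a backpropagation term
    for the neural network that computes the grounding's weight."""
    return [g for g in y_true if y_true[g] != y_map[g]]

# toy example with three ordinary formulas
print(mln_weight_gradient([4, 2, 7], [4, 3, 6]))      # -> [ 0 -1  1]
# toy example with two groundings of a grounding-specific formula
print(groundings_needing_backprop({"g1": True, "g2": False},
                                  {"g1": True, "g2": True}))  # -> ['g2']
```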
When learning from many independent interpretations, it is possible to split
the data set into minibatches and apply stochastic gradient descent [2]. Basically
this means that gradients of the likelihood are only computed for small batches
of interpretations and weights (both for the MLN and for the neural networks)
are updated immediately, before working with the subsequent interpretations.
Stochastic gradient descent can be more generally applied to minibatches con-
sisting of the connected components of the Markov random field generated by
the MLN. This trick is inspired by a common practice when training neural
networks and can significantly speed up training time.
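A sketch of the minibatch construction, assuming the ground Markov random field is given as a sparse adjacency matrix over the query ground atoms (the data below are toy values):

```python
import random
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def minibatches_from_mrf(adjacency, batch_size=10, seed=0):
    """Split the ground Markov random field into connected components and pack
    shuffled components into minibatches for stochastic gradient descent."""
    n_comp, labels = connected_components(adjacency, directed=False)
    components = [np.flatnonzero(labels == c) for c in range(n_comp)]
    random.Random(seed).shuffle(components)
    for start in range(0, len(components), batch_size):
        yield components[start:start + batch_size]

# toy MRF over 6 ground atoms with three components: {0, 1, 2}, {3, 4}, {5}
rows, cols = [0, 1, 3], [1, 2, 4]
adj = csr_matrix((np.ones(3), (rows, cols)), shape=(6, 6))
for batch in minibatches_from_mrf(adj, batch_size=2):
    # compute likelihood gradients on this batch only, then immediately
    # update both the MLN weights and the neural network parameters
    print([atoms.tolist() for atoms in batch])
```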
Traffic measurements are gathered by inductive loop detectors, which monitor quantities such as the flow, occupancy, and speed of the vehicles. The PeMS infrastructure collects filtered and aggregated
flow and occupancy from single loop detectors, and provides an estimate of the
speed [9] and other derived quantities. In some locations, a double loop detector
is used to directly measure the instantaneous speed of the vehicles. All traffic
detectors report measurements every 30 seconds.
In our experiments, the goal is to predict whether the average speed at a
certain time in the future falls below a certain threshold. This is the measurement
employed by Google Maps (http://maps.google.com) for the coloring scheme encoding the different
levels of traffic congestion: the yellow code, for example, means that the average
speed is below 50 mph, which is the threshold adopted in all our experiments.
In our case study, we focused on seven locations in the area of East Los
Angeles (see Figure 1), five of which are placed on the I10 Highway (direction
West), one on the I5 (direction South) and one on the I710 (direction South)
(see Table 1). We aggregated the available raw data into 15-minute samples,
averaging the measurements taken on the different lanes. In all our experiments
we used the previous three hours of measurements as the input portion of the
data. For all considered locations we predict traffic congestion at the next four
lead times (i.e., 15, 30, 45 and 60 minutes ahead). Thus each interpretation spans
a time interval of four hours. We used two months of data (Jan-Feb 2008) as
[Figure 2 here: correlation matrix of the 28 congestion variables; rows and columns ordered by flow direction as E, F, D, C, A, G, B; correlation values range from about 0.40 to 0.96.]
Fig. 2. Spatiotemporal correlations in the training set data. There are 28 boolean con-
gestion variables corresponding to 7 measurement stations and 4 lead times. Rows and
columns are lexicographically sorted on the station-lead time pair. With the exception
of station E, spatial correlations among nearby stations are very strong and we can
observe the spatiotemporal propagation of the congestion state along the direction of
flow (traffic is westbound).
training set, one month (Mar 2008) as tuning set, and two months (Apr-May
2008) for test. Time intervals of four hours containing missing measurements
due to temporary faults in the sensors were discarded from the data set.
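The preprocessing described above can be sketched with pandas; the DataFrame layout, column names, and the build_interpretations helper are illustrative assumptions, not the authors' code.

```python
import pandas as pd

SPEED_THRESHOLD = 50.0  # mph, the congestion threshold used throughout the paper

def build_interpretations(raw, lead_steps=(1, 2, 3, 4), history_steps=12):
    """raw: DataFrame with a DatetimeIndex and one speed column per lane
    (30-second samples). Returns one row per forecasting session: 12 past
    15-minute speed averages as inputs, 4 future congestion flags as targets."""
    # average over lanes, then aggregate 30-second samples into 15-minute means
    speed = raw.mean(axis=1).resample("15min").mean()
    frames = {}
    for j in range(history_steps):
        frames[f"speed_past_{j + 1}"] = speed.shift(j)
    for h in lead_steps:
        frames[f"speed_lead_{h}"] = speed.shift(-h)
    # discard sessions with missing measurements (temporary sensor faults)
    data = pd.DataFrame(frames).dropna()
    for h in lead_steps:
        data[f"congestion_lead_{h}"] = data.pop(f"speed_lead_{h}") < SPEED_THRESHOLD
    return data

# usage (one call per station): sessions = build_interpretations(raw_speed_df)
```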
The inter-dependencies between nodes which are close in the transportation
network are evident from the simple correlation diagram shown in Figure 2.
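A minimal pandas sketch of how such a correlation diagram can be computed, assuming a boolean DataFrame with one column per (station, lead time) pair; the column naming is illustrative.

```python
import pandas as pd

def congestion_correlation_matrix(congestion: pd.DataFrame) -> pd.DataFrame:
    """congestion: one row per training interpretation, one boolean column per
    (station, lead-time) pair, e.g. "A_15", "A_30", ..., "G_60" (28 columns).
    Returns a 28 x 28 Pearson correlation matrix analogous to Figure 2."""
    return congestion.astype(float).corr()
```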
The GS-MLN model was trained under the learning from interpretations setting.
An interpretation in this case corresponds to a typical forecasting session, where
at time t we want to forecast the congestion state of the network at future
lead times, given previous measurements. Hence interpretations are indexed by
their time stamp t, which is therefore omitted in all formulae (the temporal
index h in the formulae below refers to the time lead of the prediction, i.e. 1,2,3,
and 4 for 15,30,45, and 60 minutes ahead). Interpretations are assumed to be
independent, and this essentially follows the setting of other supervised learning
approaches such as [24, 18, 17]. However, in our approach congestion states at
different lead times and at different sites are predicted collectively. Dependencies
are introduced by spatiotemporal neighborhood rules relating the congestion states at nearby stations and at consecutive lead times.
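As a purely illustrative sketch of such a rule (the predicate names Congestion and Follows are hypothetical and need not coincide with those of the actual knowledge base), one could state that congestion at a station propagates to the following station along the flow direction at the next lead time:

\[
\mathrm{Congestion}(s_1, h) \wedge \mathrm{Follows}(s_1, s_2) \Rightarrow \mathrm{Congestion}(s_2, h+1)
\]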
The MLN contained 14 formulae in the background knowledge and 125 param-
eters after grounding variables prefixed by a +. The 28 neural networks had 12
continuous inputs and 5 hidden units each, yielding about 2000 parameters in to-
tal. Our software implementation is a modified version of the Alchemy system,
extended to incorporate neural networks as pluggable components. Inference was performed
by the MaxWalkSat algorithm. Twenty epochs of stochastic gradient ascent were
performed, with a learning rate of 0.03 for the MLN weights and µ = 0.00003/n
for the neural networks, where n is the number of misclassifications in the current
minibatch. In order to further speed up the training procedure, all neural
networks were pre-trained for a few epochs (using the congestion state as the
target) before plugging them into the GS-MLN and jointly tuning the whole set
of parameters.
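A sketch of the pre-training step, using scikit-learn purely for illustration (the actual system embeds the networks inside a modified Alchemy, so this is an assumption about the workflow rather than the authors' code):

```python
from sklearn.neural_network import MLPClassifier

def pretrain_networks(station_data, n_hidden=5, n_epochs=5):
    """station_data: dict mapping (station, lead_time) to (X, y), where X holds
    the 12 continuous speed features and y the boolean congestion labels.
    Fits one small network per pair for a few epochs, as a warm start before
    the joint GS-MLN training."""
    networks = {}
    for key, (X, y) in station_data.items():
        net = MLPClassifier(hidden_layer_sizes=(n_hidden,), max_iter=n_epochs)
        net.fit(X, y)  # intentionally few iterations: only a warm start
        networks[key] = net
    return networks
```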
We compared the obtained results against three competitors:
Trivial predictor The seasonal average classifier predicts, for any time of the
day, the congestion state observed on average in the training set at that
time. Although it is a baseline predictor, it is widely used in the literature as a
competitor.
SVM We used SVMs as a representative of state-of-the-art propositional clas-
sifiers. A different SVM with RBF kernel was trained for each station and
for each lead time, performing a separate model selection for the C and γ
values to be adopted for each measurement station (a sketch of this baseline
is given after this list). The features used by the SVM predictor consist of
the speed time series observed in the past 180 minutes, aggregated at
15-minute intervals, hence producing 12 features, plus an additional feature
representing the seasonal average at the current time. A Gaussian
normalization was applied to all the features.
Standard MLN When implementing the classifier based on standard MLNs,
the speed time series had to be discretized in order to be used within the
model. Five different speed classes were used, and the quantization thresholds
were chosen by following a maximum-entropy strategy. The trend of the speed
time series was modeled by a set of formulae, used in place of formula 5,
in which the predicate Speed_Past_j(node, speed_value) encodes the discrete
value of the speed at the j-th time step before the current time. Note that
an MLN containing only these formulae essentially represents a logistic
regression classifier taking the discretized features as inputs. All remaining
formulae were identical to those used in conjunction with the GS-MLN.
As for the predictor based on GS-MLNs, there is no need to use discretized
features, but the same vectors of features used by the SVM classifier can be
adopted.
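A sketch of the per-station, per-lead-time SVM baseline with scikit-learn; the parameter grids and the dataset layout are illustrative assumptions, not the paper's exact settings.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_svm_baseline(datasets):
    """datasets: dict mapping (station, lead_time) to (X, y), where X has the
    12 past 15-minute speed averages plus the seasonal-average feature."""
    models = {}
    param_grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": [0.01, 0.1, 1]}
    for key, (X, y) in datasets.items():
        # Gaussian normalization of features + RBF-kernel SVM,
        # with model selection performed separately for each station
        pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
        search = GridSearchCV(pipe, param_grid, scoring="f1", cv=3)
        search.fit(X, y)
        models[key] = search.best_estimator_
    return models
```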
Table 2. Percentage of true ground atoms, for each measurement station. The per-
centage of days in the train/test set containing at least one congestion is reported in
the last two columns.
Station % pos train % pos test % pos days train % pos days test
A 11.8 9.2 78.3 70.7
B 5.8 4.9 60.0 53.4
C 16.8 13.7 66.6 86.9
D 3.4 2.3 45.0 31.0
E 28.2 22.9 86.7 72.4
F 3.9 1.8 51.6 31.0
G 1.9 1.7 30.0 22.4
Given the unbalanced data set, we compare the predictors on the F1 measure, i.e., the harmonic mean between precision and recall:

\[
P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2PR}{P + R}.
\]

Table 3 shows the F1 measure, averaged per station. The advantages
of the relational approach are much more evident when increasing the prediction
horizon: at 45 and 60 minutes ahead, the improvement of the GS-MLN model is
statistically significant, according to a Wilcoxon paired test, with p-value < 0.05.
Detailed comparisons for each sensor station at 15, 30, 45, and 60 minutes ahead
are reported in Tables 4, 5, 6, and 7, respectively. These tables
show that congestion at some of the sites is clearly “easier” to predict than at
other sites. Comparing Tables 4-7 to Table 2 we see that the difficulty strongly
correlates with the data set imbalance, an effect which is hardly surprising. It
is also often the case that GS-MLN significantly outperforms the SVM classifier
for “difficult” sites. The comparison between the standard MLN and the GS-
MLN shows that input quantization can significantly deteriorate performance,
all other things being equal. This supports the proposed strategy of embedding
neural networks as a key component of the model.
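The evaluation protocol (F1 on the positive class per station, and a Wilcoxon paired test across stations) can be sketched as follows; the per-station F1 values below are placeholders, not the paper's results.

```python
from scipy.stats import wilcoxon

def f1_score_from_counts(tp, fp, fn):
    """F1 = 2PR / (P + R) with P = TP/(TP+FP), R = TP/(TP+FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# paired significance test on per-station F1 scores of two predictors
f1_model_a = [80.1, 75.3, 62.0, 55.4, 70.2, 58.9, 49.5]  # illustrative numbers
f1_model_b = [79.0, 70.1, 55.2, 50.3, 69.8, 51.0, 42.2]  # illustrative numbers
stat, p_value = wilcoxon(f1_model_a, f1_model_b)
print("Wilcoxon paired test p-value:", p_value)
```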
An interesting performance measure considers only those test cases in which
traffic conditions are anomalous with respect to the typical seasonal behavior.
To this aim, we restricted the test set by collecting only those interpretations
for which the baseline seasonal average classifier would miss the prediction of
the current congestion state. Table 8 shows that the advantage of the relational
approach is still evident for long prediction horizons.
The experiments were performed on a 3 GHz processor with 4 MB cache. The
total training time was 40 minutes for the SVMs and 7-8 hours for the GS-MLN. As for
testing times, both systems perform in real time.
Table 3. Comparison between the tested predictors. Results show the F1 on the positive class, averaged on the seven nodes. A symbol next to an entry indicates a significant loss of the
method with respect to GS-MLN, according to a Wilcoxon paired test (p-value < 0.05).
15 m 30 m 45 m 60 m
Seasonal Avg 38.3 38.3 38.3 38.3
SVM 81.7 68.6 56.4 51.8
MLN 59.5 56.5 53.6 50.4
GS-MLN 80.9 69.2 61.6 56.9
Table 8. Comparison between the tested predictors, only on those cases where the
seasonal average predictor fails. Results show the F1 on the positive class, averaged on
the seven nodes.
15 m 30 m 45 m 60 m
SVM 81.4 69.1 59.1 59.2
MLN 39.9 47.6 48.4 41.6
GS-MLN 78.4 68.2 68.4 65.5
The problem of missing or incomplete data is crucial in all time series forecasting
applications [3, 4]: in the case of isolated missing values, a reconstruction
algorithm might be employed in order to interpolate the signal, so that prediction
methods might be applied unchanged. Occasionally, sensor faults can last several
time steps, and when this happens, a large part of the input can be unavailable to
a standard propositional predictor until the sensor recovers from the failure state.
Of course, cases containing missing data can be filtered from the training set as
we did for our previous experiments. However, in order to deploy a predictor
on a real-time task, it is also necessary to handle the case of missing values at
prediction time. A relational model can in principle be more robust than its
propositional counterpart by exploiting information from nearby sites.
In this section we report results obtained by simulating the absence of several
values within the observed time series, using the seasonal average predictor (Sec-
tion 3.2) as the reconstruction algorithm for the unobserved data. Producing an
accurate model of sensor faults is clearly beyond the scope of this paper and we
built a naive observation model based on a two-state first-order Markov chain
with P(observed → observed) = 0.99 and P(reconstructed → reconstructed) =
0.9. The performance of the predictors on this task is shown in Table 9.
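A sketch of this two-state observation model (the helper name and the usage at 15-minute resolution are assumptions; the reconstruction itself falls back on the seasonal average):

```python
import random

P_STAY_OBSERVED = 0.99       # P(observed -> observed)
P_STAY_RECONSTRUCTED = 0.9   # P(reconstructed -> reconstructed)

def simulate_fault_mask(n_steps, seed=0):
    """Return a boolean list: True where the reading is observed, False where
    it must be replaced by the seasonal-average reconstruction."""
    rng = random.Random(seed)
    observed = True
    mask = []
    for _ in range(n_steps):
        mask.append(observed)
        stay = P_STAY_OBSERVED if observed else P_STAY_RECONSTRUCTED
        if rng.random() >= stay:
            observed = not observed
    return mask

# e.g. one mask per station over a test day at 15-minute resolution (96 steps)
print(sum(simulate_fault_mask(96)), "of 96 samples observed")
```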
Table 9. Comparison between SVM and GS-MLN in the presence of simulated missing values (F1 on the positive class).

15 m 30 m 45 m 60 m
SVM 79.0 63.2 53.6 48.8
GS-MLN 80.5 70.4 62.6 58.1

5 Conclusions
In this paper we have presented a method for collectively forecasting the congestion state at multiple nodes of a transportation network, and at multiple lead times in the future, exploiting the
relational structure of the domain. Our method is based on grounding-specific
Markov logic networks, which extend the framework of Markov logic in order to
include discriminative classifiers and generic vectors of features within the model.
Experimental results on a case study extracted from the Californian
PeMS data set show that the relational approach outperforms the propositional
one, in particular as the prediction horizon grows.
Although we performed experiments on a binary classification task, we plan
to extend the framework to multiclass classification and ordinal regression.
As a further direction of research, the use of Markov logic makes it possible
to extend the model by applying structure learning algorithms, so as to
learn relations and dependencies directly from data in an automatic way.
The proposed methodology is not restricted to traffic management: it can
be applied to several other time series application domains, such as ecological
time series for air pollution monitoring, or economic time series for marketing
analysis.
Acknowledgments
This research is partially supported by grant SSAMM-2009 from the Foundation
for Research and Innovation of the University of Florence.
References
1. B. Abdulhai, H. Porwal, and W. Recker. Short-term freeway traffic flow predic-
tion using genetically optimized time-delay-based neural networks. Transportation
Research Board, 78th Annual Meeting, Washington D.C, 1999.
2. L. Bottou. Stochastic learning. In O. Bousquet and U. von Luxburg,
editors, Advanced Lectures on Machine Learning, Lecture Notes in Artificial Intel-
ligence, LNAI 3176, pages 146–168. Springer Verlag, Berlin, 2004.
3. G. Box, G. M. Jenkins, and G. Reinsel. Time Series Analysis: Forecasting and
Control. Prentice Hall, 3rd edition, February 1994.
4. C. Chatfield. The Analysis of Time Series: An Introduction. Chapman &
Hall/CRC, sixth edition, July 2003.
5. T. G. Dietterich, P. Domingos, L. Getoor, S. Muggleton, and P. Tadepalli. Struc-
tured machine learning: the next ten years. Machine Learning, 73(1):3–23, 2008.
6. P. Domingos, S. Kok, D. Lowd, H. Poon, M. Richardson, and P. Singla. Markov
logic. In Probabilistic Inductive Logic Programming, pages 92–117, 2008.
7. B. Ghosh, B. Basu, and M. O’Mahony. Multivariate short-term traffic flow fore-
casting using time-series analysis. Trans. Intell. Transport. Sys., 10(2):246–254,
2009.
8. C. W. J. Granger and P. Newbold. Forecasting Economic Time Series (Economic
Theory and Mathematical Economics). Academic Press, 1977.
9. Z. Jia, C. Chen, B. Coifman, and P. Varaiya. The PeMS algorithms for accurate,
real-time estimates of g-factors and speeds from single-loop detectors. Pages
536–541, 2001.
10. Y. Kamarianakis and P. Prastacos. Space-time modeling of traffic flow. Comput.
Geosci., 31:119–133, March 2005.
11. B. S. Kerner, C. Demir, R. G. Herrtwich, S. L. Klenov, H. Rehborn, M. Aleksic,
and A. Haug. Traffic state detection with floating car data in road networks. In
Intelligent Transportation Systems, 2005. Proceedings. 2005 IEEE, pages 44–49,
2005.
12. M. Lippi and P. Frasconi. Prediction of protein beta-residue contacts by markov
logic networks with grounding-specific weights. Bioinformatics, 25(18):2326–2333,
2009.
13. R. B. Noland and J. W. Polak. Travel time variability: a review of theoretical and
empirical issues. Transport Reviews: A Transnational Transdisciplinary Journal,
22:39–54, 2002.
14. J. Peng, L. Bo, and J. Xu. Conditional neural fields. In Y. Bengio, D. Schuur-
mans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural
Information Processing Systems 22, pages 1419–1427. 2009.
15. D. L. Selby and R. Powell. Urban traffic control system incorporating SCOOT: design
and implementation. Institution of Civil Engineers Proceedings, 82:903–920, Oct
1987.
16. A. G. Sims. S.C.A.T.: the Sydney Co-ordinated Adaptive Traffic system. Symposium on
Computer Control of Transport 1981: Preprints of Papers, pages 22–26, 1981.
17. B. L. Smith and M. J. Demetsky. Short-term traffic flow prediction: neural network
approach. Transportation Research Record, 1453:98–104, 1997.
18. B. L. Smith and M. J. Demetsky. Traffic flow forecasting: Comparison of modeling
approaches. Journal of Transportation Engineering-ASCE, 123(4):261–266, Jul-Aug
1997.
19. B. L. Smith, B.M. Williams, and R. Keith Oswald. Comparison of parametric and
nonparametric models for traffic flow forecasting. Transportation Research Part C,
10(4):303–321, 2002.
20. S. Sun, C. Zhang, and G. Yu. A Bayesian network approach to traffic flow fore-
casting. IEEE Transactions on Intelligent Transportation Systems, 7(1):124–132,
2006.
21. P. Varaiya. Freeway Performance Measurement System: Final Report. PATH
Working Paper UCB-ITS-PWP-2001-1, University of California Berkeley, 2001.
22. S. Watson. Combining Kohonen maps with ARIMA time series models to forecast traf-
fic flow. Transportation Research Part C: Emerging Technologies, 4:307–318(12),
October 1996.
23. J. Whittaker, S. Garside, and K. Lindveld. Tracking and predicting a network
traffic process. International Journal of Forecasting, 13(1):51–61, 1997.
24. C. H. Wu, J. M. Ho, and D. T. Lee. Travel-time prediction with support vector
regression. IEEE Transactions on Intelligent Transportation Systems, 5(4):276–281,
December 2004.