Traffic and Mobility Data Collection For Real Time Applications
Annual Conference on Intelligent Transportation Systems
Madeira Island, Portugal, September 19-22, 2010
recognizing vehicles at multiple points in a network, where specific detectors are real or virtually installed. [Huang,

Table columns: Collection method; Network coverage; Volume; Classification; Travel times; OD flows; Sub-path flows.

GPS (wide-area): X X X
Cell-phone tracking (wide-area): X X X
Airborne sensors (wide-area): X X X

(1) Classification of the vehicles into two classes: regular cars and lorries, from vehicle volume and estimated length; not compatible with axle classification methods.

The first conclusion from the comparison of methods and technologies is the high complementarity between solutions, in terms of functionality and of the data gathered from the transportation system.

III. CASE STUDY AREA

For the experimental analysis of this research study, we use the A5 motorway, a 25 km inter-urban highway connecting Lisbon and Cascais on the west coast of Portugal, presented in Figure 3. The first stretch of this motorway, linking Lisbon to the National Stadium in Oeiras, 8 km long, opened to traffic in 1944 and became the first motorway in Portugal and one of the first in the world.

A5 Main Figures
Geometry (km)                         25
Nodes                                 14
Ramp connections                      64
Toll plazas                           6
Annual average daily traffic (AADT)   67,200

… toll collection service, based on dedicated short-range communications (DSRC) microwave communications, A5 toll plazas, located between kilometers 11 and 19, are equipped with electronic toll collection (ETC) devices. Due to the ETC use rate, additional DSRC detectors were installed to collect travel time information between location points. Automatic video processing cameras and microwave radars are now under deployment to densify the sensory infrastructure, as shown in Figure 4.

Behind the roadway telematics infrastructures and systems, Brisa motorways, including the A5, are fully equipped with private high-speed fiber-optic cabling and wireless solutions to enable real-time remote monitoring of network conditions, as well as traffic management, tolling operations and enforcement.

TABLE II
TELEMATICS SYSTEMS INSTALLED IN THE CASE STUDY AREA

Operating functions: traffic data collection; monitoring & control; management; tolling.

Telematics system              Number of units   Operating functions
Toll gates                     40                X X X
ETC toll gates                 20                X X X
PTZ video cameras              47                X X
Variable message signs (VMS)   9                 X
[Map of the A5 corridor. Place labels: Sintra, Amadora, Cascais, Oeiras, Lisboa. Legend: loop detectors, toll plaza, microwave radars, video processing, point-to-point DSRC.]
However, this straightforward approach to improving data quality becomes ineffective in case of road incidents. For this, extended algorithms for pattern analysis focus precisely on the identification of unexpected significant variations, so-called incident-affected data or outliers, in either measured or predicted data. Consistent values lead to the identification of outlying events, corresponding to roadway incidents. This automatic process for incident detection is extremely useful for network operations, to proceed with automatic responses and control. However, new incidents lead to unexpected traffic patterns commonly not found in the historical database. Therefore, data preprocessing can be used as a parallel process, valid and useful until an unpredicted event occurs and disrupts traffic conditions.

In short, data preprocessing is a chain of techniques applied to improve data quality through the completeness, consistency and simplification of datasets.

A. Data properties summarization

To be successful and effective with real-time data preprocessing it is essential to have a comprehensive, overall picture of existing datasets. It can be based on summarized representative data properties, including highlights of data values to be treated as noise or outliers [Han, 2006]. The technique is to understand the distribution of the data based on descriptive statistics, regarding both central tendency and dispersion of data. It includes mean, median, mode and midrange as measures of central tendency, and quartiles, inter-quartile range and variance as measures of dispersion.

[Figure 5: A scatter plot for FCD average speed versus point flow rate in the A5 corridor. Axis label: traffic flow rate (veh/h).]

Both central tendency and dispersion properties are obtained as distributive functions, calculated by partitioning the dataset into smaller subsets and computing measures for each subset. The global measure is obtained by merging the results.

The end result of this process step is a package of discrete characteristics about traffic data, either for a short sliding window reaching the real-time timeline or for the entire dataset under analysis, such as the complete day. The spatial domain of those characteristics is also flexible: they can be applied to a single measurement site, to a set of sites defined individually or within a corridor, or to the network as a whole.

However, for some of the following preprocessing steps and for some applications further on, it is essential to know how current traffic conditions compare with past periods in the historical database. Once again, the time and space scale used to find the reference period depends on the application goal.

B. Data cleaning – Missing values

Most real-time applications, such as monitoring and simulation, require complete datasets without missing values. Data cleaning techniques attempt to fill in missing values, smooth out noise while identifying outliers, and correct other inconsistencies in the data. Several methods endeavor to
complete missing gaps dynamically, implementing estimation strategies balanced between data accuracy and computational complexity. Beyond this technical approach to estimating missing data, there is an essential variable prior to the strategy definition: the gap size.

Taking into account the typical interval for roadside data acquisition and aggregation, varying from one to five minutes, a missing gap of up to 15 minutes, corresponding to a one-step iteration bootstrapping, can be estimated with a straight-line regression analysis, involving a response traffic value, $y$, and a single predictor time-based period, $x$. That is,

$$ y = w_0 + w_1 x \qquad (1) $$

where the regression coefficients are estimated by least squares over the $|D|$ available observations $(x_i, y_i)$:

$$ w_1 = \frac{\sum_{i=1}^{|D|} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{|D|} (x_i - \bar{x})^2} \qquad (3) $$

$$ w_0 = \bar{y} - w_1 \bar{x} \qquad (4) $$

[Figure 6: Missing data estimation using linear regression. Panel title: January 29th 2010 - Carcavelos/Oeiras Service Area. Series: 29.Jan with missing gap; 29.Jan estimated values; 29.Jan complete data. Time axis from 7:30 to 9:30.]

For larger missing data gaps (e.g. 45 to 120 minutes) that can still reasonably be attributed to a punctual system or communications failure, data completion can be estimated using historical data in combination with the most recent observations. For the present research work, gap sizes over 120 minutes are considered an input data failure and cannot be estimated from online and historical data sets.

A basic assumption is that the evolution of traffic patterns on a given day of the week is the same as the evolution of the traffic pattern on the corresponding day in a reference week, constructed as a moving average over several weeks in the past [Bellemans, 2000]. This construction requires dealing with scenarios such as "special" days or official holidays on a weekday, and days with major events, where traffic patterns may differ significantly from regular days.

For instance, to estimate a missing value for a time step $k$ in the current day $d$, we use the following equation,

$$ \hat{y}_d(k) = f \times y_{d-1}(k) \qquad (5) $$

using the factor (6),

$$ f = \frac{y_d(k-1)}{y_{d-1}(k-1)} \qquad (6) $$

where $d-1$ denotes the reference day. Figure 7 shows an example, where missing values are estimated through the interpolation process (5) using values from the reference day 22.Jan, the regular weekday from the previous week, and the correction factor.

[Figure 7: Missing data estimation for a 30 minutes missing gap using reference data interpolation. Time axis from 7:40 to 9:30.]

The big advantage of this method is its simplicity, which eases implementation and computation and is useful for real-time applications. However, it fails thoroughly when traffic patterns suddenly change because of a roadway incident or demand fluctuations. To handle that situation while keeping complexity low, we propose a statistical analysis to dynamically establish an effective connection supported by evolutionary traffic patterns.

Space-based similarity search
This new process is based on time-series data analysis and aims to find the most similar traffic pattern to the real-time sequence pattern, or trail, to be used as reference data to complete missing value gaps. Because of the heterogeneity and complexity of the network of sensors and of the traffic measurement database, we implement an experience-based strategy that starts with a space-time based correlation to identify […] The objective of this first step is to elect a representative pair of elements, to be used in the next step.

To summarize such linear connections between elements $\{x_t, x_{t-1}, x_{t-2}, \ldots, x_{t-n}\}$ and $\{y_t, y_{t-1}, y_{t-2}, \ldots, y_{t-n}\}$, the correlation coefficient $r$ is given by the formula

$$ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}} \qquad (7) $$

In other words, the coefficient of correlation is obtained by normalizing the covariance $\mathrm{cov}(x, y)$, dividing it by the product of the standard deviations of the two elements. Its square,

$$ r^2 = r \times r \qquad (8) $$

gives the proportion of the variance of one element that can be explained from knowledge of the second variable-element.

Therefore, shared variance is the variance accounted for in one element by another element. For each sensor or data source of the network, this space-based similarity process establishes a reference element, to be used in the following processing steps.

[…] real-time index. The trail structure is topologically linear, and can be represented by a vector of measures on the common timeline.

From the real-time traffic trail, this function aims to discover, in the historical database, the most similar trail, to be defined as the reference trail. The similarity between two trail structures can be measured by the normalized root mean square deviation (NRMSD). Let $T$ be the real-time trail, with $x_i$ the traffic values for the $i$-th element, and let the historical set of traffic trails be $\{T_{hj}\}$, where each $T_{hj}$ is the $j$-th trail index and $h_{j,i}$ is the counterpart $i$-th element.

$$ \mathrm{RMSD}(T, T_{hj}) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - h_{j,i})^2} \qquad (10) $$

In terms of percentage, RMSD is then divided by the range of computed values, $x_{max} - x_{min}$,

$$ \mathrm{NRMSD}(T, T_{hj}) = \frac{\mathrm{RMSD}(T, T_{hj})}{x_{max} - x_{min}} \qquad (11) $$

The reference trail is the historical trail that gives the minimum value of the NRMSD between the real-time trail $T$ and the historical trail $T_{hj}$,

$$ \min_j \; \mathrm{NRMSD}(T, T_{hj}) \qquad (12) $$

[Figure 8. Time axis from 7:40 to 9:30.]

The application of this combination of methods is shown in Figure 8. For the 29th of January, missing values in the point-to-point sensor Carcavelos/Oeiras Service Area, in the time period 8:20 to 8:50 AM, are estimated using data from
20th of January, found in the space-time similarity search. Final values are obtained by applying factor (6), where the reference data is the selected historical trail.

TABLE III
RESULTS COMPARISON FOR 40 MINUTES MISSING GAP

Method                              RMSD
Linear regression                   13.05
Interpolation with reference data   18.25
Space-time similarity search        12.40

As presented in Table III, for short missing value periods, both linear regression and the space-time similarity search estimate missing values satisfactorily. This favors the use of linear regression, due to its low computational complexity and easy implementation.

Next we compute and present estimated values for a large missing gap, for the same reference day.

Using the same package of techniques and methods, we produced a 90-minute missing values gap for the 29th of January in the same sensor. Figure 9 presents a graphical comparison of the estimation results, using precisely the same reference data sets.

[Figure 9. Vertical axis: average speed (km/h). Time axis from 7:40 to 9:30. Series: 29.Jan with missing gap; linear regression; interpolation with reference data.]

TABLE IV
RESULTS COMPARISON FOR 90 MINUTES MISSING GAP

Method                              RMSD
Linear regression                   21.03
Interpolation with reference data   29.81
Space-time similarity search        9.16

… the middle of the distribution in either direction”. They may be due to sensor noise, acquisition process instability, equipment degradation, computer or communication system failures, or human-related errors. However, in some applications, abrupt changes in the acquisition field may occur and cause upcoming observations to deviate from the bulk of values. In the case of traffic data measurement, such sudden changes are usually related to traffic congestion caused by accidents, broken-down vehicles or any other type of incident. For real-time applications, based on online data collection and processing, it is crucial to assure data quality through the identification and isolation of outliers.

In this research work, automated detection and removal of outliers were developed and integrated with automatic incident detection, in order to preserve all data, including such apparently unreliable data. Therefore, this process is made up of two distinct functions: i) identify, isolate and replace outliers; and ii) automatic data-driven incident detection. In this section we focus on the first function, in order to support real-time data applications.

The generation of outliers can be described by the additive outlier model from time-series analysis [Martin, 1986],

$$ y_t = x_t + v_t \qquad (13) $$

for $n \ge t \ge 0$, and the median value $\tilde{y}$. For a data point $p$, the distance $d_p$ is

$$ d_p = \left| y_p - \tilde{y}_p \right| \qquad (14) $$

[…] estimated value obtained from the process to complete missing values, defined previously.

[Figure. Axes: average speed (km/h) and distance $d_p$. Time axis from 0:20 to 7:00.]
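The median-distance test of equation (14) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the trailing window of previously accepted values used to compute the median, the fixed `threshold`, and the use of that median as the replacement value are all hypothetical choices (the text replaces outliers with the estimate from the missing-value process instead), and the speed values are made up.

```python
from statistics import median


def detect_outliers(values, window=5, threshold=20.0):
    """Flag and replace outliers using the distance d_p = |y_p - median|.

    Assumptions (not from the paper): the median is taken over the last
    `window` accepted values, a point is an outlier when d_p > threshold,
    and an outlier is replaced by that median.
    Returns (distances, flags, cleaned).
    """
    distances, flags, cleaned = [], [], []
    for y in values:
        history = cleaned[-window:]
        if len(history) < window:
            # Not enough history yet: accept the point as it is.
            distances.append(0.0)
            flags.append(False)
            cleaned.append(y)
            continue
        reference = median(history)
        d = abs(y - reference)
        is_outlier = d > threshold
        distances.append(d)
        flags.append(is_outlier)
        # Keep the series usable downstream by substituting the reference.
        cleaned.append(reference if is_outlier else y)
    return distances, flags, cleaned


# Made-up average speeds (km/h) with one spurious reading.
speeds = [92, 90, 91, 93, 89, 35, 90, 92, 91]
distances, flags, cleaned = detect_outliers(speeds, window=5, threshold=20.0)
```

Because the window is built from already cleaned values, a sustained drop in speed would keep being flagged; telling such a real incident apart from noise is the role of the second function, automatic incident detection.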
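Looking back at the three missing-value estimators compared in Tables III and IV, the sketch below shows one possible reading of each: the least-squares line of equations (1), (3) and (4), a reference-day scaling in the spirit of equations (5) and (6), and a trail search that minimizes the NRMSD. All function names are hypothetical, the form of the correction factor (last observed value divided by the reference value at the same time step) and the choice of the real-time trail's own range for normalization are assumptions, and the numbers are made up.

```python
from math import sqrt


def fit_line(xs, ys):
    """Least-squares straight line y = w0 + w1 * x; returns (w0, w1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    w1 = sxy / sxx
    w0 = y_bar - w1 * x_bar
    return w0, w1


def fill_by_regression(xs, ys, gap_xs):
    """Estimate values at the time steps gap_xs from the fitted line."""
    w0, w1 = fit_line(xs, ys)
    return [w0 + w1 * x for x in gap_xs]


def fill_by_reference(last_observed, ref_last, ref_gap):
    """Scale reference values by an assumed correction factor.

    The factor is the last observed value divided by the reference value
    at the same time step (an assumption about the paper's factor).
    """
    factor = last_observed / ref_last
    return [factor * r for r in ref_gap]


def rmsd(a, b):
    """Root mean square deviation between two equally long sequences."""
    return sqrt(sum((p - q) ** 2 for p, q in zip(a, b)) / len(a))


def nrmsd(trail, candidate):
    """RMSD divided by the range of the real-time trail (assumed range)."""
    value_range = max(trail) - min(trail)
    if value_range == 0:
        raise ValueError("trail has zero range; NRMSD is undefined")
    return rmsd(trail, candidate) / value_range


def most_similar_trail(trail, historical):
    """Return (key, score) of the historical trail with the lowest NRMSD."""
    scores = {key: nrmsd(trail, h) for key, h in historical.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]


# Made-up data: four observed steps, then a two-step gap.
regression_fill = fill_by_regression([0, 1, 2, 3], [100, 98, 96, 94], [4, 5])
reference_fill = fill_by_reference(45, 90, [88, 84])
best_key, best_score = most_similar_trail(
    [60, 50, 40],
    {"day_a": [62, 51, 41], "day_b": [90, 88, 85]},
)
```

In practice the selected trail would supply `ref_last` and `ref_gap` for the scaling step, which mirrors how the text combines the similarity search with factor (6).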
D. Data reduction

For conventional surveillance systems, raw data is gathered from a wide variety of sensing methods, installed either on the network operators' infrastructure or on fleet vehicles, as previously described. This increasing array of data sources makes it increasingly difficult to achieve the desired results using the whole data [Huang et al., 2009]. Some of the main reasons lie in the intrinsic characteristics of data sources and types, which: 1) are deployed with uneven density over the network; 2) are heterogeneous in type; 3) provide highly correlated data; 4) report at non-uniform resolution; 5) report at different frequencies. Data reduction techniques can be applied to work around some of those difficulties by harmonizing data references and dimensions, and by bringing down the size and complexity of datasets while maintaining the integrity of the original data.

In the present work we design a two-tier approach for data reduction: i) at the data acquisition level; and ii) at the data fusion level. In the first tier, the data acquisition process computes raw or elementary data and, along with event detection, proceeds with data aggregation and summarization per regular periods, varying in space, time and measure type. In the second tier, heterogeneous data sets are merged together to obtain a single data platform; this is defined as data integration and fusion.

Data fusion is the process of merging information gathered from various heterogeneous sensors into a single data platform. In a space-time domain such as the traffic field, data fusion is synonymous with data integration and aims to combine diverse data sets into a unified, or fused, data set which includes all of the data points and time frames from the input data sets. The resulting data set differs from a simple merged superset in that its data tuples contain attributes and metadata which might not have been included for these points in the original data set.

TABLE V
TRAFFIC DATA, INFORMATION AND KNOWLEDGE

Data acquisition and preprocessing               Data fusion              Knowledge and decision making
Elementary data       Object data                Situation information    Response
- Sensor data         - Point speed              - Road point conditions  - Traffic control
- Tolling data        - Point flow               - Link conditions        - Driver warning
- Road segment data   - OD travel times          - OD conditions          - Congestion pricing
- GPS data            - OD flows                 - Driver-choice options  - Maintenance
- Cell-phone data     - Class-relative density   - Incidents              - (…)
- (…)                 - (…)                      - (…)

Table V presents a data architecture overview, concerning the evaluation chain from elementary data to knowledge for decision making support.

V. DISCUSSION AND FUTURE WORKS

Real-world traffic databases are highly susceptible to noise, redundancy and inconsistent data due to their typically huge size and their likely origin from multiple, heterogeneous sources and sensory technologies. Low-quality traffic data will lead to low-quality processing results. For real-time traffic data applications this postulate is even more significant, since low-quality information for decision support will lead to ineffective control and management. Data preprocessing and cleaning define a set of techniques and methods to analyze databases, identify errors and inconsistencies, and proceed with dataset correction, completion and simplification.

This paper presents a suite of combined methods to analyze and summarize real-time traffic data sets, to estimate missing values, and to identify and correct outliers in datasets. Whether used separately in the simplest way for short missing gaps, or in a more complex way combining several methods and time-space based historical datasets, data preprocessing techniques aim to improve data quality through harmonization and completeness.

Our contribution is the design and implementation of a systematic methodology to measure, check and repair traffic data in order to enable the following steps in the processing chain, up to decision making and transfer. With this approach, which faces data problems at an early stage, including data simplification, reduction and integration, we promote the multi-purpose usage of traffic data. However, the key advantage of our process definition lies in real-time data applications, which must manage primitive data containing noise and errors. Future work in this research program will build on this work and will focus on data fusion and decision-making processes.

ACKNOWLEDGMENT

This research work is supported by Brisa Auto-Estradas de Portugal, a leading world toll motorway operator and transportation infrastructures manager. Brisa is also a member of the MIT-Portugal research program, and this work was developed with the active collaboration of the Brisa Innovation research group and the MIT ITS lab.

REFERENCES

[Antoniou et al., 2008] Antoniou, C., Balakrishna, R., and Koutsopoulos, H. N. (2008). Emerging data collection technologies and their impact on traffic management applications. Proceedings of the 10th International Conference on Application of Advanced Technologies in Transportation, Athens, Greece.

[Bellemans, 2000] Bellemans, T., Schutter, B., Moor, B. (2000). Data acquisition, interfacing and pre-processing of highway traffic data. Proceedings of Telematics Automotive 2000, Birmingham, UK, vol. 1, pp. 4/1-4/7, Apr. 2000.

[Han, 2006] Han, J., Kamber, M. (2006). Data Mining Concepts and Techniques (2nd Edition). Elsevier, Morgan Kaufmann Publishers.

[Huang et al., 2009] Huang, E., Antoniou, C., Ben-Akiva, M., Lopes, J., Bento, J. (2009). Real-time multi-sensor multi-source network data fusion using dynamic traffic assignment models. In proceedings of the 12th International IEEE Conference on Intelligent Transportation Systems.

[Huang, 2010] Huang, E. (2010). Algorithmic and Implementation Aspects of Online Calibration of Dynamic Traffic Assignment. Master's thesis, Massachusetts Institute of Technology.

[Klein, 2006] Klein, L., Mills, M., Gibson, D. (2006). Traffic Detector Handbook: Third Edition, Volume I. Report No. FHWA-HRT-06-108. Federal Highway Administration, USA.

[Liu, 2004] Liu, H., Shah, S., Jiang, W. (2004). On-line outlier detection and data cleaning. Journal of Computers and Chemical Engineering, pages 1635-1647. Elsevier.

[Martin, 1986] Martin, R., Yohai, V. (1986). Influence Functionals for Time Series. Ann. Statist., Volume 14, Number 3.

[Mendenhall, 1993] Mendenhall, W., Reinmuth, J., Beaver, R. (1993). Statistics for Management and Economics. Belmont, Duxbury Press.