eNTERFACE’08, August 4th – August 29th, Orsay-Paris, France
Final Project Report #9
Tracking-dependent and
interactive video projection
Matei Mancas (1), Donald Glowinski (2), Pierre Bretéché (3), Jonathan Demeyer (1), Thierry Ravet
(1), Gualtiero Volpe (2), Antonio Camurri (2), Paolo Coletta (2)
(1) FPMS, Mons, Belgium (2) Casa Paganini/InfoMus Lab, Genova, Italy
(3) Laseldi Lab, Montbéliard, France
Abstract
Gestures’ expressivity, as perceived by humans, may be related
to the amount of attention they attract. In this paper, we present
three experiments that quantify behavior saliency by the rarity of
selected motion and gestural features in a given context. The first
two deal with the current quantity of motion of a person's
silhouette compared to a brief history of his quantity-of-motion
values, and with the current speed compared to a brief history of
the person's speed. The third one focuses on the motion speed of
a person compared to the motion speed of the other persons around
him. Considering both features (speed and quantity of motion)
and contexts (space and time), we compute an attention index
providing cues on behavior novelty. This can be considered a
preliminary step towards an expressive gesture analysis based on
behavioral saliency. In order to achieve accurate tracking, a fusion
between the color and IR camera streams is performed. This fusion
gives us a tracking system that is robust to illumination changes and
partial occlusions.
Index Terms—computational attention, saliency, rarity, data
fusion, tracking, gestures.
I. PRESENTATION OF THE PROJECT
A. Introduction: towards a context-based gestural analysis
A lot of research effort has been devoted to robustly tracking
humans in a scene and to analyzing their gestures in
order to individuate and characterize their behavior.
Gestural analysis often applies to situations where either
the human on whom the analysis is carried out is previously
selected, or the same kind of analysis is performed on all the
subjects that can be distinguished in the scene. A recent
field of research aims at investigating collective behaviors [1].
Still, the object of the analysis is already defined and the
work mainly focuses on characterizing collective
displacements. The possibility of dynamically selecting the
person to analyze, or of adapting the analysis to the current
behavior of a person in a context-dependent way, would open
new directions for gesture research. Human beings naturally
show the capacity to dedicate their limited perceptual resources to
what is of particular interest. However, computers' capacity to
detect a behavior worthy of attention remains very limited.
B. Computational attention interest in Human-Computer
Interfaces (HCI)
In many real situations, where people interact freely together,
it can be difficult to select the participants that exhibit a
behavior worthy of attention.
The design of (expressive) gesture interfaces can gain from a
better understanding of individual- and context-dependent
human behavior, which can ensure their usability in more
naturalistic environments. Automatic attention cues can also
simplify information access in such complex situations, which
helps, in the HCI domain, to foster interaction by anticipating
the focus of attention, e.g. with an automatic zoom on the
Region Of Interest (ROI).
C. Work overview
The project consisted of two main steps:
A robust tracking system
This system uses both color video cameras and an infra-red
camera in order to be robust to illumination changes and
partial occlusions. The infra-red (IR) camera is much less sensitive
to light changes, provided the light sources are not directly in the
camera's field of view (FOV). The color camera's FOV is larger,
and it can keep the tracking process going if the infra-red markers
are occluded or out of the IR camera's FOV.
Motion attention: human-like reactions
Once participant tracking is robust enough to handle
naturalistic scenarios, an automatic attention index can be
computed, highlighting the movements that should be the
most "interesting" for a human observer. Attention is
computed both in a spatial and in a temporal context on
several features: speed and quantity of motion.
In section II, after a hardware and software overview, we
describe the video data acquisition and processing (blob
segmentation) for each of the two video modalities. In
section III, the fusion mechanism which led to a robust
tracking system is explained. Section IV deals with
attention computation in both spatial and temporal contexts.
Finally, we conclude with a discussion in section V. The
source code of this project and some video demos can be
found on the eNTERFACE 2008 workshop website [2].
II. SIGNAL ACQUISITION
A. Material and system overview
We developed a setup that analyzes human behavior in an
environment that is flexible with regard to illumination changes.
Three cameras were used to capture video: two "Eneo
VKC1354" color analog cameras with a 752×582 pixel
resolution at 25 frames per second (fps) and one "Imaging
Source DMK 31BF03" monochrome digital camera delivering
a 1024×768 pixel resolution at 30 fps. They were equally
spaced around a 3 × 3 meter area, at a height of 2.5
meters, looking down on the participants and
recording with constant shutter, manual gain and focus. Each
participant wore a red hat (for color segmentation)
and carried a halogen light which emits visible but also infra-red
light (for IR segmentation) in all directions. Figure 1 shows
the setup configuration.
A non-invasive video-based approach was adopted based on
the EyesWeb XMI free software platform. We were interested
in automatically extracting the displacement of people moving
in front of the camera and computing their motion features.
Head detection and tracking solutions were privileged to fully
exploit the reduced available space, obtain depth information
and avoid occlusions due to the interactions between
participants.
EyesWeb XMI (www.eyesweb.org) is a free software
platform [3]. It consists of two main components: a kernel and
a graphical user interface (GUI). The GUI manages interaction
with the user and provides all the features needed to design
patches. It allows fast development of custom interfaces
for use in artistic performances and interactive multimedia
installations. The kernel manages real-time data processing and
the synchronization of multimodal signals. It supports the
integration of user-developed plugins; an SDK (Software
Development Kit) is provided to simplify the creation of such
plugins with Microsoft Visual C++. The user-developed
plugins, together with the ones provided with
EyesWeb, are the building blocks that the end user can
interconnect to create an application (patch).
B. Blob Detection
The present system’s analysis of human activity starts from
foreground segmentation based on the analysis of the color
and infra-red video streams. This analysis provides a binary
mask of the spatial extension of the region of interest through
time (blob detection).
Fig. 1: Experimental setup
As described in Figure 2, two computers were used to
perform the IR and color video stream acquisition and
processing (IR- and color-based blob detection and tracking).
A third PC was used to achieve the data fusion between the IR
and color stream data and to perform further higher-level
processing and rendering.
Fig. 2: System overview
Color video stream
A skin-color detection algorithm was used on the signal
coming from the color camera. We developed a modified
version of the Continuously Adaptive Mean Shift algorithm
(CamShift), which is itself an adaptation of the Mean Shift
algorithm for object tracking. Our method consists of
manually selecting the color of interest (COI), which is
converted into the HSV colorimetric system. The set of colored
pixels is quantized into a one-dimensional histogram to create the
COI's model. Furthermore, a bandwidth of acceptable Hue
and Saturation values is defined to allow the tracker to
compute the probability that any given pixel value
corresponds to the selected color. In order to enhance the
system's robustness to illumination changes, the color model
was updated in several areas of the scene which have different
illumination. In that way, the color model was resistant to
moderate illumination changes, such as those present in our
scenario between the left side and the right side of the scene
(visible in Figure 3).
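This colour-model step can be sketched as follows (a minimal NumPy sketch under an assumed hue bandwidth and saturation threshold; the project itself used a modified CamShift inside EyesWeb, so the function names and values here are illustrative):

```python
import numpy as np

def rgb_to_hue(img):
    """Convert an RGB float image (values in 0..1) to hue in degrees (0..360)."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    mx, mn = img.max(-1), img.min(-1)
    d = np.where(mx > mn, mx - mn, 1.0)  # avoid division by zero for grey pixels
    h = np.select(
        [mx == r, mx == g],
        [(60 * (g - b) / d) % 360, 60 * (b - r) / d + 120],
        60 * (r - g) / d + 240)
    return np.where(mx > mn, h, 0.0)

def coi_probability(img, coi_hue, bandwidth=20.0, min_sat=0.2):
    """Probability that each pixel matches the colour of interest (COI)."""
    h = rgb_to_hue(img)
    mx, mn = img.max(-1), img.min(-1)
    sat = np.where(mx > 0, (mx - mn) / np.maximum(mx, 1e-9), 0.0)
    # Circular hue distance, mapped linearly to a probability inside the bandwidth
    diff = np.minimum(np.abs(h - coi_hue), 360 - np.abs(h - coi_hue))
    prob = np.clip(1.0 - diff / bandwidth, 0.0, 1.0)
    return np.where(sat >= min_sat, prob, 0.0)  # ignore unsaturated (grey) pixels

# A red patch on a grey background: only the patch should score high
img = np.full((8, 8, 3), 0.5)
img[2:5, 2:5] = [0.9, 0.1, 0.1]            # red-ish region (hue near 0)
prob = coi_probability(img, coi_hue=0.0)
mask = prob > 0.5                           # binary blob mask
```

The resulting mask plays the role of the binary foreground image passed to the blob detection step.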
Fig. 3: Color-based red hat blob detection
Infra-red video stream
The signal coming from the IR camera is not affected by
illumination variations, but it might be corrupted by light
reflections or by some static infra-red sources. We processed the
video stream with a background subtraction to eliminate static
elements. Then, we binarized the signal with an empirically-tested
threshold value to extract the moving regions of interest
(blobs). Figure 4 shows that the IR lights located on top of
the red hats are very clearly detected.
Fig. 4: Detection of the infra-red lights located on top of the red hats
C. Single and Multi-Blob Tracking
Resulting from the pre-processing step (color- or IR-based), we
obtained a binary image where white represents the
foreground objects. The next step is to assign a label that
identifies the different white blobs, and to track them. The
tracking result can be seen in Figure 5.
Fig. 5: Multi-blob tracking: from left to right, the labeling
(identification) of (1) the three blobs of the IR video stream and (2)
the three blobs of the color video stream
To achieve the tracking, we defined an adjacency
measure based on the n-connectivity of two pixels (Figure 6).
Fig. 6: From left to right, illustration of (1) 4-connectivity: the four
filled circles (pixels) are connected to the pixel of interest (the cross);
we define this case as adjacency = 1; (2) 8-connectivity, defined as
adjacency = 2; (3) 20-connectivity, defined as adjacency = 3. This
adjacency measure can be generalized to n-connectivity.
Based on this definition, we used the following
algorithm for image segmentation and tracking.
Definitions
Image segmentation image_sgm (video frame f) returns the
set of distinct connected components cc(f) such that each
pixel of an item in cc(f) is at a distance greater than the adjacency
threshold from any pixel belonging to other items in cc(f).
Valid region valid_reg is user-defined as the maximum
Euclidean distance between two blobs' barycentres in two
consecutive frames.
Minimum difference min_diff returns the following:
min_diff = α · dist(b1, b2) + β · | area(b1) − area(b2) |   (1)
where dist(b1, b2) is the Euclidean distance between the
barycentres of blobs b1 and b2, and α ∈ [0, 1] and β ∈ [0, 1] are
the weights of the position and area terms. These values are set
manually according to the camera's FOV and the foreground
objects to track (e.g. humans).
Procedure
Initialization: t = 0
cc(0) = image_sgm( frame(0) );
store cc(0) and assign a new label to each item in cc(0)
For t = 1, 2, …
cc(t) = image_sgm( frame(t) ); i.e. the set of distinct blobs
in the current frame
store cc(t)
for each item x in cc(t)
find the item y in cc(t-1) which minimizes min_diff(x, y)
if y is within valid_reg of x, then
{
assign x the same label as y
if x is recognized in N consecutive frames,
then the item is considered to be trackable (i.e.,
we can measure its velocity, direction, etc.)
else
the item is unstable (recognized but not tracked)
}
else
x is assigned a new distinct label
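The procedure above can be sketched in code as follows (a simplified sketch; the min_diff weights, the valid-region radius and the N-frame stability rule are illustrative assumptions, not the values used in the project):

```python
import math
from itertools import count

_labels = count(1)  # generator of new distinct labels

def min_diff(b1, b2, alpha=1.0, beta=0.01):
    """Eq. (1): weighted combination of barycentre distance and area difference."""
    dist = math.hypot(b1["x"] - b2["x"], b1["y"] - b2["y"])
    return alpha * dist + beta * abs(b1["area"] - b2["area"])

def track_step(prev, curr, valid_reg=30.0, N=5):
    """Label the blobs of the current frame from those of the previous frame."""
    for x in curr:
        if prev:
            y = min(prev, key=lambda b: min_diff(x, b))
            dist = math.hypot(x["x"] - y["x"], x["y"] - y["y"])
        if prev and dist <= valid_reg:
            x["label"] = y["label"]            # same target as the matched blob
            x["age"] = y.get("age", 0) + 1
            x["trackable"] = x["age"] >= N     # stable over N consecutive frames
        else:
            x["label"] = next(_labels)         # new distinct label
            x["age"], x["trackable"] = 1, False
    return curr

# One blob moves slightly (tracked); a far-away blob gets a new label
prev = [{"x": 10, "y": 10, "area": 50, "label": next(_labels), "age": 4}]
curr = track_step(prev, [{"x": 12, "y": 11, "area": 52},
                         {"x": 90, "y": 90, "area": 48}])
```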
III. DATA FUSION
A. A common reference
Prior to any fusion, we need to provide a common
reference for the signals to be fused. As we worked with
different cameras, their fields of view (FOV) were different. A
robust transformation was necessary to match the position of a
point computed on a frame from one video camera to the
position of the same point computed on a frame from another
video camera. A projective transformation meets this need,
since the perspective changes with the camera focal distance.
A correction of the radial distortion was not necessary, as the
distortion due to the wide-angle lenses was not very significant.
The projective transformation maps a quadrilateral
onto another quadrilateral. Since it is not a linear transformation, we
used homogeneous coordinates in order to compute the transformation
with a matrix product.
The transformation matrix Hab performing the projection of
the points Pa in image "a" onto the points Pb in image "b" (the
points being located on the same plane) can then be written in
homogeneous coordinates as:
Pb = Hab · Pa   (2)
In order to visually test the accuracy of our transformation, we
developed EyesWeb blocks which perform the projective image
transformation according to an input matrix Hab. We also
developed a block which composes this projective matrix from
the correspondence of four points in the original and the
reference image.
Figure 7 displays the entire transformation process. We first
selected four points in a snapshot of the IR and color video
streams using Matlab. Then, the coordinates of those points
are used by EyesWeb to compute the transformation matrix.
Finally, this matrix is used to warp the initial image: the
superposition of the color and IR images shows a good
registration in the plane where the points in the two images
were chosen.
transformed into a new video stream according to a projection
matrix Hab.
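The computation of Hab from four point correspondences can be sketched as follows (a minimal NumPy sketch of the standard four-point estimation with the last matrix entry fixed to 1; the coordinates are illustrative, and the project itself relied on Matlab for point selection and an EyesWeb block for the matrix computation):

```python
import numpy as np

def homography_from_points(src, dst):
    """Solve the 3x3 matrix H (with h33 = 1) mapping 4 src points to 4 dst points."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # Two linear equations per correspondence in the 8 unknowns of H
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = np.linalg.solve(np.asarray(A, float), np.asarray(b, float))
    return np.append(h, 1.0).reshape(3, 3)

def project(H, pt):
    """Apply H to a 2-D point in homogeneous coordinates (Eq. 2)."""
    p = H @ np.array([pt[0], pt[1], 1.0])
    return p[:2] / p[2]  # back to Cartesian coordinates

# Map the unit square onto an arbitrary quadrilateral
src = [(0, 0), (1, 0), (1, 1), (0, 1)]
dst = [(10, 10), (90, 20), (80, 95), (5, 80)]
H = homography_from_points(src, dst)
```

Applying `project` only to blob barycentres, rather than warping every pixel, is what keeps the per-frame cost low.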
Fig. 8. Left: initial video frames; right: real-time projective warping
using the OpenCV-based EyesWeb block
Once the registration was visually validated, the matrix Hab
was used to compute the projective transformation for the blob
baricenters and not for the whole image. The use of a newly
implemented EyesWeb block which performs matrix
multiplication greatly reduced the computational cost: this
solution avoided the computation of the projective
transformation for all pixels in each frame of the video stream.
Figure 9 shows the rendering of the final video stream with the
superimposition of the IR (green markers) and color video tracking
(red markers), which converge onto the participants' hats.
Fig. 9. IR tracking of the lights on top of the red hats (green), color
tracking of the top of the red hats (red)
Fig. 7. IR and color video stream registration process. From left to
right: selection of the four corresponding points in a color and IR
snapshot in Matlab, computation of the transformation matrix in
EyesWeb, image warping in EyesWeb
The projective transform blocks were implemented using
functions already available in OpenCV (Open Computer
Vision) [4]. OpenCV is a library developed by Intel with a
BSD license. EyesWeb can easily wrap OpenCV functions to
handle them as blocks in the software platform.
B. Confidence Level
The second step performed before the fusion was to compute a
confidence level on each modality (IR and color).
The weight (confidence level) changes according to the
participants' visibility with respect to the camera's field of
view (FOV) and to obstruction occurrences. It ranges from 0,
when the participants are not visible to the camera, to 1 when
they are visible. For the color modality, the confidence level
also gradually changes between 0 and 1 depending on
blob area variations: abrupt variations, due for example to
sudden illumination changes, decrease the confidence level,
whereas blob areas that are stable over time increase it.
Color video stream confidence level
Blob tracking with skin-color detection can lack
accuracy due to illumination variations, or even lose the
coordinates due to visibility issues (obstruction or
disappearance from the camera's FOV).
This confidence level is built on two hypotheses. The first one
is to consider location information obsolete after a short delay
(mobile object tracking). The second one is that abrupt
variations of a blob's area can be related either to some undetected
surface or to unwanted detections. According to these
assumptions, the confidence level CLVID is computed as the
conjunction of two terms:
CLVID = (blob in FOV) · (stable blob's area)   (3)
where CLVID ∈ [0, 1].
For the "blob in FOV" indicator, we used a clock generator to
check that the elapsed time since the last coordinates'
acquisition remains below an acceptable delay. The "stable
blob's area" indicator is computed on a temporal sliding
window: we compared the current blob's area with its moving
average and returned an inverse variation rate (the minimum
between the ratio and its inverse).
Infra-red video stream confidence level
The IR light tracking might be interrupted in two situations:
the blobs might be obstructed by the occlusion of one
participant by another, or they might move out of the camera
FOV. The blob detection could also be corrupted when the
tracked participants move close to each other. A binary
confidence level was developed to handle these tracking
issues: we measured the elapsed time since the last valid
detection and, if this delay exceeded an empirically-tested
threshold, "blob in FOV" was set to 0 until a new detection
occurred. The confidence level CLIR is computed as:
CLIR = (blob in FOV)   (4)
where CLIR ∈ {0, 1}.
C. Fusion algorithm
Single blob fusion
In order to fuse the 2D coordinates coming from the IR and
the color video streams, a weighted mean rule was applied. The
weights are the respective confidence levels previously
computed for the two modalities.
Extension to multi-blob fusion
In the case of multi-blob tracking, each blob's coordinate set
was extracted in each modality (color and IR), together with
its confidence level.
Nevertheless, it is not guaranteed that the blobs from the two
modalities remain linked to the same target during the
experiment. To prevent this issue, our method continuously
tests the relation between the coordinates of the corresponding
blobs in the two modalities. We created a new fusion
EyesWeb block which considers only the elements with non-null
confidence levels and matches each one of them in a modality
with the nearest-neighbour point (in Euclidean distance) in the
other modality. These couples are fused with the same weighted
mean rule described in the previous section. Figure 10 shows
the fusion results.
Fig. 10: Top-left: color image tracking; bottom-left: IR image
tracking; right: modality fusion
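The weighted mean rule and the nearest-neighbour matching can be sketched as follows (a minimal sketch; the coordinates and confidence values are illustrative, and the real EyesWeb block works on streamed data rather than lists):

```python
import math

def fuse(p_ir, cl_ir, p_vid, cl_vid):
    """Weighted mean of two 2-D points using their confidence levels as weights."""
    w = cl_ir + cl_vid
    if w == 0:
        return None  # neither modality sees the target
    return ((cl_ir * p_ir[0] + cl_vid * p_vid[0]) / w,
            (cl_ir * p_ir[1] + cl_vid * p_vid[1]) / w)

def fuse_multi(ir_blobs, vid_blobs):
    """Match each IR blob with its nearest colour blob, then fuse the pair.

    Each blob is ((x, y), confidence); only non-null confidences are considered.
    """
    fused = []
    candidates = [b for b in vid_blobs if b[1] > 0]
    for p_ir, cl_ir in (b for b in ir_blobs if b[1] > 0):
        # Nearest neighbour in Euclidean distance across modalities
        p_vid, cl_vid = min(candidates, key=lambda b: math.dist(p_ir, b[0]))
        fused.append(fuse(p_ir, cl_ir, p_vid, cl_vid))
    return fused

ir = [((10.0, 10.0), 1.0), ((50.0, 50.0), 1.0)]
vid = [((52.0, 48.0), 0.5), ((12.0, 11.0), 1.0)]
fused = fuse_multi(ir, vid)
```

With equal confidences the fused point is the midpoint; a lower confidence in one modality pulls the result toward the other.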
IV. SALIENT GESTURES: AN ATTENTION FILTER
A. Computational Attention
The aim of computational attention is to automatically predict
human attention on different kinds of data such as sounds,
images, video sequences, smells or tastes. This domain is
of crucial importance in artificial intelligence, and its
applications are countless, from signal coding to object
recognition. Intelligence is not due only to attention, but there
is no intelligence without attention.
Attention is also very closely related to memory through a
continuous competition between a bottom-up or unsupervised
approach which uses the features of the acquired signal and a
top-down or supervised approach which uses observer’s a
priori knowledge about the observed signal. We focused here
only on bottom-up attention due to motion.
While numerous models have been proposed for attention on still
images, time-evolving two-dimensional signals such as videos have
been much less investigated.
Nevertheless, some of the authors providing static attention
approaches generalized their models to the time dimension:
Dhale and Itti [5], Yee and Pattanaik [6], Parkhurst and Niebur
[7], Itti and Baldi [8], Le Meur [9] and Liu [10]. Motion has a
predominant place and the temporal contrast of its features is
mainly used to highlight important movements. Zhang and
Stentiford [11] provided a motion analysis model based on
comparing image neighborhoods in time. The limited spatial
comparison led to a “block-matching”-like approach providing
information on motion alone more than on motion attention.
Boiman and Irani [12] provided an outstanding model which is
able to compare the current movements with others from the
video history or video database. Attention is related to motion
similarity. The major problem of this approach lies in its high
computational cost.
As we already stated in [13] and [14], a feature does not attract
attention by itself: bright and dark, locally contrasted areas or
not, red or blue can equally attract human attention depending
on their context. In the same way, motion can be as interesting
as the lack of motion depending on the scene configuration.
The main cue which involves attention is the rarity or the
contrast of a feature in a given context. A pre-attentive
analysis is achieved by humans in less than 200 milliseconds.
How to model rarity in a simple and fast manner?
The most basic operation is to count similar areas in the
context. Within information theory, this simple histogram-based
approach is close to the so-called self-information. Let us note
mi a message containing an amount of information; this message
is part of a message set M. The self-information I(mi) of a
message is defined as:
I(mi) = −log( p(mi) )   (5)
where p(mi) is the probability that a message mi is chosen from
all possible choices in the message set M or the occurrence
likelihood. We obtain an attention map by replacing each
message mi by its corresponding self-information I(mi). The
self-information is also known to describe the amount of
surprise of a message inside its message set: rare messages are
surprising, hence they attract our attention.
We estimate p(mi) as a two-term combination:
p(mi) = A(mi) · B(mi)   (6)
The A(mi) term is the direct use of the histogram to compute
the occurrence probability of the message mi in the context M:
A(mi) = H(mi) / Card(M)   (7)
where H(mi) is the value of the histogram H for message mi and
Card(M) is the cardinality of M. The quantization of the set M
determines the sensitivity of A(mi): a coarser quantization lets
messages which are not identical but quite close be counted as
the same.
B(mi) quantifies the global contrast of a message mi within the
context M:
B(mi) = 1 − ( Σ j=1..Card(M) | mi − mj | ) / ( (Card(M) − 1) · Max(M) )   (8)
If a message is very different from all the others, B(mi) will be
low so the occurrence likelihood p(mi) will be lower and the
message attention will be higher. B(mi) was introduced to
avoid the cases where two messages have the same occurrence
value, hence the same attention value using A(mi) but in fact
one of the two is very different from the others while the other
one is just a little different.
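The rarity computation of Equations (5)-(8) can be sketched as follows (a minimal NumPy sketch; the number of histogram bins and the message values are illustrative assumptions):

```python
import numpy as np

def attention_scores(messages, n_bins=8):
    """Self-information-based attention for a set of scalar messages.

    A(m): histogram-based occurrence probability (Eq. 7).
    B(m): global-contrast term (Eq. 8).
    p(m) = A(m) * B(m) (Eq. 6); attention I(m) = -log p(m) (Eq. 5).
    """
    m = np.asarray(messages, dtype=float)
    card = len(m)
    # Quantize the messages into bins to build the histogram H
    bins = np.linspace(m.min(), m.max() + 1e-9, n_bins + 1)
    idx = np.digitize(m, bins) - 1
    hist = np.bincount(idx, minlength=n_bins)
    A = hist[idx] / card                                   # Eq. (7)
    # Global contrast: 1 - sum|mi - mj| / ((Card(M) - 1) * Max(M))
    diffs = np.abs(m[:, None] - m[None, :]).sum(axis=1)
    B = 1.0 - diffs / ((card - 1) * max(m.max(), 1e-9))    # Eq. (8)
    p = np.clip(A * B, 1e-12, 1.0)                         # Eq. (6)
    return -np.log(p)                                      # Eq. (5)

# A rare, contrasted value among many similar ones gets the highest score
scores = attention_scores([0.1, 0.12, 0.11, 0.1, 0.9])
```

Here the fifth message is both infrequent (low A) and globally contrasted (low B), so its self-information dominates.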
In order to get a fast model of motion attention, we propose here a
three-level rarity-based approach. While the first two
levels are bottom-up and use the motion features' context to
attract the computer's attention, the last one is mostly top-down:
it learns a model of the scene which is able to modify
bottom-up attention, for example by inhibiting some movements.
The three levels of motion attention we propose
are:
Low-level instantaneous motion attention:
Motion features are compared in the spatial context of the
current perceived frame. Rare motion behaviors should
immediately pop-out and attract attention. This low-level
approach is pre-attentive (reflex) and it uses no memory
capacities.
Middle-level short-term motion attention:
Once a moving object has been selected using low-level attention,
its behavior within a short temporal context is then observed.
Short-term memory (STM) is used here to save an object's
motion during 2 or 3 seconds (for longer time periods, the motion
details of an object are forgotten). Rare behaviors of an object
through time will be rated as interesting, while repetitive
motion will be less important.
High-level long-term motion attention:
This third, top-down attention approach uses long-term
memory capacities and is a first step toward motion and
scene comprehension. The attention level of each pixel is
accumulated through time, which leads to areas of the scene
that concentrate attention more than others: a street
accumulates more attention through time than a grassy area
close to it; a tree which moves because of the wind or a
flickering light will also accumulate attention through time.
The scene can thus be segmented into several areas of attention
accumulation, and the motion in these areas can be
summarized by a single motion vector per area. If a moving
object passes through one of these areas and has a motion
vector similar to the one summarizing this area, its attention
will be inhibited. If the object is outside those segmented
attention areas, or its motion vector differs from the one
summarizing the area it passes through, the moving
object will be assigned a very high attention score. This third
attention step builds an attention model, learnt from the
instantaneous and short-term attention steps, which can
inhibit bottom-up attention when the motion corresponds to the
model, or enhance it when the motion does not match this model.
Within this project we developed an implementation of the
two bottom-up motion attention levels, comparing current motion
features with a spatial and a short-term temporal context. The
third, top-down attention model is further discussed in section
V of this article.
B. Instantaneous motion attention
An implementation of the spatial motion rarity was achieved
as an EyesWeb XMI patch by using Equation (5) with p(mi)=
B(mi). In the scenario three people were tracked and their
instantaneous speed was used. As only 3 motion vectors were
available, the computation of the rarity A(mi) had not much
sense from a statistical point of view.
Fig. 11 shows part of the tested scenario. Three people, moving
or not, are present in front of the cameras. Their instantaneous
velocity vectors V1, V2, and V3 are computed.
Fig. 11: Left: the fastest moving object (V1) is the most important
(hot red); right: the slowest moving object (V3) is the most important.
Fig. 12: The abrupt speed change is more important (red) than the
speed value: the amplitudes of V(T1) and V(T4) are the same but they
have different attention scores. A similar behavior can be seen with
V(T2) and V(T3).
On the left-side V1 is very different from V2 and V3: V1 has
high speed amplitude while V2 and V3 are very slow or
stopped. In this case the faster object has a higher attention
score (hot red) compared with the lower attention score
(darker red) of V2 and V3. This situation is comparable with
the one which often occurs when a classical motion detection
module is used, where faster motion is highlighted.
That is not the case on the right image where the most
different speed is V3 (stopped) while V1 and V2 are quite the
same (very fast). In this case, the attention score is higher (hot
red) on V3 which does not move while fast moving objects do
not attract a large amount of attention. The result of this
approach shows a different behavior compared to a simple
motion detection algorithm. This first step enables the
computer to choose the most “interesting” moving object very
efficiently and then to apply to it short-term motion attention.
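With only three velocity amplitudes, this spatial attention reduces to the contrast term B(mi) (a small sketch based on Equations (5) and (8); the velocity values are illustrative):

```python
import numpy as np

def spatial_attention(speeds):
    """Attention from global contrast only: I = -log(B) (Eqs. 5 and 8)."""
    m = np.asarray(speeds, dtype=float)
    # Sum of absolute differences of each speed against all the others
    diffs = np.abs(m[:, None] - m[None, :]).sum(axis=1)
    B = 1.0 - diffs / ((len(m) - 1) * m.max())
    return -np.log(np.clip(B, 1e-12, 1.0))

# Left image of Fig. 11: the single fast mover among stopped people pops out
left = spatial_attention([5.0, 0.2, 0.1])    # argmax is index 0 (V1)
# Right image: the stopped person among fast movers pops out
right = spatial_attention([5.0, 4.8, 0.0])   # argmax is index 2 (V3)
```

The same formula highlights the fast mover in one context and the stopped person in the other, which is exactly the context dependence described above.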
Fig. 12 shows the tested scenario. One participant moves his
head from left to right very fast, then normally, and finally he
stops. When a change in speed is detected (stop to normal
speed, normal speed to high speed, high speed to stop, etc.),
the attention score is very high, but it decreases exponentially
when the speed remains stable.
Silhouette-related features
The same algorithm as in the previous point was applied but
the feature used here was the Quantity of Motion (QoM). This
measure is obtained by integrating in time the variations of the
body silhouette (called Silhouette Motion Images - SMI).
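The QoM feature can be sketched from successive binary silhouette masks (a minimal sketch; the SMI is approximated here as the union of recent silhouettes minus the current one, and the history length is an illustrative assumption):

```python
import numpy as np

def quantity_of_motion(silhouettes, history=4):
    """QoM: area of the Silhouette Motion Image (SMI), normalised by the
    current silhouette area. The SMI keeps the pixels covered by the body
    in recent frames but vacated in the current one, i.e. where it moved."""
    current = silhouettes[-1].astype(bool)
    smi = np.zeros_like(current)
    for s in silhouettes[-history:]:
        smi |= s.astype(bool)    # union of recent silhouettes
    smi &= ~current              # keep only vacated pixels
    area = max(int(current.sum()), 1)
    return smi.sum() / area

# A 1-pixel-wide vertical bar shifting right by one column per frame
frames = [np.zeros((5, 8), dtype=int) for _ in range(3)]
for t, f in enumerate(frames):
    f[:, t + 2] = 1
qom = quantity_of_motion(frames)
```

A stationary silhouette yields a QoM of zero; the faster the silhouette changes, the larger the vacated area and thus the QoM.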
C. Short-term motion attention
Trajectory-related features
The moving object's speed was computed on a 3-second history,
and the mean speed was computed on a 10-frame sliding
window in order to avoid too high a variability of the speed
due to segmentation and tracking noise.
The speed range was divided into 3 bins: static or very low
speed, normal speed, and high speed. Equation (5) was
applied with p(mi) = A(mi) because there was enough data
within the 3-second history to get a statistically reliable
occurrence likelihood p(mi). Moreover, the contrast between
the 3 speed bins is always very high, so the B(mi) term is here
less important.
Fig. 13: The abrupt QoM change is more important (red) than its
value: the QoM at T1 and T2 are the same but they have different
attention scores. The same behavior can be seen between T3 and T4.
Fig. 13 shows the test case. One participant is stopped, then
moves normally, then moves a lot. If he moves fast while, within
his temporal context, he is mostly stopped, this will be
interesting. But if he moves fast while, within his temporal
context, he also mostly moves fast, the computer will classify
this movement as uninteresting because it is repetitive and
brings no new information.
As in section IV.B, the short-term motion attention algorithm
was implemented as an EyesWeb XMI patch.
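The short-term attention over a speed history can be sketched as follows (a minimal sketch; the bin edges and the window length, roughly 3 seconds at 25 fps, are illustrative assumptions):

```python
import numpy as np
from collections import deque

class ShortTermAttention:
    """Rarity of the current speed within a short history: Eq. (5) with
    p(mi) = A(mi) from Eq. (7). Rare speed bins yield high attention."""

    def __init__(self, history_len=75, bins=(0.5, 2.0)):
        self.history = deque(maxlen=history_len)  # about 3 s at 25 fps
        self.bins = bins                          # static | normal | fast

    def update(self, speed):
        self.history.append(speed)
        # Histogram of the 3 speed bins over the short history
        idx = np.digitize(list(self.history), self.bins)
        hist = np.bincount(idx, minlength=3)
        p = hist[np.digitize([speed], self.bins)[0]] / len(self.history)
        return -np.log(max(p, 1e-12))             # attention index, Eq. (5)

sta = ShortTermAttention()
for _ in range(70):
    sta.update(0.1)         # the participant is mostly stationary...
a_static = sta.update(0.1)  # one more static sample: frequent, low attention
a_fast = sta.update(5.0)    # ...then a sudden fast movement: rare, high attention
```

A repeated behavior drives its bin probability toward 1 (attention near 0), while a behavior that is rare within the window spikes the index.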
D. Spatial and temporal motion attention fusion?
The previous two points (B and C) provide rarity-based
attention indexes for moving objects based on spatial and on
short-term temporal contexts. An interesting question concerns a
possible fusion of the results of those two approaches: how
can one simultaneously take into account attention based on two
contexts which do not have the same nature?
It is impossible to compare a given temporal context with a
spatial one on the same basis: time and space are orthogonal
which is also confirmed by the fact that time and space
features are processed in two separate regions of the brain
[15]. Instead of trying to fuse those two kinds of attention
indexes, it seems more realistic to state that one of them
(spatial context) is pre-attentive and it occurs first while the
second one (short-term) is attentive and it focuses on the
temporal context of an object previously selected by the
spatial attention. Once this object is analyzed by the temporal
attention, another one which may spatially pop-out can be
tracked and its attention computed. Thus, spatial attention
selects potential interesting moving targets while temporal
attention verifies if those targets have also interesting
behaviors through time.
In our opinion, it is much more relevant to fuse the motion
spatial attention map with a static image attention map based
on color and gray-level rarity [16]. As the context is in this
case the same, it is realistic to compute rarity cues and
compare static image (background) and motion (foreground)
rarity to get a final spatial attention which takes into
account motion but also the other static pre-attentive features
such as color or shape.
V. RESULTS AND DISCUSSION
A. Results and rendering
We developed a solution to track human movement based on
the fusion of two video modalities: a color and an IR system.
This solution showed robustness with respect to changing
light conditions and partial occlusions.
We analyzed the spatio-temporal profiles of people's activities at
different levels of detail. At a gross level, human activity was
used to compute rarity, and thus attention indexes, which provide
additional information compared to a simple motion detection
algorithm. We showed that motion perception can be radically
different from simple motion detection: depending on the
context, the lack of motion appeared to be more "interesting"
than motion itself. The spatial context is used to select the
participant who exhibits the highest saliency. Then, the
selected participant can be tracked over a short period of time in
order to see whether he also has a salient behavior over time. This
project is a first step towards systems which have more
human-like reactions and perceive signals instead of only
detecting them.
B. Further improvements
Several improvements could be made to the already
implemented system.
Video stream synchronization
The blob detection data flows were not synchronized: a delay
could be observed between the tracking achieved on the IR video
stream and the one achieved on the color video stream. This is
due to the fact that the analog cameras and the digital camera
did not have the same response time. The output signal of the
analog cameras must be digitized with an external FireWire A/D
converter, and we could observe a time lag between the video
output and the filmed scene.
Moreover, the computation time for the color video processing
is longer than that needed for the IR video processing.
Finally, the computers we used for the two video stream
processing chains did not have the same performance in terms of
CPU and graphics card.
The data flow synchronization could be improved if both
cameras were digital and had the same characteristics. The
synchronization could be even simpler if both cameras were
acquired on the same computer using several FireWire ports.
In this case, the ability of EyesWeb XMI to synchronize
several processes on a single platform could be used, and the
two data flows would be perfectly synchronized. EyesWeb
XMI has highly improved multimodal synchronization
possibilities [3]: each block has two additional pins, called
"Sync-in" and "Sync-out", which can be used to propagate
synchronization clock signals between blocks.
analyzed in terms of the spatial trajectory of moving blobs
corresponding to heads. However trajectory, by itself, hardly
provided detailed information about the performed gestures
[17]. A more-detailed level of person’s activity was analyzed
in terms of full-body quantity of motion (QoM) and found
more relevant to characterize motion sequences and related
gestures.
An attention index was developed to highlight motion
saliency. Both the spatial and temporal contexts were used to
Confidence level improvement
The present confidence level (CL) characterizes how confident we
should be in the tracked object position in each
modality (IR and color video streams) in order to fuse them
using a weighted mean. In the current implementation, we
only used two assumptions: the time elapsed since the last valuable
detected position, and the smoothness of the detected blob's
variation over time. A first way to improve the modality
confidence level could consist in adding other assumptions.
For example, as we are tracking human heads, we could
assume that an abrupt speed variation is correlated with tracking
inaccuracy and decrease the positioning CL accordingly.
An additional improvement could consist in defining a
confidence level for the fusion itself, which we call the fusion
confidence level (FCL). A tracked object position could have
a high CL in both modalities, but the final fusion may not
necessarily be accurate if, for example, the two positions are very
far from each other. Other elements such as speed and direction
could help in deciding whether two positions from two different
modalities can be fused. The FCL is thus able to point out whether
the data fusion should be considered or not.
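A minimal sketch of the CL-weighted mean together with a distance-based FCL gate follows; the particular FCL formula and the `max_dist` threshold (in pixels) are our own illustrative assumptions, not the implemented system:

```python
import numpy as np

def fuse_positions(pos_ir, cl_ir, pos_color, cl_color, max_dist=50.0):
    # Fuse two tracked 2D positions by a CL-weighted mean, and compute
    # a fusion confidence level (FCL) that drops to zero when the two
    # modalities disagree by more than max_dist pixels (assumed value).
    pos_ir = np.asarray(pos_ir, dtype=float)
    pos_color = np.asarray(pos_color, dtype=float)
    # Weighted mean of the two modality positions.
    fused = (cl_ir * pos_ir + cl_color * pos_color) / (cl_ir + cl_color)
    # FCL: linear penalty on the inter-modality distance.
    dist = np.linalg.norm(pos_ir - pos_color)
    fcl = max(0.0, 1.0 - dist / max_dist)
    return fused, fcl

# Positions that roughly agree yield a high FCL and a fused point
# lying between the two inputs, pulled towards the higher-CL modality.
fused, fcl = fuse_positions([100, 100], 0.9, [104, 102], 0.6)
```

A downstream tracker could then discard the fused position whenever the FCL falls below some threshold, as suggested above.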
Additional features: getting closer to human attention
An interesting improvement could also be achieved by
computing attention indexes for many other features. One can
think first of motion direction, which was not taken into
account in this implementation: only the speed attention
was computed. If the speed of a moving object is very low, the
motion direction information is not very reliable, because it can
be due only to detection noise. Nevertheless, if the speed is
high enough, the information brought by motion direction is
very important. From an attentional point of view, if many
people share a common direction while one participant moves in a
different direction, this is a very important cue of
saliency. The same idea can be applied to temporal
attention, where abrupt direction changes are very interesting.
A future work will consist in reaching a more complete
description of human activities. The head tracking should be
complemented with information about overall body motion. Camurri
et al. [18] revealed that bounding box variations or ellipse
inclination, which approximate the 2D translation of the body, can
account for expressive communication. A more detailed
analysis of the upper body parts should also be accomplished.
Glowinski et al. [19] showed that color-based tracking of head
and hands can reveal expressive information related to emotion
portrayals. On the basis of this refined description of human
movement, we could consider new motor cues (e.g. symmetry,
directivity, contraction, energy, smoothness) accounting for
the communication of expressive content [18], which could be
integrated and processed by the attention index for a more
pertinent context-based analysis of expressivity.
Even if the fusion between spatial and temporal attention is not
meaningful (as discussed in section IV.D), it is crucial to fuse
attention information coming from several features in the
same context (space or time). This fusion can simply be done
by using the maximum operator: if a feature highly attracts
attention in a specific place, this area is very interesting, at
least from this feature's point of view.
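The maximum-operator fusion can be sketched as follows, with two hypothetical per-feature attention maps (the map sizes and values are invented for illustration):

```python
import numpy as np

# Hypothetical 4x4 attention maps for two features computed in the
# same spatial context: speed rarity and direction rarity.
speed_att = np.zeros((4, 4))
speed_att[1, 2] = 0.9          # one blob moves unusually fast
direction_att = np.zeros((4, 4))
direction_att[3, 0] = 0.7      # another moves against the crowd

# Maximum-operator fusion: a location becomes interesting as soon as
# any single feature makes it pop out.
fused_att = np.maximum(speed_att, direction_att)
```

Both salient locations survive in the fused map, each at its own feature's score, while locations salient in no feature stay at zero.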
High-level motion attention
In real life, when we observe a scene which does not change
all the time (with a fixed camera, for example), we build knowledge
models of that scene. Such a model has an important
influence on the final attention score, and it can
modify the attention coming from the bottom-up attention
mechanisms described in this paper. A first simple
implementation of this high-level model is to use attention
accumulation as a threshold [13]: only the objects whose
bottom-up attention is higher than the accumulated attention
of the model are really interesting; all the other
objects are inhibited.
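The accumulation-as-threshold idea can be sketched as follows; the leaky running-sum form of the model and the decay value are illustrative assumptions, not the exact accumulation used in [13]:

```python
import numpy as np

def update_model(model, frame_attention, decay=0.95):
    # Accumulate bottom-up attention over frames into a scene model;
    # a leaky running sum (decay value is an assumed choice).
    return decay * model + (1.0 - decay) * frame_attention

def inhibit(bottom_up, model):
    # Keep only bottom-up attention exceeding the accumulated model;
    # everything else is inhibited (set to zero).
    return np.where(bottom_up > model, bottom_up, 0.0)

# A tree that moves every frame builds up attention in the model, so
# its moderate bottom-up score is inhibited; a person appearing in a
# previously quiet area is not.
model = np.array([[0.5, 0.1]])      # accumulated: [tree area, quiet area]
bottom_up = np.array([[0.4, 0.9]])  # current frame: tree sway, new person
final = inhibit(bottom_up, model)
```

This reproduces the behavior described next for Figure 14: repetitive movers are suppressed while genuinely novel motion keeps its full score.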
Figure 14 displays an example of the results of this
implementation: a frame of the video is shown in the top-left
image, and the top-right image is the attention map of the
same frame. The bottom-left image is the model of the scene:
the trajectory of the moving person is visible, as is the
tree, which often moves because of the wind. The bottom-right
image shows the final attention map after the high-level
attention inhibition: some noisy areas (the grass, which also moves
because of the wind, and the moving person's trajectory) are inhibited.
The moving tree area has also been inhibited, and only little
attention is focused on it. The moving person
remains the only area with a high attention score.
Fig. 14: Top-left: current video frame, top-right: bottom-up attention,
bottom-left: scene model (top-down attention), bottom-right: final
attention map after high-level attention inhibition
This simple approach already showed that it can inhibit
attention areas due to noise or repetitive movements (trees
moving in the wind, flickering lights, etc.).
A more complex implementation of the high-level attention
model would segment the areas where a lot of attention has
accumulated and summarize each of them with a single motion vector.
An object moving into one of these areas would be inhibited only if
its motion vector is close to the motion vector which
summarizes that area and its bottom-up attention score is
below the model's local amplitude. If the motion vector of the
object is very different from that of the segmented attention area it
passes through, the object will not be inhibited.
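The combined motion-vector and amplitude test can be sketched as follows; the function name, the cosine-based direction comparison, and the 30-degree threshold are illustrative assumptions:

```python
import numpy as np

def is_inhibited(obj_motion, area_motion, obj_score, area_amplitude,
                 angle_thresh_deg=30.0):
    # Inhibit an object only if (a) its motion vector is close in
    # direction to the vector summarizing the attention area AND
    # (b) its bottom-up score is below the model's local amplitude.
    # The 30-degree threshold is an assumed value.
    v1 = np.asarray(obj_motion, dtype=float)
    v2 = np.asarray(area_motion, dtype=float)
    cos = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-9)
    angle = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return bool(angle < angle_thresh_deg and obj_score < area_amplitude)

# A branch swaying like the rest of the tree is inhibited...
swaying = is_inhibited([1.0, 0.0], [1.0, 0.1], 0.3, 0.5)
# ...but a person crossing the same area in another direction is not.
crossing = is_inhibited([0.0, 1.0], [1.0, 0.1], 0.3, 0.5)
```

Direction and amplitude must both match the accumulated model before an object is suppressed, so novel trajectories through "boring" areas still attract attention.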
C. Discussion and Conclusion
We developed a context-based framework working in a
non-controlled environment. It can deal with multiple
heterogeneous situations caused by environmental
disturbances and focus on relevant, rare events.
This system is more robust compared to other applications
based on video tracking, which are commonly developed for controlled
environments where stable factors such as constant
illumination and a stable background greatly facilitate human
activity monitoring. The present system can be a response to
the growing demand for human monitoring systems in many
application fields (e.g. video surveillance, ambient assisted
living for the elderly, performing arts, museum spaces, etc.).
Moreover, our system opens new research perspectives for
affective computing and the analysis of human expressivity.
Results from this study actually show that context-sensitive
features can help to better analyze gesture expressivity. The
same expressive features are weighted differently depending
on their spatial and temporal rarity. This puts in evidence salient
human motion either at the level of a single individual (rarity
of behavior over time) or at the collective level (relative rarity
of one member's behavior with respect to the others).
We plan to further investigate the potential of the rarity
index in two directions: (i) by applying it to a more
sophisticated set of expressive features (e.g. contraction,
expansion, fluidity, impulsiveness); and (ii) by analyzing how the
visual feedback computed from the rarity index affects the subjects'
behavior (e.g. whether it fosters expressive behavior).
ACKNOWLEDGMENTS
This work has been achieved in the framework of the eNTERFACE
2008 Workshop at the LIMSI Lab of Orsay University (France).
It was also included in the Numédiart excellence center
(www.numediart.org) project 3.1 funded by the Walloon Region,
Belgium. The authors thank Johan DECHRISTOPHORIS, who
helped with the IR sensor hardware set-up. Finally, this work
has been partially supported by the Walloon Region through the
projects BIRADAR, ECLIPSE, and DREAMS, and by the EU-IST Project
SAME (Sound And Music for Everyone Everyday Everywhere Every
way).
REFERENCES
[1] Hongeng, S. and Nevatia, R., "Multi-agent event recognition", in ICCV, 2001, pp. II: 84-91
[2] eNTERFACE 2008 Workshop website: http://enterface08.limsi.fr/
[3] Camurri, A., Coletta, P., Varni, G., and Ghisio, S., "Developing multimodal interactive systems with EyesWeb XMI", Proceedings of the 2007 Conference on New Interfaces for Musical Expression (NIME07), pp. 305-308, New York, USA, 2007
[4] OpenCV static wiki: http://opencvlibrary.sourceforge.net/wiki-static
[5] Dhavale, N., and Itti, L., "Saliency-based multi-foveated MPEG compression", IEEE Seventh International Symposium on Signal Processing and its Applications, 2003
[6] Yee, H., and Pattanaik, S., "Spatiotemporal sensitivity and visual attention for efficient rendering of dynamic environments", IACM, 2001
[7] Parkhurst, D.J., Niebur, E., "Texture contrast attracts overt visual attention in natural scenes", European Journal of Neuroscience, 19:783-789, 2004
[8] Itti, L., Baldi, P., "Bayesian Surprise Attracts Human Attention", Advances in Neural Information Processing Systems, Vol. 19 (NIPS 2005), pp. 1-8, Cambridge, MA: MIT Press, 2006
[9] Le Meur, O., Le Callet, P., Barba, D., and Thoreau, D., "A Coherent Computational Approach to Model Bottom-Up Visual Attention", PAMI(28), No. 5, pp. 802-817, 2006
[10] Liu, F., and Gleicher, M., "Video Retargeting: Automating Pan-and-Scan", ACM Multimedia, 2006
[11] Zhang, V., and Stentiford, F.W.M., "Motion detection using a model of visual attention", IEEE ICIP, 2007
[12] Boiman, O., Irani, M., "Detecting Irregularities in Images and in Video", International Conference on Computer Vision (ICCV), 2005
[13] Mancas, M., "Computational Attention: Towards Attentive Computers", Similar edition, CIACO University Distributors, ISBN: 978-2-87463-099-6, 2007
[14] Mancas, M., Gosselin, B., Macq, B., "A Three-Level Computational Attention Model", Proc. of the ICVS Workshop on Computational Attention & Applications, Germany, 2007
[15] Hubel, D.H., "Eye, Brain and Vision", New York: Scientific American Library, No. 22, 1989
[16] Mancas, M., "Image perception: Relative influence of bottom-up and top-down attention", Proc. of the WAPCV Workshop of the ICVS Conference, Santorini, Greece, 2008
[17] Velastin, S., Boghossian, B., Lo, B., Sun, J., Vicencio-Silva, M., "Prismatica: toward ambient intelligence in public transport environments", IEEE Trans. Syst. Man Cybern. Part A 35(1), 164-182, 2005
[18] Camurri, A., De Poli, G., Leman, M., and Volpe, G., "Toward Communicating Expressiveness and Affect in Multimodal Interactive Systems for Performing Art and Cultural Applications", IEEE Multimedia, 12(1), 43-53, 2005
[19] Glowinski, D., Camurri, A., Coletta, P., Bracco, F., Chiorri, C., Atkinson, A., "An investigation of the minimal visual cues required to recognize emotions from human upper-body movements", in press, Proceedings of the ACM 2008 International Conference on Multimodal Interfaces (ICMI), Workshop on Affective Interaction in Natural Environments (AFFINE), Crete, 2008
Matei Mancas.
Matei Mancas was born in Bucharest in 1978. He holds an
ESIGETEL (Ecole Supérieure d'Ingénieurs en informatique
et TELecommunications, France) Audiovisual Systems and
Networks engineering degree, and an Orsay University
(France) MSc degree in Information Processing. He also
holds a PhD in applied sciences from the FPMs (Engineering
Faculty of Mons, Belgium) on computational attention, obtained in
2007.
His past research interests lie in signal and, in particular, image processing.
After a study on nonstationary shock signals in industrial tests at MBDA
(EADS group), he worked on medical image segmentation. He is now a
Senior Researcher within the Information Processing research center of the
Engineering Faculty of Mons, Belgium. His major research field concerns
computational attention and its applications.
Donald Glowinski.
Donald Glowinski was born in Paris on 27-02-1977. He is pursuing a PhD in
computer engineering at InfoMus Lab – Casa Paganini, in
Genoa, Italy (supervisor: Prof. Antonio Camurri). His background
covers scientific and humanistic academic studies as well as
high-level musical training: an MSc in Cognitive Science from EHESS
(Ecole des Hautes Etudes en Sciences Sociales), an MSc in Music and
Acoustics from CNSMDP (Conservatoire National Supérieur de Musique
et de Danse de Paris), and an MSc in Philosophy from
Sorbonne-Paris IV.
He was chairman of the Club NIME 2008 (New Interfaces for Musical
Expression), Genoa, 2008. His research interests include multimodal and
affective human-machine interactions. He works in particular on the modeling
of automatic gesture-based recognition of emotions.
Pierre Bretéché.
Pierre Bretéché was born in Ivry-sur-Seine in 1981. He received an MSc
degree in Computer Science in 2006 from the University of Rouen
(France). He is pursuing a PhD in Information and
Communication Sciences at the Laseldi Lab – University of
Franche-Comté in Montbéliard (France). He previously
worked in AI, using a massive multi-agent system to
build a semantic picture of an environment.
He is now part of the Organica research project. His research interest has moved
to studying and designing new technology applications for public, cultural and
artistic purposes.
Jonathan Demeyer.
Jonathan Demeyer received an MSc in electrical engineering
from the Université Catholique de Louvain, Louvain-la-Neuve,
Belgium in 2005 and an MSc in applied sciences from the
FPMs (Faculté Polytechnique de Mons, Belgium) in 2008.
He previously worked in the TCTS Lab (Faculté
Polytechnique de Mons, Belgium), developing a mobile
reading assistant for visually impaired people.
He is now working on a project on automatic processing of high speed
videoglottography for physicians. His main interests are medical image
processing and computer vision.
Thierry Ravet.
Thierry Ravet was born in Brussels, Belgium on 31st
August 1976. He received an Electrical Engineering
degree in 1999 from the ULB (Université Libre de Bruxelles).
Afterwards, he worked as a researcher in the Electronics –
Microelectronics – Telecommunications department of the ULB
for 4 years on medical instrumentation projects.
Since February 2008, he has been with the TCTS Lab (Faculté Polytechnique de
Mons). His main experience is in non-invasive instrumentation,
microprocessor system development, artefact filtering and data fusion in the
field of cardiorespiratory monitoring and polysomnographic research.
Gualtiero Volpe.
Gualtiero Volpe was born in Genova on 24-03-1974. He holds a PhD and a
computer engineering degree, and is an assistant professor at the University of Genova.
His research interests include intelligent and affective
human-machine interaction, modeling and real-time analysis
and synthesis of expressive content in music and dance,
multimodal interactive systems.
He was Chairman of V Intl Gesture Workshop and Guest Editor of a special
issue of Journal of New Music Research on “Expressive Gesture in
Performing Arts and New Media” in 2005. He was co-chair of NIME 2008
(New Interfaces for Musical Expression), Genova, 2008. Dr. Volpe is member
of the Board of Directors of AIMI (Italian Association for Musical
Informatics).
Antonio Camurri.
Antonio Camurri (born in Genova, 1959; 1984 Master's
Degree in Electrical Engineering; 1991 PhD in Computer
Engineering) is an Associate Professor at DIST-University of
Genova (Faculty of Engineering), where he teaches
"Software Engineering" and "Multimedia Systems". His
research interests include multimodal intelligent interfaces,
non-verbal emotional and expressive communication,
kansei information processing, multimedia interactive
systems, and musical informatics.
He was President of AIMI (Italian Association for Musical Informatics), is a
member of the Executive Committee (ExCom) of the IEEE CS Technical
Committee on Computer Generated Music, Associate Editor of the
international Journal of New Music Research, and a main contributor to the EU
Roadmap on "Sound and Music Computing" (2007). He is responsible for EU
IST Projects at the DIST InfoMus Lab of the University of Genova. He is the author of
more than 80 international scientific publications. He is founder and scientific
director of the InfoMus Lab at DIST-University of Genova
(www.infomus.org). Since 2005 he has been scientific director of the Casa Paganini
centre of excellence at the University of Genoa, conducting research in ICT integrating
artistic research, music, performing arts, and museum productions
(www.casapaganini.org).
Paolo Coletta.
Paolo Coletta was born in Savona, Italy, in 1972. He received
the “Laurea” degree in computer science engineering in
1997 and the Ph.D. degree in electronic engineering and
computer science in 2001, both from the University of
Genoa, Genoa, Italy.
From January 2002 to December 2002, he was a Research
Associate at the Naval Automation Institute, CNR-IAN
(now ISSIA) National Research Council, in Genoa.
In 2004 he was an adjunct Professor of "Software Engineering and Programming
Languages" at DIST (Dipartimento di Informatica, Sistemistica e Telematica),
University of Genoa. Since 2004 he has been a collaborator at DIST, University of
Genoa.