INTERSPEECH 2007
Computer-Supported Human-Human Multilingual Communication
Alex Waibel†‡
with: Keni Bernardin and Matthias Wölfel
InterACT – International Center for Advanced Communication Technology
† Universität Karlsruhe (TH), Karlsruhe, Germany
‡ Carnegie Mellon University, Pittsburgh, PA, USA
ahw@cs.cmu.edu
Abstract
Computers have become an essential part of modern life, providing services in a multiplicity of ways. Access to these services, however, comes at a price: human attention is bound and directed toward a technical artifact in a human-machine interaction setting, at the expense of time and attention for other humans. This paper explores a new class of computer services that support human-human interaction and communication implicitly and transparently. Putting Computers in the Human Interaction Loop (CHIL) requires consideration of all communication modalities, multimodal integration, and more robust performance. We review the technologies and several CHIL services providing human-human support. Among them, we specifically highlight advanced computer services for cross-lingual communication.
Index Terms: speech-to-speech translation, machine translation, simultaneous translation, domain independence, multimodal interaction, perceptual user interfaces, language portability
1. Introduction
It is a common experience in our modern world for humans to be overwhelmed by the complexities of technological artifacts around us, and by the attention they demand. While technology provides wonderful support and helpful assistance, it also gives rise to an increased preoccupation with technology itself and a related fragmentation of attention. But as humans, we would rather attend to a meaningful dialog and interaction with other humans than control the operations of the machines that serve us. Such complexity and distraction, however, are a natural consequence of the flexibility and the choice of functions and features that the technology has to offer. Thus flexibility of choice and the availability of desirable functions are in conflict with ease of use and with our very ability to enjoy their benefits. The artifact cannot yet perform autonomously and requires precise specification of the machine's behavior. Standardization, better graphical user interfaces, and multimodal human-machine dialog systems with speech, pointing and mousing have all contributed to improving the interface, but they still force the user to interact with a machine to the detriment of other human-human interaction.
To overcome the limitations of present-day technology, machines must engage implicitly and indirectly in a world of humans; that is, we must put Computers in the Human Interaction Loop (CHIL), rather than the other way round. Computers should assist humans engaged in human-human interaction by providing implicit and proactive support. If technology could be "CHIL enabled" in this way, a host of new services could become possible. Could two people be connected with each other at the best moment over the most convenient and best media, without phone tag, embarrassing ring tones and interruptions? Could an attendee in a meeting be reminded of participants' names and affiliations at the right moment, without messing with a contact directory? Can meetings be supported, moderated and coached without technology getting in the way? And: could computers enable speakers of different languages to communicate and listen to each other gracefully across the language divide?
Human assistants often provide such services; they work out logistical support, reminders, helpful assistance and language mediation, and they can do it reliably, transparently, tactfully, sensitively and diplomatically. Why not machines? Clearly, a lack of recognition and understanding of human activities, needs and desires is to blame, as is an absence of socially adept computing services that mediate rather than intrude. In the following we focus on these two elements: 1.) technologies to track and understand the human context, and 2.) computing services that mediate and support human-human interaction.
2. Understanding the Human Context
In contrast to classical human-machine interfaces, implicit
computer support for human-human interaction requires a
perceptual user interface with much greater performance,
flexibility and robustness than is available today. This challenge has led to research aimed at tracking all the communication modalities in realistic recording conditions, rather than individual modalities in idealized recording conditions. CHIL and AMI, both Integrated Projects under the 6th Framework Program of the European Commission, as well as CALO, a DARPA program, are among the more recent efforts aiming to take on this challenge.
In the following we will discuss computer services that
support human-human interaction. To realize this goal, work
concentrates on four key areas: The creation of robust
perceptual technologies able to acquire rich and detailed
knowledge about the human interaction context; the
collection and annotation of realistic, audio-visual meeting
and seminar data necessary for the development and
systematic evaluation of such technologies; the definition of a common
software architecture to support reusability and
exchangeability of services and technology modules; the
implementation of a number of prototypical services offering
proactive, implicit assistance based on the gained awareness
about human interactions.
2.1. Audio-visual Perceptual Technologies
2.1.1. Introduction
Multimodal interface technologies “observe” humans and
their environments by recruiting signals from multiple AV
sensors to detect, track, and recognize human activity. The
analysis of all AV signals in the environment (speech, signs,
faces, bodies, gestures, attitudes, objects, events, and
situations) provides the proper answers to the basic questions
of "who", "what", "where", and "when" that can drive
higher-level cognition concerning the “how” and “why”, thus
allowing computers to engage and interact with humans in a
human-like manner using the appropriate communication
medium (see Figure 1).
Figure 1: The "who", "what", "where", "when", "how" and "why" of human interaction.
Figure 2: Audio-visual tracking of multiple persons. Targets are described by an appearance model comprising shape and color information, and tracked in 3D using probabilistic representations [4]. The system tracks five people in real time through multiple persistent occlusions in cluttered environments.
Research work performed and progress made on a number of
such technologies are described next. Whereas technological
advances for multimodal systems were hard to measure in the
past for lack of common benchmarks, recent efforts in the
community have led to the creation of international
evaluations such as the CLEAR (Classification of Events,
Activities and Relationships) [1] and RT (Rich Transcription)
[2] evaluations, which offer a platform for large-scale,
systematic and objective performance measurements on large
audio-visual databases.
2.1.2. Person Tracking
Locating and tracking multiple persons who behave without constraints, unaware of audio/video sensors, in natural, evolving and unconstrained scenarios still poses significant challenges.
Video-based approaches based on background subtraction are
error prone due to varying illumination, shadows and
occlusion, whereas those relying on the feature space (e.g.
color histograms) are difficult to initialize reliably for every
newly acquired target. Many approaches that offer higher
reliability are simply too computationally expensive to be
used in online applications.
Audio-based localization and tracking requires the tracked
person to be actively speaking, and has to deal with the
variety of acoustic conditions (e.g., room acoustics and
reverberation) and, in particular, the undefined number of
simultaneous active noise sources and competing speakers
found in natural scenarios.
Several strategies are being applied to face the challenges
mentioned above. Distributed camera and microphone
networks, including microphone arrays placed in different positions in space, provide a better "coverage" of each area of interest. Fusion of sensor data in multi-view approaches overcomes occlusion problems, as in the case of 3D background subtraction techniques combined with shape-from-silhouette [3]. Probabilistic approaches that compute the product of single-view likelihoods using generative models which explicitly model occlusion have proved efficient in managing the trade-off between reliable modeling and computational efficiency [4] (see also Figure 2). Fusion of multimodal data for speaker localization, e.g. in particle filtering approaches, increases robustness for speaker tracking [5]. Efficient tracking is a useful building block for all subsequent technologies. It has been shown, for example, that multimodal fusion helps increase localization accuracy, and that this in turn has a direct impact on the performance of far-field speech recognition [6,7] (see also Figure 3).
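To make the fusion idea concrete, the following minimal sketch (in Python, with hypothetical audio_lik and video_lik callables standing in for real acoustic and visual observation models) shows one particle-filter update step in which the two modality likelihoods are multiplied to weight 2D position hypotheses; it is a schematic illustration only, not the CHIL tracker of [4,5].

import numpy as np

def particle_filter_step(particles, weights, audio_lik, video_lik, motion_std=0.1):
    # particles: (N, 2) position hypotheses; weights: (N,) normalized weights.
    # audio_lik / video_lik: callables mapping an (N, 2) array to per-particle
    # likelihoods (hypothetical stand-ins for e.g. SRP-PHAT and appearance models).
    # Prediction: random-walk motion model.
    particles = particles + np.random.normal(0.0, motion_std, particles.shape)
    # Update: fuse the modalities by multiplying their likelihoods.
    weights = weights * audio_lik(particles) * video_lik(particles)
    weights = weights / (weights.sum() + 1e-12)
    # Resample when the effective sample size collapses.
    if 1.0 / np.sum(weights ** 2) < 0.5 * len(weights):
        idx = np.random.choice(len(weights), size=len(weights), p=weights)
        particles = particles[idx]
        weights = np.full(len(weights), 1.0 / len(weights))
    estimate = np.average(particles, axis=0, weights=weights)
    return particles, weights, estimate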
Figure 3: Acoustic, visual and multimodal 3D person
tracking accuracies and resulting word error rate
(after beamforming) on the CHIL 2005 dataset.
2.1.3. Person Identification
The challenges for audio-visual person identification (ID) in unconstrained natural scenarios are due to far-field, wide-angle, low-resolution sensors, acoustic noise, speech overlap and visual occlusion, unpredictable subject motion, and the lack of position/orientation assumptions to guarantee well-posed signals. Clearly, employing tracking technologies and fusion techniques, whether temporal, multi-sensor or multimodal (speaker ID combined with face ID, for example), is a viable approach to improve robustness.
Identification performance depends on the enabling
technologies used for audio, video and their fusion, but also
on the accuracy of the extraction of the useful portions from
the audio and video streams. The detection process for audio involves finding and extracting the speech segments in the audio stream. The corresponding process for video involves face detection. Mono- and multi-modal ID systems developed within CHIL have been successfully evaluated in the CLEAR'06 and '07 evaluations [1], reaching in many cases near 100% accuracy on databases of more than 25 subjects. Not only was steady progress made on the key technologies over the past years, showing the feasibility of person ID in unconstrained environments; it was also demonstrated that sensor and multimodality fusion help to improve recognition robustness (see Figure 4).
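One simple way to picture such fusion, assuming only that each modality yields a per-identity score (the evaluated CHIL systems are considerably more elaborate), is a weighted late fusion of normalized speaker-ID and face-ID scores:

import numpy as np

def fuse_identity_scores(audio_scores, video_scores, w_audio=0.5):
    # audio_scores, video_scores: dicts mapping identity -> raw classifier score.
    ids = sorted(audio_scores)
    a = np.array([audio_scores[i] for i in ids], dtype=float)
    v = np.array([video_scores[i] for i in ids], dtype=float)
    # Min-max normalize each modality so the scores become comparable.
    norm = lambda x: (x - x.min()) / (x.max() - x.min() + 1e-12)
    fused = w_audio * norm(a) + (1.0 - w_audio) * norm(v)
    best = ids[int(np.argmax(fused))]
    return best, dict(zip(ids, fused))

# Example: an ambiguous face match is disambiguated by the audio channel.
best, scores = fuse_identity_scores({"anna": 0.9, "bob": 0.4}, {"anna": 0.52, "bob": 0.50})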
Figure 4: Acoustic, visual and multimodal identification results for the CLEAR 2006 and 2007 evaluations (only best results shown). Systems were trained on 15-second sequences and tested on 1, 5, 10 and 20-second test sequences. Shown are accuracies (recognition rate in %, as a function of test duration) for 25 users from 5 sites.
Figure 5: Estimating head pose and focus of attention [9]. Head orientations are estimated from four camera views. These are then mapped to likely focus-of-attention targets, such as room occupants.
2.1.4. Head Pose, Focus of Attention
Understanding human interaction requires not only perceiving the state of individuals, but also determining their
person or object of interest, the addressees of speech, and so
forth. Since people’s head orientations are known to be
reliable indicators for their direction of attention [8], systems
were developed to estimate the head orientations of people in
a smart room using multiple fixed cameras (see also Figure
5). In the CLEAR 2006 head pose dry run evaluation, the first
formal evaluation for a task of this kind, classification of pan
angles into 45° classes was attempted and accuracies of
44.8% were reached [1]. The challenging CHIL database
drove the development of more accurate systems and already
in 2007, estimation of exact angles was performed and error
rates as low as 7° pan, 9° tilt and 4° roll could be achieved.
Once head orientations are estimated, they can be used to
automatically determine the foci of attention of people [9].
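A minimal sketch of that mapping, assuming an estimated pan angle and known directions of candidate targets relative to the subject (a simplification of the probabilistic model in [9]), could look like this:

def focus_of_attention(head_pan_deg, targets):
    # targets: dict mapping target name -> horizontal direction (degrees) from the subject.
    def angular_diff(a, b):
        return abs((a - b + 180.0) % 360.0 - 180.0)
    # Pick the target whose direction is angularly closest to the estimated head pan.
    return min(targets, key=lambda t: angular_diff(head_pan_deg, targets[t]))

# Example: a listener turned roughly 40 degrees to the right.
print(focus_of_attention(40.0, {"speaker": 35.0, "screen": -60.0, "door": 150.0}))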
2.1.5. Activity Analysis, Situation Modeling
Another useful type of information for unobtrusive, context-aware services is the classification of a user's or a group's current activities. In experiments performed at one of the CHIL sites, typical office activities such as "paperwork", "meeting" or "phone call" were distinguished in a multiple-office setup using only one camera and one microphone per room [10]. A hierarchical classification, ranging from low-level isolated events such as desk activity to complex activities such as leaving a room and entering another, could be achieved. The event classes were learned by clustering audio-visual data recorded during normal office hours over extended periods of time. Figure 6 depicts an example of data-driven clustering of activity regions within an office.
Figure 6: Data-driven training of activity regions in an office room [10]. The regions labeled a), b) and c) represent the learned areas of activity of office workers and their visitors, whereas d) depicts all resulting clusters. Evaluation on an unconstrained one-week recording session revealed accuracies of 98% for "nobody in office", 86% for "paperwork", 70% for "phone call" and 60% for "meeting".
2.1.6. Speech Activity Detection, Speaker Diarization
These two related technologies are relevant not only for Automatic Speech Recognition (ASR), but also for speech detection and localization and for speaker identification. Speech activity detection (SAD) addresses the "when" of the speech interaction, and speaker diarization addresses both "who" and "when". Both have been evaluated on the CHIL interactive seminar database in the latest CLEAR and RT evaluations, using primarily far-field microphones.
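For orientation, the simplest conceivable form of speech activity detection thresholds short-term log-energy against an estimated noise floor; the sketch below (with assumed frame parameters) is far cruder than the far-field SAD systems evaluated in CLEAR and RT, but it shows the basic idea.

import numpy as np

def detect_speech(samples, sample_rate=16000, frame_ms=25, hop_ms=10, margin_db=12.0):
    # Mark frames as speech when their log-energy exceeds the noise floor by a margin.
    # Returns a list of (start_sec, end_sec) segments.
    samples = np.asarray(samples, dtype=float)
    frame = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    frames = [samples[i:i + frame] for i in range(0, len(samples) - frame, hop)]
    energy_db = np.array([10 * np.log10(np.mean(f ** 2) + 1e-10) for f in frames])
    threshold = np.percentile(energy_db, 10) + margin_db   # crude noise-floor estimate
    active = energy_db > threshold
    segments, start = [], None
    for i, is_speech in enumerate(active):
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            segments.append((start * hop / sample_rate, i * hop / sample_rate))
            start = None
    if start is not None:
        segments.append((start * hop / sample_rate, len(active) * hop / sample_rate))
    return segments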
2.1.7. Recognition of Speech and Acoustic Events
Speech is the most critical human communication modality in
seminar and meeting scenarios, and its automatic transcription
is of paramount importance to real-time support and off-line
indexing of the observed interaction. Although automatic
speech recognition (ASR) technology has matured over time,
natural unconstrained scenarios present significant challenges
to state-of-the-art systems. For example, spontaneous and
realistic interaction, with often accented speech and
specialized topics of discussion (e.g., technical seminars), as
well as overlapping speech, interfering acoustic events, and
room reverberation significantly degrade ASR performance. These factors are further exacerbated by the use
of far-field acoustic sensors, which is unavoidable in order to
free humans from tethered and obtrusive close-talking
microphones.
Various research sites have been developing ASR systems to
address these challenges, and have benchmarked their
performance, e.g. in the recent RT’06 and ‘07 evaluations.
There, the best far-field ASR system achieved a word error
rate (WER) of 44% (52% in 2006), by combining signals
from multiple (up to four) table-top microphones. It is
interesting to note that this is considerably higher than the
31% (also 31% in 2006) WER achieved on close-talking
microphone input – with manual segmentation employed to
remove unwanted cross-talk. These results demonstrate the
extremely challenging nature of the task at hand.
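For reference, the WER figures quoted throughout are obtained by Levenshtein alignment of the hypothesis against a reference transcript; a compact version of that computation (a standard formulation, not the NIST scoring tool itself) is:

def word_error_rate(reference, hypothesis):
    # WER = (substitutions + deletions + insertions) / number of reference words.
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: minimal edits turning the first i reference words into the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("we meet in the seminar room", "we met in seminar room"))  # 0.333...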
Various research approaches are currently being investigated to improve far-field ASR. Some employ multi-sensory acoustic input, for example beamforming, which aims to efficiently combine the acoustic signals from microphone arrays [6], and speech source separation techniques that attempt to improve performance during speech-overlap segments. A different, multimodal approach is to recruit visual speech information from the speaker's lips, captured from properly managed pan-tilt-zoom cameras, in order to improve recognition through audio-visual ASR (AV-ASR).
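As a simplified illustration of the first idea (real systems use filter-and-sum beamforming with adaptive weights and fractional delays), a plain delay-and-sum beamformer time-aligns the microphone channels toward an assumed speaker position and averages them:

import numpy as np

def delay_and_sum(signals, steering_delays_sec, sample_rate=16000):
    # signals: list of equal-length 1-D arrays, one per microphone.
    # steering_delays_sec: per-channel delays toward the assumed speaker position.
    out = np.zeros(len(signals[0]), dtype=float)
    for sig, delay in zip(signals, steering_delays_sec):
        shift = int(round(delay * sample_rate))
        # Advance (or retard) each channel by its steering delay, zero-padding the edge.
        aligned = np.roll(np.asarray(sig, dtype=float), -shift)
        if shift > 0:
            aligned[-shift:] = 0.0
        elif shift < 0:
            aligned[:-shift] = 0.0
        out += aligned
    return out / len(signals)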
Finally, one should note that speech is only one of the
acoustic events occurring during human interaction scenarios.
Technology is being developed to detect and classify acoustic
events that are informative of human activity, i.e., clapping,
keyboard typing, door closing, etc. [1].
2.2. Technology Evaluations, Data Collection & Software Architecture
To drive rapid progress of the presented audio-visual
perceptual technologies, their systematic evaluation using
large realistic databases and common task definitions and
metrics is essential.
Technology evaluations, undertaken on a regular basis, are
necessary so that improvements can be measured objectively
and different approaches compared. An important aspect is to
use real-life data covering the envisioned application
scenarios. In CHIL, large numbers of seminars and meetings
were collected in five different smart rooms, equipped with a
range of cameras and microphones. The recordings were
manually enriched with acoustic event and speech
transcriptions as well as several visual annotations that
made it possible to train and evaluate various technology components
(see for example [1] for further details). In contrast to many
of the evaluation benchmarks that exist for individual
technologies such as face recognition, for example, the data
from such realistic scenarios is extremely challenging,
containing a combination of many difficulties for perceptual
technologies, such as varying illumination, viewing angles, head orientations, low-resolution images, occlusion, moving people, varying speaking accents, behaviours, room layouts and technical sensor setups.
Starting in 2006, a large effort was undertaken to create an international forum for the evaluation of multimodal technologies for the analysis of human activities and interactions. The CLEAR workshop was created in a joint effort between CHIL [11], the US National Institute of Standards and Technology (NIST) and the US Video Analysis Content Extraction (VACE) [12] program. The goal was to provide the discussion forums, databases, standards and benchmarks necessary to drive the development of multimodal perceptual technologies, much like the NIST Rich Transcription Meeting Recognition (RT) workshop for diarization, speech detection and recognition, or the TRECVID [13], PETS [14] and ETISEO [15] programs for visual analysis and surveillance. More than a dozen evaluation tasks were conducted, including face and head tracking, multimodal 3D person tracking, multimodal identification, head pose estimation, acoustic scene analysis, acoustic event detection, etc.
To support the integration of the developed technological components, to realize higher-level fusion of information and modeling of interaction situations, and to provide well-defined interfaces for the design of useful user services, a proper architectural framework is of great importance. An example of such an infrastructure is the CHIL Architecture [16].
2.3. Human-Human Computer Support Services
Building on the perceptual technologies and compliant with the software architecture, several prototypical services are being developed that instantiate the vision of context-awareness and proactiveness for supporting human-human interaction. The target domains are lectures and small office meetings. In the following, some example services relying on the robust perception of human activities and interaction contexts are presented.
2.3.1. The Meeting Browser
The Meeting Browser provides functionality for offline reviewing of recorded meetings, automatic analysis, intelligent summarization or data reduction, generation of minutes, topic segmentation, information querying and retrieval, etc. Although it has been a topic of research for quite some time [17,18], advances in perceptual technologies (such as face detection, speaker separation and far-field speech recognition) have increased its user-friendliness by reducing the constraints on the interaction participants and the need for controlled or scripted scenarios.
2.3.2. The Collaborative Workspace
The Collaborative Workspace (CW) [19] is an infrastructure
for fostering cooperation among participants. The system
provides a multimodal interface for entering and manipulating
contributions from different participants, e.g., by supporting
joint discussion of minutes or joint accomplishment of a
common task, with people proposing their ideas, and making
them available on the shared workspace, where they are
discussed by the whole group.
2.3.3. The Connector
The Connector is an adaptive and context-aware service
designed for both efficient and socially appropriate
communication [20]. It maintains an awareness of users'
activities, preoccupations, and social relationships to mediate
a proper moment and medium of connection between them.
2.3.4. The Memory Jog
The Memory Jog (MJ) provides background information and
memory assistance to its users. It offers "now and here"
information by exploiting either external databases (Who is this person? Where is he/she from?) or its own (Who was there that day? What did he say?), the latter including
information gained from the observation of the interaction
context [21]. The MJ can exploit its context-awareness to
proactively provide information at the proper time and in the
most convenient way given the current situation.
2.3.5. Cross-Lingual Communication Services
Another exciting class of services concerns cross-lingual human-human communication. Is it possible to communicate with a fellow human speaking a different language as naturally as if he/she spoke your own? Clearly this would be a worthwhile vision in a globalizing world, where international integration demands unhindered communication, while national identity and pride demand recognition of and respect for the cultural and linguistic diversity of this planet.
How could technology be devised to make this possible? We
devote the following section to a discussion of this potentially
revolutionary class of human communication support and an
area of growing speech, language and interface research.
3. Cross-Lingual Human-Human Communication Services
In the past decade, Speech Translation has grown from an
oddity at the fringe of speech and language processing
conferences, to one of the main pillars of current research
activity. The explosion of interest is driven in part by
considerable market pull from an increasingly globalizing
world, where distance is no longer measured in miles but in
communication ease and cost. Indeed, effective solutions that
overcome the linguistic divide may potentially offer
considerable practical and economic benefits. For the
research community, the linguistic divide may ultimately
prove to be a more formidable challenge than the digital
divide as it presents researchers with a number of fascinating
new problems. The goal is, of course, good human-to-human
communication without interference from technical artifacts,
and effective solutions must combine efficient and reliable
speech & language processing with effective human factors
and interface design.
Early developments provided first prototypes demonstrating
the concept and feasibility [22,23]. In the mid-'90s, a number of projects aiming at two-way translators for spontaneous speech in limited domains (e.g. JANUS-III, Verbmobil,
Nespole) followed suit. The Consortium for Speech
Translation Advanced Research (C-STAR) was founded in
‘91 to promote international cooperation in speech translation
research. With the turn of the millennium, activity has
proceeded in two directions: The first continues to improve
domain-limited two-way translation toward fieldable, robust
deployment where domain limitation is acceptable
(humanitarian, health-care, tourism, government, etc.). The
second has begun to tackle the open challenge of removing the domain limitation for applications such as broadcast news, speeches
and lectures. Large new initiatives (NSF-STR-DUST, EC-IP
TC-STAR and DARPA GALE) were launched in the US and Europe in '03, '04, and '06, respectively, in response. In the following we review these advances.
3.1. Domain-Limited Portable Speech Translators
Fieldable speech-to-speech translation systems are currently developed around portable platforms (laptops, PDAs), which impose constraints on the ASR, SMT, and TTS components. For PDAs, memory limitations and the lack of a floating-point unit require substantial redesign of algorithms and data structures; a PDA implementation may thus incur a WER increase from 8.8% to 14.6% [24] relative to a laptop. In addition to continued attention to speed and to recognition, translation and synthesis performance, however, usability issues such as the user interface, microphone type, placement and number, as well as user training and field maintenance must be considered. One of the resulting speech-to-speech graphical user interfaces (GUI) of a PDA pocket translator is shown in Figure 7.
Figure 7: A PDA pocket translator [English-Thai] (courtesy of Mobile Technologies, LLC, Pittsburgh).
The GUI window is divided into two regions, showing the
language pairs. These regions can be populated by recognized
speech output (ASR), translation output (SMT), or by a
virtual PDA keyboard for backup. A back-translation is
provided for verification; a push-to-talk button activates the
device and aborts processing for false starts and errors.
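The control flow behind such an interface can be pictured roughly as follows; the recognize, translate, synthesize and back_translate callables are placeholders for the actual ASR, SMT and TTS engines rather than a real API, and the back-translation is only a verification aid, as described above.

def translate_utterance(audio, recognize, translate, synthesize, back_translate=None):
    # One push-to-talk turn of a two-way pocket translator (schematic only).
    source_text = recognize(audio)            # ASR: audio -> source-language text
    if not source_text.strip():
        return None                           # false start: nothing recognized, abort
    target_text = translate(source_text)      # SMT: source text -> target-language text
    result = {"source": source_text, "target": target_text}
    if back_translate is not None:
        # Show a back-translation so the user can check that the meaning survived.
        result["verification"] = back_translate(target_text)
    synthesize(target_text)                   # TTS: speak the translation aloud
    return result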
Projects (e.g. DARPA TRANSTAC) and workshops (e.g. IWSLT,
sponsored by C-STAR) provide for collaboration, data
exchange and benchmarking that improve performance and
coverage in this space.
3.2. Translation of Parliamentary Speeches and Broadcast News
For speech-translation without domain limitation, component
technologies first had to be developed that deliver acceptable
ASR, SLT (and TTS) performance in the face of spontaneous
speech, unlimited vocabularies, broad topics, and speaking
style characteristic of spoken records. In TC-STAR, speeches
from the European Parliament (and their manual
transcriptions and translations) were used as data to train and
evaluate. Figure 8 shows the improvements over the years in
speech recognition and automatic translation within the
project. In these experiments, an almost linear correlation between WER and machine translation quality has been observed. We also found that a WER of around 30% degrades machine translation quality significantly, while a WER of 10% provides for reasonable translation quality
compared to reference transcriptions. The goal of a different
ambitious speech translation project, GALE (Global
Autonomous Language Exploitation) [25], is to provide
relevant information in English, where the input comes from
huge amounts of speech in multiple languages (a particular
focus is on broadcast news in Arabic and Chinese). However,
progress is not measured by WER and BLEU, but by how fast a particular goal can be reached.
Figure 9 compares human and computer speech-to-speech
translations on five different aspects by human judgment: was
the message understandable (understanding), was the output
text fluent (fluent speech), how much effort does it take to
listen to the translation (effort) and what is the overall quality,
where the scale ranges from 1 (very bad) to 5 (very good).
The fifth result shows the percent accuracy by which
questions of content could be answered by human subjects
based on the output from human and machine translators. It
can be seen that automatic translation quality still lags behind
human translation, but reaches usable and understandable
levels already close to human translations. It is interesting to
note, that the human translations also fall short of perfection
due to the fact that humans translators occasionally omit
information.
Figure 10: BLEU scores show good correlation with
human judgements (fluency & accuracy) for English
to Spanish translations. (source [27])
An important aspect of all automatic evaluations is the availability of good metrics that can be computed automatically and repeatedly.
While WER is an established method to measure accuracy of
automatic speech transcriptions, automatic MT metrics have
only recently been proposed. Figure 10 shows the BLEU
score (one of several popular MT scoring metrics) and its
good correlation with human judgements (adequacy, fluency)
on the European Parliament data.
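For orientation, a bare-bones version of the BLEU idea, i.e. clipped n-gram precision combined with a brevity penalty, is sketched below for a single sentence and a single reference; the metric actually used in these evaluations is the corpus-level statistic, so this is only illustrative.

import math
from collections import Counter

def simple_bleu(candidate, reference, max_n=4):
    # candidate, reference: token lists; returns a sentence-level BLEU-style score.
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        matched = sum(min(count, ref[gram]) for gram, count in cand.items())  # clipped counts
        total = max(sum(cand.values()), 1)
        log_precision_sum += math.log(max(matched, 1e-9) / total)
    # Brevity penalty discourages overly short output.
    brevity = min(1.0, math.exp(1.0 - len(reference) / max(len(candidate), 1)))
    return brevity * math.exp(log_precision_sum / max_n)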
Figure 8: Improvements in speech translation (BLEU) and automatic speech recognition (WER) over the years on English EPPS and translation into Spanish. (source [26,27])
Figure 9: Human vs. automatic translation performance. (source [28])
3.3. Unlimited Domain Simultaneous Translation
The ultimate cross-lingual communication tool would be a simultaneous translator that produces real-time translation of spontaneous lectures and presentations. Compared to parliamentary speeches and broadcast news, lectures, seminars and presentations of any kind present further problems for domain-unlimited speech translation:
• the spontaneity of free speech, with the disfluencies and ill-formed nature of spontaneous natural discourse;
• specialized vocabularies, topics, acronyms, named entities and expressions in typical lectures and presentations (by definition specialized content);
• real-time and low-latency requirements and on-line adaptation to achieve simultaneous translation; and
• the selection of translatable chunks or segments.
3.3.1. The Lecture Translator
To address these problems in the ASR and MT engines, the following changes to an off-line system are introduced:
• To speed up recognition, acoustic models can be adapted to a particular speaker. The size of the acoustic model is restricted (for additional speed-up when evaluating the Gaussian mixture models, one can use techniques such as Gaussian selection) and the search space is pruned more rigorously.
• To adapt to a particular speaker's style and domain, the language model is tuned offline on slides and publications provided by the speaker, either by reweighting available text corpora or by retrieving relevant training material through the internet or from previous lectures given by the same speaker.
• As almost all MT systems are trained on data split at sentence boundaries and therefore ideally expect sentence-like segments as input, particular care has to be taken for suitable online segmentation. We have observed that extreme deviations from
sentence-based segmentation can lead to significant decreases in performance. In view of minimizing overall system latency, however, shorter speech segments are preferred. In addition to providing efficient phrase translation on-the-fly, word-to-word alignment is optimally constrained for entire sentence pairs [29].
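A toy version of such a segmentation strategy, chunking the incoming ASR word stream at long pauses or after a maximum number of words to bound latency (the deployed system uses better-informed boundaries), might look like:

def segment_stream(words, pause_threshold=0.3, max_words=15):
    # words: iterable of (token, start_sec, end_sec) from the recognizer.
    # Yields lists of tokens to be sent to the MT component as translatable chunks.
    segment, prev_end = [], None
    for token, start, end in words:
        long_pause = prev_end is not None and (start - prev_end) > pause_threshold
        if segment and (long_pause or len(segment) >= max_words):
            yield segment
            segment = []
        segment.append(token)
        prev_end = end
    if segment:
        yield segment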
Figure 11 compares WERs on different domains for English. With a speaker-dependent lecture recognition system tuned as described above, we reach a sufficiently good performance of 10% WER. In an end-to-end evaluation of the system from English into Spanish we obtained a BLEU score of 19, while on reference transcripts we obtained a score of 24 (source [30]).
Figure 11: Current performance of speech recognition systems on different domains (source [28,30,31]; black = speaker-independent off-line system, gray = speaker-dependent online system).
3.3.2. Delivering Translation Services (Output Modalities)
Aside from speech and language challenges, lecture translation also presents human factor challenges, as the service should be provided unobtrusively, i.e., with minimal interference or disruption to the human-human communication. Several options are being explored:
• Subtitles: Simultaneous translations can be projected onto the wall as subtitles. This is suitable if the number of output languages is small.
• Translation goggles: Heads-up display goggles can display the translation text as captions in a pair of personalized goggles. Such goggles provide unobtrusive translation and exploit the parallelism between the acoustic and visual channels. This is particularly useful if listeners have partial knowledge of a speaker's language and wish to obtain complementary language assistance.
• Targeted Audio Speakers: Under the CHIL project, a set of ultrasound speakers with highly directional characteristics has been explored that can provide a narrow audio beam to an individual listener or a small area in the audience where simultaneous translation is required. Since such speakers are only audible in a narrow area, they do not disturb other listeners, and the service can be complemented by similar translation services into other languages for several other listener areas [32].
• PDAs, Display Screens or Head-Phones: Naturally, output translation can also be delivered through traditional display technology, i.e., displayed on a common screen or a personalized PDA screen, or acoustically via head-phones.
3.4. The Long Tail of Language
With promising solutions to the language divide under way, language portability remains the unsolved issue. At current estimates, there are more than 6,000 languages in the world, but language technology is only being developed for the most populous or wealthy ones. Most languages along the long tail of language (Figure 12) remain unaddressed. Overcoming the language divide thus requires workable solutions for the long tail of language at reasonable cost. Most current research is focused on improving cross-lingual technology by employing ever larger data, personnel or computational resources. To address the long tail of language, an orthogonal direction should be concerned with making do with less at lower cost.
Figure 12: The long tail of languages
At our center, we are therefore exploring several intriguing possibilities that lower cost and could some day bring this problem within reach as well:
• Language-independent or adaptive components (this was demonstrated already for acoustic modeling [33])
• More selective, parsimonious use of data and data collection [34] (a minimal sketch follows this list)
• Interactive and implicit training by the user [35]
• Training on simultaneously spoken translation, thereby eliminating the need for parallel text corpora [36]
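As one inexpensive instance of the second point, loosely in the spirit of the TF-IDF-based selection in [34] though not a reproduction of it, candidate training sentences can be ranked by TF-IDF similarity to a small in-domain seed corpus and only the best-matching ones kept:

import math
from collections import Counter

def select_training_sentences(candidates, seed, keep=1000):
    # candidates, seed: lists of tokenized sentences (lists of strings).
    # Returns the 'keep' candidates most similar to the seed corpus.
    docs = candidates + seed
    doc_freq = Counter(token for doc in docs for token in set(doc))
    n_docs = len(docs)
    def tfidf(tokens):
        tf = Counter(tokens)
        return {t: (c / len(tokens)) * math.log(n_docs / (1 + doc_freq[t]))
                for t, c in tf.items()}
    def cosine(v1, v2):
        dot = sum(v1[t] * v2.get(t, 0.0) for t in v1)
        norm = math.sqrt(sum(x * x for x in v1.values())) * math.sqrt(sum(x * x for x in v2.values()))
        return dot / norm if norm else 0.0
    seed_vector = tfidf([token for sentence in seed for token in sentence])
    ranked = sorted(candidates, key=lambda s: cosine(tfidf(s), seed_vector), reverse=True)
    return ranked[:keep]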
4. Acknowledgements
The work presented here was supported in part by the
European Union (EU) (projects CHIL (Grant number IST506909) and TC-STAR (Grant number IST-506738)), by NSF
(ITR STR-DUST), by DARPA (projects TRANSTAC and
GALE). I would also like to thank the CHIL, TC-STAR,
GALE, TRANSTAC partners and the InterACT research team
at Karlsruhe and Pittsburgh for their collaboration and for
data and images reported in this paper. Any opinions, findings,
conclusions or recommendations expressed in this paper are
those of the author and do not necessarily reflect the views of
the funding agencies or the partners.
5. References
[1] R. Stiefelhagen, K. Bernardin, R. Bowers, J. Garafolo,
D. Mostefa, P. Soundararajan, “The CLEAR 2006
Evaluation”, Proceedings of the First International
CLEAR Evaluation, Springer LNCS 4122.
[2] J. Fiscus, J. Ajot, M. Michel, and J. Garofolo, “The rich
transcription 2006 spring meeting recognition
evaluation,’’ Proc. MLMI, Washington DC, 2006.
[3] C. Canton-Ferrer, J. R. Casas, M. Pardàs, “Human
Model and Motion Based 3D Action Recognition in
Multiple View Scenarios”. EUSIPCO, Firenze,
September 2006
[4] O. Lanz, “Approximate Bayesian Multibody Tracking”.
IEEE Trans. PAMI, vol. 28, no. 9, September 2006
[5] R. Stiefelhagen, K. Bernardin, H.K. Ekenel, J.
McDonough, K. Nickel, M. Voit, M. Wölfel, "Audio-Visual Perception of a Lecturer in a Smart Seminar
Room”. Signal Processing, Vol. 86 (12), December
2006, Elsevier.
[6] M. Wölfel, K. Nickel, and J. McDonough. “Microphone
array driven speech recognition: Influence of
localization on the word error rate”, Proc. of MLMI,
Edinburgh, UK, 2005.
[7] H. K. Maganti and D. Gatica-Perez “Speaker
Localization for Microphone Array-Based ASR: The
Effects of Accuracy on Overlapping Speech”, ICMI,
Banff, Canada, Nov. 2006.
[8] C. Wojek, K. Nickel, R. Stiefelhagen, “Activity
Recognition and Room-Level Tracking in an Office
Environment”. Proc. of the IEEE Intl. Conference on
Multisensor Fusion and Integration for Intelligent
Systems, Heidelberg, Germany, 2006.
[9] R. Stiefelhagen, J. Yang, A. Waibel, “Modeling Focus of
Attention for Meeting Indexing”. ACM Multimedia,
Orlando, Florida, Oct. 1999
[10] M. Voit, R. Stiefelhagen, “Tracking Head Pose and
Focus of Attention with Multiple Far-field Cameras”.
ICMI, Banff, Canada, Nov. 2006.
[11] CHIL – Computers in the Human Interaction Loop,
http://chil.server.de
[12] VACE – Video Analysis and Content Extraction,
http://www.ic-arda.org
[13] TRECVID – TREC Video Retrieval Evaluation,
http://www-nlpir.nist.gov/projects/t01v/
[14] PETS – Performance Evaluation of Tracking and
Surveillance, http://www.pets2006.net/
[15] ETISEO – Video Understanding Evaluation, http://www.silogic.fr/etiseo
[16] “D2.2 Functional Requirements & CHIL Cooperative
Information System Software Design, Part 2,
Cooperative Information System Software Design”,
http://chil.server.de
[17] A. Waibel, M. Bett, M. Finke, and R. Stiefelhagen,
“Meeting browser: Tracking and summarizing
meetings”. In Proceedings of the Broadcast News
Transcription and Understanding Workshop, pp. 281-286, Lansdowne, Virginia, 1998.
[18] M-M. Bouamrane and S. Luz, “Meeting browsing”,
Multimedia Systems, Springer-Verlag, 12 (4-5):439-457,
2006.
[19] Q. Y. Wang, A. Battocchi, I. Graziola, F. Pianesi, D.
Tomasini, M. Zancanaro, C. Nass. “The Role of
Psychological Ownership and Ownership Markers in
Collaborative Working Environment”. ICMI. Banff,
Canada, 2006
[20] M. Danninger, T. Kluge, R. Stiefelhagen, “MyConnector
– Analysis of Context Cues to Predict Human
Availability for Communication”. ICMI, Banff, Canada,
2006.
[21] J. Neumann, J. R. Casas, D. Macho, J. Ruiz,
“Multimodal Integration of Sensor Networks”. Proc. of
AIAI, pp. 312-323, Athens, Greece, 2006.
[22] A. Waibel, A. N. Jain, A. E. McNair, H. Saito, A. G. Hauptmann, J. Tebelskis, "JANUS: A Speech-to-speech Translation Using Connectionist and Symbolic Processing Strategies", Proc. of ICASSP'91, pp. 793-796, May 1991.
[23] T. Morimoto, T. Takezawa, F. Yato, S. Sagayama, T. Tashiro, M. Nagata, and A. Kurematsu, "ATR's speech translation system: ASURA", Proc. 3rd European Conf. on Speech Communication and Technology, pp. 1291-1294, Sep. 1993.
[24] R. Hsiao, A. Venugopal, T. Köhler, Y. Zhang, P. Charoenpornsawat, A. Zollmann, S. Vogel, A. W. Black, T. Schultz, A. Waibel, "Optimizing Components for Handheld Two-way Speech Translation for English-Iraqi Arabic System", Proceedings of Interspeech, 2006.
[25] GALE – Global Autonomous Language Exploitation, http://www.darpa.mil/ipto/programs/gale
[26] J. L. Gauvain, "Speech transcription: general presentation of existing technologies within TC-Star", TC-Star Review Workshop, Luxembourg, May 28-30, 2007.
[27] H. Ney, "TC-Star: Statistical MT of Text and Speech", TC-Star Review Workshop, Luxembourg, May 28-30, 2007.
[28] K. Choukri, "Importance of the Evaluation of Human-Language Technologies", TC-Star Review Workshop, Luxembourg, May 28-30, 2007.
[29] M. Kolss, B. Zhao, S. Vogel, A. Hildebrand, J. Niehues, A. Venugopal, and Y. Zhang, "The ISL Statistical Machine Translation System for the TC-STAR Spring 2006 Evaluation", Proc. of the TC-STAR Workshop on Speech-to-Speech Translation, Barcelona, Spain, June 2006.
[30] C. Fügen, M. Kolss, M. Paulik, A. Waibel, "Open Domain Speech Translation: From Seminars and Speeches to Lectures", Proc. of the TC-STAR Workshop on Speech-to-Speech Translation, Barcelona, Spain, 2006.
[31] J. Fiscus and J. Ajot, "The Rich Transcription 2007 Speech-To-Text (STT) and Speaker Attributed STT (SASTT) Results", The Rich Transcription 2007 Meeting Recognition.
[32] D. Olszewski, F. Prasetyo, and K. Linhard, "Steerable Highly Directional Audio Beam Loudspeaker", Proc. of Interspeech, Lisboa, Portugal, September 2006.
[33] T. Schultz, "Multilinguale Spracherkennung – Kombination akustischer Modelle zur Portierung auf neue Sprachen", PhD thesis, Universität Karlsruhe, June 2000.
[34] M. Eck, S. Vogel, A. Waibel, "Low Cost Portability for Statistical Machine Translation based on N-gram Frequency and TF-IDF", Proc. of IWSLT, Pittsburgh, PA, Oct. 2005.
[35] M. Gavalda, A. Waibel, "Growing semantic grammars", Proceedings of the COLING/ACL, Montreal, Canada, 1998.
[36] M. Paulik, S. Stüker, C. Fügen, T. Schultz, T. Schaaf, A. Waibel, "Speech Translation Enhanced Automatic Speech Recognition", ASRU, Cancun, Mexico, December 2005.