Library Initiatives
Informedia Digital Video Library
M. Christel, T. Kanade, M. Mauldin, R. Reddy, M. Sirbu,
S. Stevens, and H. Wactlar
http://fuzine.mt.cs.cmu.edu/im/informedia.html
The Informedia Digital Video Library Project is developing new technologies for creating full-content search and retrieval digital video libraries. Working in collaboration with WQED Pittsburgh, the project is creating a testbed that will enable K–12 students to access, explore, and retrieve science and mathematics materials from the digital video library.

The library will initially contain 1,000 hours of video from the archives of project partners: WQED, Fairfax Co. VA Schools’ Electronic Field Trips, and the British Open University’s BBC-produced video courses. (Industrial partners include Digital Equipment Corp., Bell Atlantic, Intel Corp., and Microsoft, Inc.) This library will be installed at Winchester Thurston School, an independent K–12 school in Pittsburgh.

One of the most interesting research aspects of the project is the development of automatic, intelligent mechanisms to populate the library through integrated speech, image, and language understanding. The Informedia digital video library system uses Sphinx-II to transcribe narratives and dialogues automatically. Sphinx-II is a large-vocabulary, speaker-independent, continuous speech recognizer developed at Carnegie Mellon. With recent advances in acoustic and language modeling, it has achieved a 90% success rate on standardized tests for a 20,000-word, general dictation task. By relaxing time constraints and allowing transcripts to be generated off-line, Sphinx-II will be adapted to handle the video library domain’s larger vocabulary and diverse audio sources without severely degrading recognition rates.

Figure 1. Combining speech, image, and natural language to create a full-content, searchable library

COMMUNICATIONS OF THE ACM, April 1995/Vol. 38, No. 4

Figure 2. A prototype system display

In addition to annotating the
video library with text transcripts, the videos will be
segmented into smaller subsets for faster access and
retrieval of relevant information. Some segmentation
is possible via the time-based transcript generated
from the audio information.
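The time-based transcript idea can be sketched in a few lines: given word/start-time pairs such as a recognizer might emit, group the words into fixed-length time segments and invert them into a word-to-segment index. The pair format, the 30-second window, and the function names below are illustrative assumptions, not the project's actual transcript format.

```python
# Sketch: segment a time-stamped transcript and index words by segment.
# The (word, start_second) pairs and the 30-second window are
# illustrative assumptions, not Informedia's actual transcript format.
from collections import defaultdict

def segment_transcript(timed_words, window=30.0):
    """Group (word, start_time) pairs into fixed-length time segments."""
    segments = defaultdict(list)          # segment index -> words
    for word, start in timed_words:
        segments[int(start // window)].append(word)
    return dict(segments)

def build_index(segments):
    """Invert the segments: word -> sorted list of segment indices."""
    index = defaultdict(set)
    for seg_id, words in segments.items():
        for word in words:
            index[word.lower()].add(seg_id)
    return {w: sorted(ids) for w, ids in index.items()}

transcript = [("Mars", 2.1), ("orbit", 4.0), ("the", 31.5),
              ("Mars", 33.0), ("lander", 35.2)]
segs = segment_transcript(transcript)
idx = build_index(segs)
print(segs)         # {0: ['Mars', 'orbit'], 1: ['the', 'Mars', 'lander']}
print(idx["mars"])  # [0, 1]
```

A query for "Mars" then returns segment indices rather than whole videos, which is the kind of selective retrieval the transcript makes possible.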
Segmenting video clips via visual content is also
being performed based on work at CMU’s Image
Understanding Systems Laboratory. Rather than manually reviewing a file frame-by-frame around an index
entry point, machine vision methods that interpret
image sequences are used to automatically locate
beginning and end points for a scene or conversation. This segmentation process can be improved
through the use of contextual information supplied
by the transcript and language understanding.
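One common machine-vision approach to locating such boundary points is to compare successive frames and flag large changes; the sketch below does this with gray-level histograms on toy frames. It is a minimal illustration of the general idea, not the method used by CMU's Image Understanding Systems Laboratory, and the threshold and bin count are assumed values.

```python
# Sketch: flag scene-change candidates by comparing gray-level
# histograms of consecutive frames. Frames are plain 2-D lists of
# pixel intensities here; a real system would decode video frames.
def histogram(frame, bins=8, levels=256):
    """Count pixel intensities into a fixed number of bins."""
    counts = [0] * bins
    for row in frame:
        for px in row:
            counts[px * bins // levels] += 1
    return counts

def scene_cuts(frames, threshold=0.5):
    """Return frame indices where the normalized histogram difference
    between frame i-1 and frame i exceeds the threshold."""
    cuts = []
    for i in range(1, len(frames)):
        h1, h2 = histogram(frames[i - 1]), histogram(frames[i])
        total = sum(h1)
        diff = sum(abs(a - b) for a, b in zip(h1, h2)) / (2 * total)
        if diff > threshold:
            cuts.append(i)
    return cuts

dark  = [[10] * 4 for _ in range(4)]   # uniformly dark frame
light = [[240] * 4 for _ in range(4)]  # uniformly bright frame
print(scene_cuts([dark, dark, light, light]))  # [2]
```

The transcript can then refine these candidate cuts, as the article notes: a cut that falls mid-sentence in the time-aligned transcript is probably a camera change, not a true scene boundary.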
Finding desired items in a large information base
poses a major challenge. The Informedia Project goes
beyond simply searching the transcript text and is, in
addition, applying natural-language understanding
for knowledge-based search and retrieval. One strategy employs computational linguistic techniques from
the Center for Machine Translation for indexing,
browsing, and retrieving based on identification of
noun phrases in text. Along with improving query
capabilities, the Informedia Project is researching better ways to present information from a given video
library. Once users identify video objects of interest
they will need to be able to manipulate, organize, and
effectively reuse the video. To aid the user, the system
will use cinematic knowledge to enhance the composition and reuse of materials from the video library.
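Noun-phrase indexing of the kind described above can be illustrated with a toy chunker over POS-tagged tokens. The determiner-adjective-noun pattern and the tag set are simplifying assumptions for illustration; the Center for Machine Translation's actual techniques are far more sophisticated.

```python
# Sketch: extract noun phrases from POS-tagged text for indexing.
# The tag set and the (determiner)? (adjective)* (noun)+ pattern are
# a toy stand-in for real computational linguistic techniques.
def noun_phrases(tagged):
    """Collect maximal runs matching (DT)? (JJ)* (NN)+."""
    phrases, i = [], 0
    while i < len(tagged):
        j = i
        if j < len(tagged) and tagged[j][1] == "DT":   # optional determiner
            j += 1
        while j < len(tagged) and tagged[j][1] == "JJ":  # adjectives
            j += 1
        start_nouns = j
        while j < len(tagged) and tagged[j][1] == "NN":  # head nouns
            j += 1
        if j > start_nouns:                  # require at least one noun
            phrases.append(" ".join(w for w, _ in tagged[i:j]))
            i = j
        else:
            i += 1
    return phrases

sentence = [("the", "DT"), ("digital", "JJ"), ("video", "NN"),
            ("library", "NN"), ("stores", "VB"), ("science", "NN"),
            ("materials", "NN")]
print(noun_phrases(sentence))
# ['the digital video library', 'science materials']
```

Indexing on phrases like "digital video library" rather than on isolated words is what lets a knowledge-based search distinguish, say, a library of videos from a video about libraries.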
The Informedia Project’s first version drew on a
small (three gigabyte) database of several hundred
digital video objects, text, graphics, and audio material drawn from WQED’s Space Age series, distinguished lectures in computer science, and software
engineering training lectures. Early user feedback
has shown the benefits of automatic indexing and
segmentation, illustrating the accurate search and
selective retrieval of audio and video materials
appropriate to users’ needs and desires. The system
demonstrates the practicality of combining speech,
language, and image understanding technologies to
create entertaining, educational experiences.
© ACM 0002-0782/95/0400