
Probabilistic integration of sparse audio-visual cues for identity tracking

Published: 26 October 2008

Abstract

In the context of smart environments, the ability to track and identify persons is a key factor that determines the scope and flexibility of the analytical components and intelligent services that can be provided. While a considerable amount of work has been done on camera-based tracking of multiple users in a variety of scenarios, technologies for acoustic and visual identification, such as face or voice ID, are still subject to severe limitations when distantly placed sensors must be used. Reliable identification cues can therefore be hard to obtain without user cooperation, especially when multiple users are involved.
In this paper, we present a novel technique for tracking and identifying multiple persons in a smart environment using distantly placed audio-visual sensors. The technique builds on the opportunistic integration of tracking, face-identification, and voice-identification cues gained from several cameras and microphones whenever these cues can be captured with a sufficient degree of confidence. A probabilistic model keeps track of identified persons and updates the belief in their identities whenever new observations can be made. The technique has been systematically evaluated on the CLEAR Interactive Seminar database, a large audio-visual corpus of realistic meeting scenarios captured in a variety of smart rooms.
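The core idea of the abstract, maintaining a belief over each tracked person's identity and updating it whenever a sparse ID cue arrives, can be illustrated with a minimal Bayesian-update sketch. This is a hypothetical illustration, not the authors' implementation; the identity names, the uniform prior, and the single `confidence` parameter (the assumed probability that a cue names the right person) are all assumptions for the example.

```python
# Minimal sketch (not the paper's code): Bayesian belief update over
# person identities from sparse, noisy audio-visual ID cues.

def normalize(belief):
    """Scale a belief dict so its probabilities sum to 1."""
    total = sum(belief.values())
    return {k: v / total for k, v in belief.items()}

def update_belief(belief, observed_id, confidence):
    """Update P(identity) for one tracked person after an ID cue.

    `confidence` is the assumed probability that the cue is correct;
    the remaining mass is spread uniformly over the other identities.
    """
    n = len(belief)
    posterior = {}
    for identity, prior in belief.items():
        likelihood = confidence if identity == observed_id else (1 - confidence) / (n - 1)
        posterior[identity] = prior * likelihood
    return normalize(posterior)

# Uniform prior over three hypothetical enrolled identities,
# then two consistent face-ID cues sharpen the belief.
belief = {"alice": 1/3, "bob": 1/3, "carol": 1/3}
belief = update_belief(belief, "alice", 0.8)
belief = update_belief(belief, "alice", 0.8)
print(max(belief, key=belief.get))  # → alice
```

Because the update is multiplicative, even low-confidence cues accumulate over time, which is what makes opportunistic integration of sparse observations workable.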




      Published In

      MM '08: Proceedings of the 16th ACM international conference on Multimedia
      October 2008
      1206 pages
      ISBN:9781605583037
      DOI:10.1145/1459359


      Publisher

      Association for Computing Machinery

      New York, NY, United States



      Author Tags

      1. human perception
      2. modality fusion
      3. sensor fusion
      4. smart environments

      Qualifiers

      • Research-article

      Conference

      MM08
      Sponsor:
      MM08: ACM Multimedia Conference 2008
      October 26 - 31, 2008
Vancouver, British Columbia, Canada

      Acceptance Rates

      Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


      Cited By

      • An embedded audio-visual tracking and speech purification system on a dual-core processor platform. Microprocessors & Microsystems 34(7-8), 274-284. DOI: 10.1016/j.micpro.2010.05.004
      • Context-based environmental audio event recognition for scene understanding. Multimedia Systems 21(5), 507-524. DOI: 10.1007/s00530-014-0424-7
      • References. Similarity Measures for Face Recognition, 99-106, 2015. DOI: 10.2174/9781681080444115010014
      • Spatio-Temporal Reasoning in Biometrics Based Smart Environments. Procedia Computer Science 5, 378-385, 2011. DOI: 10.1016/j.procs.2011.07.049
      • Audio-Visual Fusion and Tracking With Multilevel Iterative Decoding: Framework and Experimental Evaluation. IEEE Journal of Selected Topics in Signal Processing 4(5), 882-894, Oct. 2010. DOI: 10.1109/JSTSP.2010.2057890
      • Audiovisual Information Fusion in Human-Computer Interfaces and Intelligent Environments: A Survey. Proceedings of the IEEE 98(10), 1692-1715, Oct. 2010. DOI: 10.1109/JPROC.2010.2057231
      • Multimodal identification and tracking in smart environments. Personal and Ubiquitous Computing 14(8), 685-694, Dec. 2010. DOI: 10.1007/s00779-010-0288-6
      • Hierarchical audio-visual cue integration framework for activity analysis in intelligent meeting rooms. 2009 IEEE CVPR Workshops, 107-114, June 2009. DOI: 10.1109/CVPRW.2009.5204224
