This is a third-party implementation of the CVPR 2024 paper *The Audio-Visual Conversational Graph: From an Egocentric-Exocentric Perspective*.
[Paper] [Supplement] [Project Page and Demo]
## Environment Setup

To set up the `avconv` conda environment with all required packages, run:

```bash
conda env create -f avconv.yaml
conda activate avconv
```
## Dataset Download & Availability

Please note that this repository does not provide any dataset download links. The original dataset used in the paper has not yet been publicly released by Meta. This page will be updated with the official source once it becomes available. Alternatively, you may collect your own multi-modal conversational data following the dataset description in the paper.
## Directory Structure

Once you obtain the dataset, organize it as follows:
### Audio-Visual Data

```
../data/av_data/{session_number}/image_{frame_number}.jpg
../data/av_data/{session_number}/a1_{frame_number}.mat
```
### Ground-Truth Label Files

```
../data/av_label/{session_number}.json
```
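For orientation, the sketch below shows how a session laid out this way could be enumerated and loaded in Python. It is not the repository's data loader: the function name `load_session`, the use of `scipy.io.loadmat` for the `a1_*.mat` audio features, and the assumption that the per-session label files are plain JSON are illustrative guesses based only on the layout above.

```python
import json
from pathlib import Path

from PIL import Image          # assumption: frames are standard JPEG images
from scipy.io import loadmat   # assumption: the .mat audio features are readable with scipy


def load_session(data_root, label_root, session_number):
    """Illustrative loader for one session following the layout above."""
    session_dir = Path(data_root) / str(session_number)
    label_file = Path(label_root) / f"{session_number}.json"

    with open(label_file) as f:
        labels = json.load(f)  # per-session ground-truth labels

    frames = []
    for image_path in sorted(session_dir.glob("image_*.jpg")):
        frame_number = image_path.stem.split("_", 1)[1]      # e.g. "image_0042" -> "0042"
        audio_path = session_dir / f"a1_{frame_number}.mat"
        frames.append({
            "frame_number": frame_number,
            "image": Image.open(image_path),
            "audio": loadmat(audio_path),  # dict of arrays stored in the .mat file
        })
    return frames, labels


# Example usage, mirroring the default config paths:
# frames, labels = load_session("../data/av_data", "../data/av_label", "1")
```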
## Update Paths in Parameter Files

In both the training and evaluation configs (`params_train.json`, `params_test.json`), make sure the following fields are set correctly:

```
data_path: "../data/av_data"
label_path: "../data/av_label"
```
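Before launching training or evaluation, it can be worth confirming that these paths actually resolve. The snippet below is a minimal, hypothetical check (not a script shipped with this repo); it only assumes the parameter files are plain JSON containing the fields listed above.

```python
import json
from pathlib import Path


def check_params(params_file):
    """Print whether the directories configured in a parameter file exist."""
    with open(params_file) as f:
        params = json.load(f)

    for key in ("data_path", "label_path"):
        path = Path(params[key])
        status = "ok" if path.is_dir() else "MISSING"
        print(f"{params_file}: {key} = {path} [{status}]")


check_params("./params/params_train.json")
check_params("./params/params_test.json")
```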
## Training

- Parameter file: `./params/params_train.json`
- Required paths: `data_path`, `label_path`, `log_path`
- Checkpoints and TensorBoard logs are saved under `log_path`; see the sketch below for locating saved checkpoints.

To start training:

```bash
python train_net.py
```
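Because checkpoints are written under `log_path`, one way to choose a value for `checkpoint_path` in the evaluation config is simply to list the most recently saved files there. This is a convenience sketch under the assumption that checkpoints are ordinary files somewhere below `log_path`; the actual file naming is determined by `train_net.py`.

```python
from pathlib import Path

# Hypothetical location; use whatever log_path is set to in params_train.json.
log_path = Path("../logs")

# Newest files first, so the most recent checkpoint is easy to spot.
saved_files = sorted(
    (p for p in log_path.rglob("*") if p.is_file()),
    key=lambda p: p.stat().st_mtime,
    reverse=True,
)
for path in saved_files[:5]:
    print(path)
```

The TensorBoard logs written to the same location can be monitored with `tensorboard --logdir <log_path>`.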
## Evaluation

- Parameter file: `./params/params_test.json`
- Required paths: `data_path`, `label_path`, `checkpoint_path`, `out_path`
- Set `checkpoint_path` to the checkpoint you want to evaluate.
- Set `out_path` to specify where the output `preds.pkl` prediction files are saved.

To run evaluation:

```bash
python test_net.py
```

Predictions will be saved at:

```
./output/{ckpt_log}_inference/preds.pkl
```
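The saved predictions can be read back with Python's `pickle` module for inspection. The exact structure of the stored object is defined by `test_net.py` and is not documented here, so the sketch below only peeks at its type and top-level contents.

```python
import pickle

# Replace {ckpt_log} with the actual checkpoint log name used for inference.
preds_file = "./output/{ckpt_log}_inference/preds.pkl"

with open(preds_file, "rb") as f:
    preds = pickle.load(f)

print(type(preds))
# The layout of `preds` is defined by test_net.py; dicts and lists are the common cases.
if isinstance(preds, dict):
    print(list(preds.keys())[:10])
elif isinstance(preds, (list, tuple)):
    print(len(preds), preds[:1])
```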
## Disclaimer

This implementation does not contain any proprietary code, internal tools, or unpublished resources from Meta. All components, including the architecture, data loaders, and configurations, were reproduced independently for academic and community research purposes.
## Citation

If you find this work useful for your research, please cite:
```bibtex
@inproceedings{jia2024audio,
  title={The Audio-Visual Conversational Graph: From an Egocentric-Exocentric Perspective},
  author={Jia, Wenqi and Liu, Miao and Jiang, Hao and Ananthabhotla, Ishwarya and Rehg, James M and Ithapu, Vamsi Krishna and Gao, Ruohan},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2024}
}
```