Frame-by-Frame Multi-object Tracking Guided Information Augmentation For Video Captioning
In this work, we propose a new Transformer-based video captioning method. Our method aims to generate captions more efficiently and quickly, without relying on complex architectures or additional data modalities.
✨ The framework figure will be added later~
By releasing this code, we hope to stimulate further research and development in lightweight video captioning. If you find this work useful in your own research, please consider citing our paper as a reference.
Clone and enter the repo:

```bash
git clone https://github.com/ccc000-png/Tracker4Cap.git
cd Tracker4Cap
```
We have refactored the code and tested it with:

- Python 3.9
- torch 1.13.1

Please adjust the torch and CUDA versions according to your hardware.
```bash
conda create -n Tracker4Cap python==3.9
conda activate Tracker4Cap
# Install a proper version of torch, e.g.:
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117
```
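To quickly confirm that the installed torch build can see your GPU, a minimal check:

```python
# Sanity check: torch installed correctly and CUDA is visible.
import torch

print(torch.__version__)          # e.g. 1.13.1+cu117
print(torch.cuda.is_available())  # should print True on a CUDA machine
```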
1. Pre-trained CLIP (please refer to README_PRETRAINED.md)

The features of our model are extracted with pre-trained CLIP. To avoid network issues, we recommend downloading the pre-trained models in advance and placing them in the `/model_zoo/clip_model` folder.
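As a minimal sketch (assuming the weights are the standard OpenAI CLIP checkpoints and the `clip` pip package is installed), you can verify they load offline by pointing `download_root` at that folder:

```python
# Minimal sketch: verify the locally cached CLIP weights load without
# hitting the network. Assumes OpenAI's `clip` package and that the
# ViT-L/14 checkpoint sits in model_zoo/clip_model.
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device,
                              download_root="model_zoo/clip_model")
print(model.visual.output_dim)  # 768 for ViT-L/14
```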
2. Supported datasets

- MSVD
- MSRVTT

You can download our preprocessed data from OneDrive; it follows the structure below:
```
└── data
    ├── msrvtt
    │   ├── language
    │   │   └── msrvtt_caption.json
    │   ├── splits
    │   │   ├── msrvtt_test_list.pkl
    │   │   ├── msrvtt_train_list.pkl
    │   │   └── msrvtt_valid_list.pkl
    │   ├── ...
    │   └── visual
    │       ├── clip_b16
    │       │   └── frame_feature
    │       │       ├── ...
    │       │       └── xxx.npy
    │       ├── clip_b32
    │       │   └── frame_feature
    │       │       ├── ...
    │       │       └── xxx.npy
    │       └── clip_l14
    │           └── frame_feature
    │               ├── ...
    │               └── xxx.npy
    ├── msvd
    │   ├── language
    │   │   └── msvd_caption.json
    │   ├── splits
    │   │   ├── msvd_test_list.pkl
    │   │   ├── msvd_train_list.pkl
    │   │   └── msvd_valid_list.pkl
    │   ├── ...
    │   └── visual
    │       ├── clip_b16
    │       │   └── frame_feature
    │       │       ├── ...
    │       │       └── xxx.npy
    │       ├── clip_b32
    │       │   └── frame_feature
    │       │       ├── ...
    │       │       └── xxx.npy
    │       └── clip_l14
    │           └── frame_feature
    │               ├── ...
    │               └── xxx.npy
    └── ...
```
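To sanity-check the downloaded features, you can load one of the `.npy` files and inspect its shape (the file name below is illustrative; the CLIP feature dimension is 512 for the ViT-B models and 768 for ViT-L/14):

```python
# Load one preprocessed frame-feature file and inspect it
# (the file name here is illustrative).
import numpy as np

feats = np.load("data/msrvtt/visual/clip_l14/frame_feature/video0.npy")
print(feats.shape)  # (num_sampled_frames, 768) for clip_l14 features
```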
You can also download raw videos for feature processing from our shared links. Please organize them as follows:
| Datasets | Official Link |
|---|---|
| MSVD | Link |
| MSRVTT | Link (expired) |
```
└── data
    ├── msrvtt
    │   └── raw_videos
    │       ├── video0.avi
    │       ├── ...
    │       └── video9999.avi
    ├── msvd
    │   └── raw_videos
    │       ├── video0.avi
    │       ├── ...
    │       └── video1969.avi
    └── ...
```
Note:
- The original names of the MSVD videos do not follow the `videoXXX` format. We recommend using the official names from the dataset directly; you can follow README_DATA.md to process the data.
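For reference, per-frame CLIP feature extraction could look roughly like the sketch below. This is a hypothetical outline, not the repository's pipeline (see README_DATA.md for that); it assumes OpenAI's `clip` package, `opencv-python`, and uniform sampling with an assumed `num_frames`:

```python
# Hypothetical sketch of per-frame CLIP feature extraction; the actual
# pipeline is described in README_DATA.md. Assumes the `clip` and
# opencv-python packages are installed.
import cv2
import clip
import numpy as np
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device,
                              download_root="model_zoo/clip_model")

def extract_frame_features(video_path, num_frames=20):
    """Uniformly sample frames and encode each with the CLIP image encoder."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, num_frames, dtype=int)
    feats = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            continue
        # OpenCV decodes BGR; CLIP's preprocess expects a PIL RGB image.
        image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        with torch.no_grad():
            feat = model.encode_image(preprocess(image).unsqueeze(0).to(device))
        feats.append(feat.squeeze(0).cpu().numpy())
    cap.release()
    return np.stack(feats)  # shape: (num_frames, feature_dim)

# e.g.:
# np.save("data/msrvtt/visual/clip_l14/frame_feature/video0.npy",
#         extract_frame_features("data/msrvtt/raw_videos/video0.avi"))
```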
You can download the pretrained models and use them to evaluate the experiments with the following commands:
- If you run the experiments on `MSRVTT`:

```bash
# msrvtt
python test.py --dataset msrvtt --track_objects 4 --clip_name clip_l14 --Track --Age --fusion_action --save_checkpoints [Pretrained model on MSRVTT]
```
- If you run the experiments on `MSVD`:

```bash
# msvd
python test.py --dataset msvd --track_objects 3 --clip_name clip_l14 --Track --Age --fusion_action --save_checkpoints [Pretrained model on MSVD]
```
You can use the following commands to run the experiments:
- If you run the experiments on `MSRVTT`:

```bash
# msrvtt
python main.py --dataset msrvtt --track_objects 4 --clip_name clip_l14 --Track --Age --fusion_action
```
- If you run the experiments on `MSVD`:

```bash
# msvd
python main.py --dataset msvd --track_objects 3 --clip_name clip_l14 --Track --Age --fusion_action
```