Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

The Visual Grounding (Referring Expression Comprehension) section below has the steps to run the evaluation script. The OFA.ipynb notebook contains the code to clone the repo and includes all the steps I ran to replicate the results.

NOTE: The README.md from the original repo has been edited as below for better understanding.




ModelScope  |  Checkpoints  |  Colab  |  Demo  |  Paper   |  Blog



OFA is a unified sequence-to-sequence pretrained model (supporting English and Chinese) that unifies modalities (i.e., cross-modality, vision, language) and tasks (both finetuning and prompt tuning are supported): image captioning (1st on the MSCOCO Leaderboard), VQA, visual grounding, text-to-image generation, text classification, text generation, image classification, etc. We provide step-by-step instructions for pretraining and finetuning, along with the corresponding checkpoints (see the official ckpt [EN|CN] or the Hugging Face ckpt).

We sincerely welcome contributions to our project. Feel free to contact us or send us issues / PRs!

Online Demos

We provide online demos via Hugging Face Spaces for you to interact with our pretrained and finetuned models.

We also provide Colab notebooks so you can walk through the procedures step by step; see the Colab link above.

Requirements

  • python 3.7.4
  • pytorch 1.8.1
  • torchvision 0.9.1
  • Java 1.8 (for COCO evaluation)

Installation

git clone https://github.com/OFA-Sys/OFA
cd OFA
pip install -r requirements.txt
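
As a quick sanity check after installation, a short script such as the following (a minimal sketch, not part of the repo) can confirm that the versions listed under Requirements are in place and that CUDA is visible:

# Environment sanity check (illustrative; not part of the OFA repo).
import sys
import torch
import torchvision

print("python:", sys.version.split()[0])
print("torch:", torch.__version__, "| torchvision:", torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())

# OFA targets pytorch 1.8.1 / torchvision 0.9.1; warn on a mismatch.
if not torch.__version__.startswith("1.8"):
    print("Warning: the scripts are tested with pytorch 1.8.1; other versions may misbehave.")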



Datasets and Checkpoints

See datasets.md and checkpoints.md.

Training & Inference

Below we provide methods for training and inference on different tasks. We provide both pretrained OFA-Large and OFA-Base in checkpoints.md. The scripts mentioned in this section are prepared for OFA-Large. For reproducing the downstream results of OFA-Base, we have also provided the corresponding finetuning and inference scripts in the run_scripts/ folder.

We recommend organizing your workspace directory like this:

OFA/
├── checkpoints/
│   ├── ofa_base.pt
│   ├── ofa_large.pt
│   ├── caption_large_best_clean.pt
│   └── ...
├── criterions/
├── data/
├── dataset/
│   ├── caption_data/
│   ├── gigaword_data/
│   └── ...
├── fairseq/
├── models/
├── run_scripts/
├── tasks/
├── train.py
├── trainer.py
└── utils/
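
If you want to check this layout programmatically, the following sketch (a hypothetical helper, not shipped with the repo) verifies that the expected subdirectories exist and lists any checkpoints already downloaded:

# Hypothetical helper: verify the recommended OFA workspace layout.
from pathlib import Path

REQUIRED_DIRS = ["checkpoints", "criterions", "data", "dataset",
                 "fairseq", "models", "run_scripts", "tasks", "utils"]

def check_workspace(root="OFA"):
    root = Path(root)
    missing = [d for d in REQUIRED_DIRS if not (root / d).is_dir()]
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("Workspace layout looks complete.")
    ckpt_dir = root / "checkpoints"
    if ckpt_dir.is_dir():
        # The checkpoints themselves are downloaded separately (see checkpoints.md).
        for ckpt in sorted(ckpt_dir.glob("*.pt")):
            print("Found checkpoint:", ckpt.name)

if __name__ == "__main__":
    check_workspace()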

Image Processing

To keep data processing efficient, we do not store images as many small files; instead, we encode them as base64 strings. Converting an image file to a base64 string is simple. Run the following code:

from PIL import Image
from io import BytesIO
import base64

img = Image.open(file_name) # path to file
img_buffer = BytesIO()
img.save(img_buffer, format=img.format)
byte_data = img_buffer.getvalue()
base64_str = base64.b64encode(byte_data) # bytes
base64_str = base64_str.decode("utf-8") # str
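
To sanity-check an encoded entry (for example, a sample pulled from one of the provided TSV files), the string can be decoded back into a PIL image. The snippet below is a small sketch of the reverse direction:

# Round-trip check: decode a base64 string back into a PIL image.
from PIL import Image
from io import BytesIO
import base64

restored = Image.open(BytesIO(base64.b64decode(base64_str)))
print(restored.size, restored.mode)  # quick sanity check
# If decoding fails on data that uses URL-safe characters, try base64.urlsafe_b64decode.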

Pretraining

Below we provide methods for pretraining OFA.

1. Prepare the Dataset

To pretrain OFA, you should first download the dataset we provide (pretrain_data_examples.zip, a small subset of the original pretraining data). For your own custom pretraining datasets, please prepare your training samples in the same format. pretrain_data_examples.zip contains 4 TSV files: vision_language_examples.tsv, text_examples.tsv, image_examples.tsv and detection_examples.tsv. Details of these files are as follows:

  • vision_language_examples.tsv: Each line contains uniq-id, image (base64 string), caption, question, answer, ground-truth objects (objects appearing in the caption or question), dataset name (source of the data) and task type (caption, qa or visual grounding). Prepared for the pretraining tasks of visual grounding, grounded captioning, image-text matching, image captioning and visual question answering.
  • text_examples.tsv: Each line contains uniq-id and text. Prepared for the pretraining task of text infilling.
  • image_examples.tsv: Each line contains uniq-id, image (base64 string, which should be resized to 256×256 resolution) and image-code (sparse codes for the central part of the image, generated by VQ-GAN). Prepared for the pretraining task of image infilling.
  • detection_examples.tsv: Each line contains uniq-id, image (base64 string) and bounding box annotations (the top-left and bottom-right coordinates of the bounding box, object_id and object_name, separated by commas). Prepared for the pretraining task of detection.
In addition, the folder negative_sample in pretrain_data_examples.zip contains three files: all_captions.txt, object.txt and type2ans.json. The data in these files are used as negative samples for the image-text matching (ITM) task.
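
As an illustration of the layout described above, the sketch below (not a script from the repo; the file path is an assumption) reads the first record of vision_language_examples.tsv and decodes its image field:

# Illustrative: inspect the first record of vision_language_examples.tsv.
# The field order follows the description above; adjust the path to your dataset directory.
import base64
from io import BytesIO
from PIL import Image

path = "pretrain_data_examples/vision_language_examples.tsv"  # assumed location
with open(path) as f:
    fields = f.readline().rstrip("\n").split("\t")
uniq_id, image_b64, caption, question, answer, objects, dataset_name, task_type = fields

image = Image.open(BytesIO(base64.b64decode(image_b64)))
# If decoding fails, the data may use URL-safe base64; try base64.urlsafe_b64decode.
print(f"id={uniq_id} task={task_type} dataset={dataset_name} image={image.size}")
print("caption:", caption)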

2. Pretraining

By default, the pretraining script will attempt to restore the released pretrained checkpoint of OFA-Base or OFA-Large and perform continuous pretraining. We recommend continuous pretraining, which achieves much better results than pretraining from scratch. For continuous pretraining, please download the pretrained weights in advance (see checkpoints.md) and put them in the correct directory OFA/checkpoints/. Otherwise, pretraining will begin from scratch.

cd run_scripts/pretraining
bash pretrain_ofa_large.sh # Pretrain OFA-Large. For OFA-Base, use pretrain_ofa_base.sh

If the pretrained OFA checkpoint is restored successfully, you will see the following information in the log:

INFO: Loaded checkpoint ../../checkpoints/ofa_large.pt

Visual Grounding (Referring Expression Comprehension)

This section provides the procedures for preparing the data, training, and evaluating your model on visual grounding.

1. Prepare the Dataset & Checkpoints

Download the data (see datasets.md) and models (see checkpoints.md) and put them in the correct directories. We provide the RefCOCO (split by UNC), RefCOCO+ (split by UNC) and RefCOCOg (split by UMD) datasets. See RefCOCO and Refer for more details. Note that in the original dataset, each region-coord (or bounding box) may correspond to multiple descriptive texts. We split these texts into multiple samples so that the region-coord in each sample corresponds to only one text. Each line of the processed dataset represents one sample in the following format: uniq-id, image-id, text, region-coord (coordinates separated by commas) and the image base64 string, separated by tabs.

79_1    237367  A woman in a white blouse holding a glass of wine.  230.79,121.75,423.66,463.06 9j/4AAQ...1pAz/9k=
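
For reference, a record like the one above can be parsed with a few lines of Python. This is a sketch assuming the tab/comma layout described here; the file name is an assumption (see datasets.md for the actual files):

# Sketch: parse one line of the processed visual grounding TSV.
import base64
from io import BytesIO
from PIL import Image

with open("dataset/refcoco_data/refcoco_val.tsv") as f:  # assumed file name
    line = f.readline().rstrip("\n")
uniq_id, image_id, text, region_coord, image_b64 = line.split("\t")
x0, y0, x1, y1 = map(float, region_coord.split(","))  # top-left and bottom-right corners

image = Image.open(BytesIO(base64.b64decode(image_b64)))
print(uniq_id, image_id, text)
print("box:", (x0, y0, x1, y1), "image size:", image.size)
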
2. Finetuning

Unlike in the original paper, we finetune OFA with a drop-path rate of 0.2, and we found that training with this hyper-parameter achieves better results. We will update the results reported in the paper later.

cd run_scripts/refcoco
nohup sh train_refcoco.sh > train_refcoco.out &  # finetune for refcoco
nohup sh train_refcocoplus.sh > train_refcocoplus.out &  # finetune for refcoco+
nohup sh train_refcocog.sh > train_refcocog.out &  # finetune for refcocog
3. Inference

Run the following command for the evaluation.

cd run_scripts/refcoco ; sh evaluate_refcoco.sh  # inference & evaluate for refcoco/refcoco+/refcocog
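
Visual grounding accuracy is typically reported as Acc@0.5: a prediction counts as correct when the IoU between the predicted box and the ground-truth box is at least 0.5. The sketch below only illustrates that metric; it is not the evaluation code inside evaluate_refcoco.sh:

# Illustrative IoU / Acc@0.5 computation (not the repo's evaluation code).
def box_iou(box_a, box_b):
    """Boxes are (x0, y0, x1, y1) with top-left and bottom-right corners."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def acc_at_05(predicted_boxes, ground_truth_boxes):
    hits = sum(box_iou(p, g) >= 0.5 for p, g in zip(predicted_boxes, ground_truth_boxes))
    return hits / max(len(ground_truth_boxes), 1)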
