Continuous Training with AutoMM¶

Open In Colab Open In SageMaker Studio Lab

Continuous training provides a method for machine learning models to refine their performance over time. It enables models to build upon previously acquired knowledge, thereby enhancing accuracy, facilitating knowledge transfer across tasks, and saving computational resources. In this tutorial, we will demonstrate three use cases of continuous training with AutoMM.

Use Case 1: Expanding Training with Additional Data or Training Time¶

Sometimes, the model could benefit from more training epochs or additional training time in case of underfitting. With AutoMM, you can easily extend the training time of your model without starting from scratch.

Additionally, it’s also common to need to incorporate more data into your model. AutoMM allows you to continue training with data of the same problem type and same classes if it is a multiclass problem. This flexibility makes it easy to improve and adapt your models as your data grows.

We use Stanford Sentiment Treebank (SST) dataset as an example. It consists of movie reviews and their associated sentiment. Given a new movie review, the goal is to predict the sentiment reflected in the text (in this case a binary classification, where reviews are labeled as 1 if they convey a positive opinion and labeled as 0 otherwise). Let’s first load and look at the data, noting the labels are stored in a column called label.

from autogluon.core.utils.loaders import load_pd

train_data = load_pd.load("https://autogluon-text.s3-accelerate.amazonaws.com/glue/sst/train.parquet")
test_data = load_pd.load("https://autogluon-text.s3-accelerate.amazonaws.com/glue/sst/dev.parquet")
subsample_size = 1000  # subsample data for faster demo, try setting this to larger values
train_data_1 = train_data.sample(n=subsample_size, random_state=0)
train_data_1.head(10)
sentence label
43787 very pleasing at its best moments 1
16159 , american chai is enough to make you put away... 0
59015 too much like an infomercial for ram dass 's l... 0
5108 a stirring visual sequence 1
67052 cool visual backmasking 1
35938 hard ground 0
49879 the striking , quietly vulnerable personality ... 1
51591 pan nalin 's exposition is beautiful and myste... 1
56780 wonderfully loopy 1
28518 most beautiful , evocative 1

Now let’s train the model. To ensure this tutorial runs quickly, we simply call fit() with a subset of 1000 training examples and limit its runtime to approximately 1 minute. To achieve reasonable performance in your applications, you are recommended to set much longer time_limit (eg. 1 hour), or do not specify time_limit at all (time_limit=None).

from autogluon.multimodal import MultiModalPredictor
import uuid

model_path = f"./tmp/{uuid.uuid4().hex}-automm_sst"
predictor = MultiModalPredictor(label="label", eval_metric="acc", path=model_path)
predictor.fit(train_data_1, time_limit=60)
=================== System Info ===================
AutoGluon Version:  1.1.1b20240613
Python Version:     3.10.13
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Fri May 17 18:07:48 UTC 2024
CPU Count:          8
Pytorch Version:    2.3.1+cu121
CUDA Version:       12.1
Memory Avail:       28.65 GB / 30.95 GB (92.6%)
Disk Space Avail:   191.92 GB / 255.99 GB (75.0%)
===================================================
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [1, 0]
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during Predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression', 'quantile'])

AutoMM starts to create your model. ✨✨✨

To track the learning progress, you can open a terminal and launch Tensorboard:
    ```shell
    # Assume you have installed tensorboard
    tensorboard --logdir /home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/tmp/fcfbe6c178e54ee0918d54fade86e1c6-automm_sst
    ```

Seed set to 0
/home/ci/opt/venv/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
GPU Count: 1
GPU Count to be Used: 1
GPU 0 Name: Tesla T4
GPU 0 Memory: 0.42GB/15.0GB (Used/Total)

Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name              | Type                         | Params | Mode 
---------------------------------------------------------------------------
0 | model             | HFAutoModelForTextPrediction | 108 M  | train
1 | validation_metric | MulticlassAccuracy           | 0      | train
2 | loss_func         | CrossEntropyLoss             | 0      | train
---------------------------------------------------------------------------
108 M     Trainable params
0         Non-trainable params
108 M     Total params
435.573   Total estimated model params size (MB)
Epoch 0, global step 3: 'val_acc' reached 0.54500 (best 0.54500), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/tmp/fcfbe6c178e54ee0918d54fade86e1c6-automm_sst/epoch=0-step=3.ckpt' as top 3
Epoch 0, global step 7: 'val_acc' reached 0.62500 (best 0.62500), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/tmp/fcfbe6c178e54ee0918d54fade86e1c6-automm_sst/epoch=0-step=7.ckpt' as top 3
Epoch 1, global step 10: 'val_acc' reached 0.70500 (best 0.70500), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/tmp/fcfbe6c178e54ee0918d54fade86e1c6-automm_sst/epoch=1-step=10.ckpt' as top 3
Epoch 1, global step 14: 'val_acc' reached 0.84000 (best 0.84000), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/tmp/fcfbe6c178e54ee0918d54fade86e1c6-automm_sst/epoch=1-step=14.ckpt' as top 3
Epoch 2, global step 17: 'val_acc' reached 0.86000 (best 0.86000), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/tmp/fcfbe6c178e54ee0918d54fade86e1c6-automm_sst/epoch=2-step=17.ckpt' as top 3
Epoch 2, global step 21: 'val_acc' reached 0.79000 (best 0.86000), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/tmp/fcfbe6c178e54ee0918d54fade86e1c6-automm_sst/epoch=2-step=21.ckpt' as top 3
Time limit reached. Elapsed time is 0:01:02. Signaling Trainer to stop.
Start to fuse 3 checkpoints via the greedy soup algorithm.
AutoMM has created your model. 🎉🎉🎉

To load the model, use the code below:
    ```python
    from autogluon.multimodal import MultiModalPredictor
    predictor = MultiModalPredictor.load("/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/tmp/fcfbe6c178e54ee0918d54fade86e1c6-automm_sst")
    ```

If you are not satisfied with the model, try to increase the training time, 
adjust the hyperparameters (https://auto.gluon.ai/stable/tutorials/multimodal/advanced_topics/customization.html),
or post issues on GitHub (https://github.com/autogluon/autogluon/issues).
<autogluon.multimodal.predictor.MultiModalPredictor at 0x7fed4827d750>

After training, we can evaluate our predictor on separate test data formatted similarly to our training data:

test_score = predictor.evaluate(test_data)
print(test_score)
{'acc': 0.8612385321100917}

If the training was completed successfully, model.ckpt can be found under model_path. If you think the model still underfits, you can continue training from this checkpoint by just running another .fit() with the same data. If you have some new data to add in and don’t want to train from scratch, you can also run .fit() with the new combined dataset.

predictor_2 = MultiModalPredictor.load(model_path)  # you can also use the `predictor` we assigned above
train_data_2 = train_data.drop(train_data_1.index).sample(n=subsample_size, random_state=0)
predictor_2.fit(train_data_2, time_limit=60)
Load pretrained checkpoint: /home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/tmp/fcfbe6c178e54ee0918d54fade86e1c6-automm_sst/model.ckpt
A new predictor save path is created. This is to prevent you to overwrite previous predictor saved here. You could check current save path at predictor._save_path. If you still want to use this path, set resume=True
No path specified. Models will be saved in: "AutogluonModels/ag-20240614_001413"
=================== System Info ===================
AutoGluon Version:  1.1.1b20240613
Python Version:     3.10.13
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Fri May 17 18:07:48 UTC 2024
CPU Count:          8
Pytorch Version:    2.3.1+cu121
CUDA Version:       12.1
Memory Avail:       25.31 GB / 30.95 GB (81.8%)
Disk Space Avail:   191.07 GB / 255.99 GB (74.6%)
===================================================

AutoMM starts to create your model. ✨✨✨

To track the learning progress, you can open a terminal and launch Tensorboard:
    ```shell
    # Assume you have installed tensorboard
    tensorboard --logdir /home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20240614_001413
    ```

Seed set to 0
GPU Count: 1
GPU Count to be Used: 1
GPU 0 Name: Tesla T4
GPU 0 Memory: 0.57GB/15.0GB (Used/Total)

Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name              | Type                         | Params | Mode 
---------------------------------------------------------------------------
0 | model             | HFAutoModelForTextPrediction | 108 M  | train
1 | validation_metric | MulticlassAccuracy           | 0      | train
2 | loss_func         | CrossEntropyLoss             | 0      | train
---------------------------------------------------------------------------
108 M     Trainable params
0         Non-trainable params
108 M     Total params
435.573   Total estimated model params size (MB)
Epoch 0, global step 3: 'val_acc' reached 0.87000 (best 0.87000), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20240614_001413/epoch=0-step=3.ckpt' as top 3
Epoch 0, global step 7: 'val_acc' reached 0.89500 (best 0.89500), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20240614_001413/epoch=0-step=7.ckpt' as top 3
Epoch 1, global step 10: 'val_acc' reached 0.88500 (best 0.89500), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20240614_001413/epoch=1-step=10.ckpt' as top 3
Epoch 1, global step 14: 'val_acc' reached 0.90000 (best 0.90000), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20240614_001413/epoch=1-step=14.ckpt' as top 3
Epoch 2, global step 17: 'val_acc' reached 0.91500 (best 0.91500), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20240614_001413/epoch=2-step=17.ckpt' as top 3
Epoch 2, global step 21: 'val_acc' reached 0.93000 (best 0.93000), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20240614_001413/epoch=2-step=21.ckpt' as top 3
Epoch 3, global step 24: 'val_acc' reached 0.91500 (best 0.93000), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20240614_001413/epoch=3-step=24.ckpt' as top 3
Time limit reached. Elapsed time is 0:01:09. Signaling Trainer to stop.
Start to fuse 3 checkpoints via the greedy soup algorithm.
AutoMM has created your model. 🎉🎉🎉

To load the model, use the code below:
    ```python
    from autogluon.multimodal import MultiModalPredictor
    predictor = MultiModalPredictor.load("/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20240614_001413")
    ```

If you are not satisfied with the model, try to increase the training time, 
adjust the hyperparameters (https://auto.gluon.ai/stable/tutorials/multimodal/advanced_topics/customization.html),
or post issues on GitHub (https://github.com/autogluon/autogluon/issues).
<autogluon.multimodal.predictor.MultiModalPredictor at 0x7fececa79720>
test_score_2 = predictor_2.evaluate(test_data)
print(test_score_2)
{'acc': 0.8922018348623854}

Use Case 2: Resuming Training from the Last Checkpoint¶

If your training process collapsed for some reason, AutoMM allows you to resume training right from where you left off. last.ckpt will be saved under model_path instead of model.ckpt. By resuming the training, you just have to call MultiModalPredictor.load() with resume option:

predictor_resume = MultiModalPredictor.load(path=model_path, resume=True)
predictor.fit(train_data, time_limit=60)

Use Case 3: Applying Pre-Trained Models to New Tasks¶

Often, you’ll encounter situations where a new task is related but not identical to a task you’ve previously trained a model for (e.g., training a more fine-grained sentiment analysis model, or adding more classes to your multiclass model). If you wish to leverage the knowledge that the model has already learned from the old data to help it learn the new task more quickly and effectively, AutoMM supports dumping your trained models into model weights and using them as foundation models:

dump_model_path = f"./tmp/{uuid.uuid4().hex}-automm_sst"
predictor.dump_model(save_path=dump_model_path)
Model weights and tokenizer for hf_text are saved to ./tmp/b58c2f0d69c345f4a854cea571c31026-automm_sst/hf_text.
'./tmp/b58c2f0d69c345f4a854cea571c31026-automm_sst'

You can then load the weights of the trained model, and continue training / fine-tuning the model on the new data.

Here is an example that uses the binary text model we trained previously on a regression task. We use the Semantic Textual Similarity Benchmark dataset for illustration only, so you might want to apply this feature to more relevant datasets. In this data, the column named score contains numerical values (which we would like to predict) that are human-annotated similarity scores for each given pair of sentences.

sts_train_data = load_pd.load("https://autogluon-text.s3-accelerate.amazonaws.com/glue/sts/train.parquet")[
    ["sentence1", "sentence2", "score"]
]
sts_test_data = load_pd.load("https://autogluon-text.s3-accelerate.amazonaws.com/glue/sts/dev.parquet")[
    ["sentence1", "sentence2", "score"]
]
sts_train_data.head(10)
Loaded data from: https://autogluon-text.s3-accelerate.amazonaws.com/glue/sts/train.parquet | Columns = 4 / 4 | Rows = 5749 -> 5749
Loaded data from: https://autogluon-text.s3-accelerate.amazonaws.com/glue/sts/dev.parquet | Columns = 4 / 4 | Rows = 1500 -> 1500
sentence1 sentence2 score
0 A plane is taking off. An air plane is taking off. 5.00
1 A man is playing a large flute. A man is playing a flute. 3.80
2 A man is spreading shreded cheese on a pizza. A man is spreading shredded cheese on an uncoo... 3.80
3 Three men are playing chess. Two men are playing chess. 2.60
4 A man is playing the cello. A man seated is playing the cello. 4.25
5 Some men are fighting. Two men are fighting. 4.25
6 A man is smoking. A man is skating. 0.50
7 The man is playing the piano. The man is playing the guitar. 1.60
8 A man is playing on a guitar and singing. A woman is playing an acoustic guitar and sing... 2.20
9 A person is throwing a cat on to the ceiling. A person throws a cat on the ceiling. 5.00

To specify a custom model that you created, use hyperparameters option in .fit():

hyperparameters={
    "model.hf_text.checkpoint_name": dump_model_path
}
sts_model_path = f"./tmp/{uuid.uuid4().hex}-automm_sts"
predictor_sts = MultiModalPredictor(label="score", path=sts_model_path)
predictor_sts.fit(
    sts_train_data, hyperparameters={"model.hf_text.checkpoint_name": f"{dump_model_path}/hf_text"}, time_limit=30
)
=================== System Info ===================
AutoGluon Version:  1.1.1b20240613
Python Version:     3.10.13
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Fri May 17 18:07:48 UTC 2024
CPU Count:          8
Pytorch Version:    2.3.1+cu121
CUDA Version:       12.1
Memory Avail:       25.03 GB / 30.95 GB (80.9%)
Disk Space Avail:   190.25 GB / 255.99 GB (74.3%)
===================================================
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == float and label-values can't be converted to int).
	Label info (max, min, mean, stddev): (5.0, 0.0, 2.701, 1.4644)
	If 'regression' is not the correct problem_type, please manually specify the problem_type parameter during Predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression', 'quantile'])

AutoMM starts to create your model. ✨✨✨

To track the learning progress, you can open a terminal and launch Tensorboard:
    ```shell
    # Assume you have installed tensorboard
    tensorboard --logdir /home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/tmp/9a373568b72b4a4db59b2cca539f1330-automm_sts
    ```

Seed set to 0
GPU Count: 1
GPU Count to be Used: 1
GPU 0 Name: Tesla T4
GPU 0 Memory: 0.57GB/15.0GB (Used/Total)

Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name              | Type                         | Params | Mode 
---------------------------------------------------------------------------
0 | model             | HFAutoModelForTextPrediction | 108 M  | train
1 | validation_metric | MeanSquaredError             | 0      | train
2 | loss_func         | MSELoss                      | 0      | train
---------------------------------------------------------------------------
108 M     Trainable params
0         Non-trainable params
108 M     Total params
435.570   Total estimated model params size (MB)
Epoch 0, global step 20: 'val_rmse' reached 0.59339 (best 0.59339), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/tmp/9a373568b72b4a4db59b2cca539f1330-automm_sts/epoch=0-step=20.ckpt' as top 3
Time limit reached. Elapsed time is 0:00:30. Signaling Trainer to stop.
Epoch 0, global step 37: 'val_rmse' reached 0.50949 (best 0.50949), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/tmp/9a373568b72b4a4db59b2cca539f1330-automm_sts/epoch=0-step=37.ckpt' as top 3
Start to fuse 2 checkpoints via the greedy soup algorithm.
AutoMM has created your model. 🎉🎉🎉

To load the model, use the code below:
    ```python
    from autogluon.multimodal import MultiModalPredictor
    predictor = MultiModalPredictor.load("/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/tmp/9a373568b72b4a4db59b2cca539f1330-automm_sts")
    ```

If you are not satisfied with the model, try to increase the training time, 
adjust the hyperparameters (https://auto.gluon.ai/stable/tutorials/multimodal/advanced_topics/customization.html),
or post issues on GitHub (https://github.com/autogluon/autogluon/issues).
<autogluon.multimodal.predictor.MultiModalPredictor at 0x7fed4827ebc0>
test_score = predictor_sts.evaluate(sts_test_data, metrics=["rmse", "pearsonr", "spearmanr"])
print("RMSE = {:.2f}".format(test_score["rmse"]))
print("PEARSONR = {:.4f}".format(test_score["pearsonr"]))
print("SPEARMANR = {:.4f}".format(test_score["spearmanr"]))
RMSE = 0.75
PEARSONR = 0.8745
SPEARMANR = 0.8744

We currently support dumping timm image models, MMDetection image models, HuggingFace text models, and any fusion models that comprises the aforementioned models. Similarly, we can also load a custom trained timm image model with:

{"model.timm_image.checkpoint_name": timm_image_model_path}

and a custom trained MMDetection model with:

{"model.mmdet_image.checkpoint_name": mmdet_image_model_path}

This feature helps you apply the knowledge of your previously trained task onto a new task, which saves your time and computational power. We will not go into details in this tutorial, but do keep in mind that we have not addressed a big challenge in this use case, i.e. Catastrophic Forgetting.