Text-to-Text Semantic Matching with AutoMM


Computing the similarity between two sentences or passages is a common task in NLP, with practical applications such as web search, question answering, document deduplication, plagiarism detection, natural language inference, and recommendation engines. In general, a text similarity model takes two sentences or passages as input and transforms them into vectors. A similarity score computed with cosine similarity, dot product, or Euclidean distance then measures how alike or different the two texts are.
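As a minimal sketch of the three scoring functions named above, the following pure-Python snippet compares small hand-written vectors. The vectors and function names are illustrative assumptions; real sentence embeddings come from a trained model and have hundreds of dimensions.

```python
import math

def dot_product(u, v):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Dot product normalized by both vector lengths; ranges from -1 to 1."""
    return dot_product(u, v) / (math.sqrt(dot_product(u, u)) * math.sqrt(dot_product(v, v)))

def euclidean_distance(u, v):
    """Straight-line distance between two vectors; 0 means identical."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Toy vectors standing in for sentence embeddings (illustrative values only).
u = [1.0, 2.0, 2.0]
v = [2.0, 4.0, 4.0]   # same direction as u, twice the length
w = [2.0, -1.0, 0.0]  # orthogonal to u

print("cosine(u, v)    =", cosine_similarity(u, v))
print("cosine(u, w)    =", cosine_similarity(u, w))
print("dot(u, v)       =", dot_product(u, v))
print("euclidean(u, v) =", euclidean_distance(u, v))
```

Cosine similarity depends only on direction, so u and v score as identical even though they differ in length, whereas the dot product and the Euclidean distance are both sensitive to vector magnitude.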

Prepare your Data

In this tutorial, we will demonstrate how to use AutoMM for text-to-text semantic matching with the Stanford Natural Language Inference (SNLI) corpus. SNLI is a corpus containing around 570k human-written sentence pairs labeled as entailment, contradiction, or neutral. It is a widely used benchmark for evaluating the representation and inference capability of machine learning methods. The following table contains three examples taken from this corpus.

| Premise | Hypothesis | Label |
| --- | --- | --- |
| A black race car starts up in front of a crowd of people. | A man is driving down a lonely road. | contradiction |
| An older and younger man smiling. | Two men are smiling and laughing at the cats playing on the floor. | neutral |
| A soccer game with multiple males playing. | Some men are playing a sport. | entailment |

Here, we treat sentence pairs labeled entailment as positive pairs (labeled as 1) and those labeled contradiction as negative pairs (labeled as 0). Sentence pairs with a neutral relationship are discarded. The following code downloads and loads the corpus into dataframes.

from autogluon.core.utils.loaders import load_pd
import pandas as pd

snli_train = load_pd.load('https://automl-mm-bench.s3.amazonaws.com/snli/snli_train.csv', delimiter="|")
snli_test = load_pd.load('https://automl-mm-bench.s3.amazonaws.com/snli/snli_test.csv', delimiter="|")
snli_train.head()
premise hypothesis label
0 A person on a horse jumps over a broken down a... A person is at a diner , ordering an omelette . 0
1 A person on a horse jumps over a broken down a... A person is outdoors , on a horse . 1
2 Children smiling and waving at camera There are children present 1
3 Children smiling and waving at camera The kids are frowning 0
4 A boy is jumping on skateboard in the middle o... The boy skates down the sidewalk . 0
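The CSV files loaded above already contain binary labels. For data that still carries the original string labels, the mapping described earlier (entailment to 1, contradiction to 0, neutral dropped) could be sketched with pandas as follows. The `raw` dataframe and the `label_map` name are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical raw rows with string labels, using the three examples from the table above.
raw = pd.DataFrame({
    "premise": [
        "A soccer game with multiple males playing.",
        "A black race car starts up in front of a crowd of people.",
        "An older and younger man smiling.",
    ],
    "hypothesis": [
        "Some men are playing a sport.",
        "A man is driving down a lonely road.",
        "Two men are smiling and laughing at the cats playing on the floor.",
    ],
    "label": ["entailment", "contradiction", "neutral"],
})

# Positive pairs become 1, negative pairs become 0; neutral has no entry and is dropped.
label_map = {"entailment": 1, "contradiction": 0}

binary = raw[raw["label"].isin(list(label_map))].copy()
binary["label"] = binary["label"].map(label_map)
binary = binary.reset_index(drop=True)

print(binary[["hypothesis", "label"]])
```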

Train your Model

Ideally, we want a model that returns high scores for positive text pairs and low scores for negative ones. Traditional text similarity methods, such as those based on term frequency or tf-idf vectors, work only at the lexical level and do not take semantics into account. With AutoMM, we can easily train a model that captures the semantic relationship between sentences. Basically, it uses BERT to project each sentence into a high-dimensional vector and treats the matching problem as a classification problem, following the design in sentence transformers.

With AutoMM, you just need to specify the query, response, and label column names and fit the model on the training dataset, without worrying about the implementation details. Note that the labels should be binary, and we need to specify the match_label, which is the label value indicating that two sentences have the same semantic meaning. In practice, your task may use different labels, e.g., duplicate or not duplicate, so you may need to choose the match_label according to your specific task.

from autogluon.multimodal import MultiModalPredictor

# Initialize the model
predictor = MultiModalPredictor(
        problem_type="text_similarity",
        query="premise", # the column name of the first sentence
        response="hypothesis", # the column name of the second sentence
        label="label", # the label column name
        match_label=1, # the label indicating that query and response have the same semantic meanings.
        eval_metric='auc', # the evaluation metric
    )

# Fit the model
predictor.fit(
    train_data=snli_train,
    time_limit=180,
)
No path specified. Models will be saved in: "AutogluonModels/ag-20240614_003307"
=================== System Info ===================
AutoGluon Version:  1.1.1b20240613
Python Version:     3.10.13
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Fri May 17 18:07:48 UTC 2024
CPU Count:          8
Pytorch Version:    2.3.1+cu121
CUDA Version:       12.1
Memory Avail:       28.38 GB / 30.95 GB (91.7%)
Disk Space Avail:   175.74 GB / 255.99 GB (68.7%)
===================================================
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [0, 1]
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during Predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression', 'quantile'])
/home/ci/autogluon/multimodal/src/autogluon/multimodal/utils/metric.py:116: UserWarning: Metric auc is not supported as the evaluation metric for binary in matching tasks.The evaluation metric is changed to roc_auc by default.
  warnings.warn(

AutoMM starts to create your model. ✨✨✨

To track the learning progress, you can open a terminal and launch Tensorboard:
    ```shell
    # Assume you have installed tensorboard
    tensorboard --logdir /home/ci/autogluon/docs/tutorials/multimodal/semantic_matching/AutogluonModels/ag-20240614_003307
    ```

Seed set to 0
/home/ci/opt/venv/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
GPU Count: 1
GPU Count to be Used: 1
GPU 0 Name: Tesla T4
GPU 0 Memory: 0.42GB/15.0GB (Used/Total)

Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name              | Type                         | Params | Mode 
---------------------------------------------------------------------------
0 | query_model       | HFAutoModelForTextPrediction | 33.4 M | train
1 | response_model    | HFAutoModelForTextPrediction | 33.4 M | train
2 | validation_metric | BinaryAUROC                  | 0      | train
3 | loss_func         | ContrastiveLoss              | 0      | train
4 | miner_func        | PairMarginMiner              | 0      | train
---------------------------------------------------------------------------
33.4 M    Trainable params
0         Non-trainable params
33.4 M    Total params
133.440   Total estimated model params size (MB)
Time limit reached. Elapsed time is 0:03:00. Signaling Trainer to stop.
Epoch 0, global step 157: 'val_roc_auc' reached 0.89398 (best 0.89398), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/semantic_matching/AutogluonModels/ag-20240614_003307/epoch=0-step=157.ckpt' as top 3
Start to fuse 1 checkpoints via the greedy soup algorithm.
AutoMM has created your model. 🎉🎉🎉

To load the model, use the code below:
    ```python
    from autogluon.multimodal import MultiModalPredictor
    predictor = MultiModalPredictor.load("/home/ci/autogluon/docs/tutorials/multimodal/semantic_matching/AutogluonModels/ag-20240614_003307")
    ```

If you are not satisfied with the model, try to increase the training time, 
adjust the hyperparameters (https://auto.gluon.ai/stable/tutorials/multimodal/advanced_topics/customization.html),
or post issues on GitHub (https://github.com/autogluon/autogluon/issues).
<autogluon.multimodal.predictor.MultiModalPredictor at 0x7f1d47da60e0>
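To make the limitation of lexical methods mentioned at the start of this section concrete, here is a minimal sketch of a term-frequency cosine score applied to the entailment and contradiction examples from the SNLI table. The function name and the simple regex tokenizer are assumptions made for this illustration and are not part of AutoMM.

```python
import math
import re
from collections import Counter

def term_frequency_cosine(text_a, text_b):
    """Cosine similarity between the raw term-frequency vectors of two texts."""
    tf_a = Counter(re.findall(r"[a-z']+", text_a.lower()))
    tf_b = Counter(re.findall(r"[a-z']+", text_b.lower()))
    # Only words present in both texts contribute to the dot product.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

entailment_score = term_frequency_cosine(
    "A soccer game with multiple males playing.",
    "Some men are playing a sport.",
)
contradiction_score = term_frequency_cosine(
    "A black race car starts up in front of a crowd of people.",
    "A man is driving down a lonely road.",
)

print(f"entailment pair:    {entailment_score:.3f}")
print(f"contradiction pair: {contradiction_score:.3f}")
```

Both pairs land near 0.31, because most of the overlap comes from the article "a". Word overlap alone therefore cannot separate the entailment pair from the contradiction pair, which is the gap a trained semantic matcher is meant to close.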

Evaluate on Test Dataset

You can evaluate the matcher on the test dataset to see how it performs in terms of the roc_auc score:

score = predictor.evaluate(snli_test)
print("evaluation score: ", score)
evaluation score:  {'roc_auc': 0.9090712842233175}
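As a rough guide to reading this number, ROC AUC can be interpreted as the probability that a randomly chosen positive pair receives a higher score than a randomly chosen negative pair. The sketch below computes it directly from that definition on made-up labels and scores. It is a quadratic-time illustration under that assumption, not the routine AutoMM uses internally.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative example")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))

# Made-up labels and match scores for five sentence pairs (illustrative only).
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.2]

print(f"roc_auc = {roc_auc(labels, scores):.3f}")
```

In this toy example 5 of the 6 positive-negative pairs are ordered correctly, giving about 0.833. A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.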

Predict on a New Sentence Pair

We create a new sentence pair with similar meaning (expected to be predicted as 1) and make predictions using the trained model.

pred_data = pd.DataFrame.from_dict({"premise":["The teacher gave his speech to an empty room."], 
                                    "hypothesis":["There was almost nobody when the professor was talking."]})

predictions = predictor.predict(pred_data)
print('Predicted label:', predictions[0])
Predicted label: 1

Predict Matching Probabilities

We can also compute the matching probabilities of sentence pairs.

probabilities = predictor.predict_proba(pred_data)
print(probabilities)
          0         1
0  0.207448  0.792552

Extract Embeddings

Moreover, you can extract embeddings separately for the query sentences and the response sentences. In the example below, each sentence is mapped to a 384-dimensional vector.

embeddings_1 = predictor.extract_embedding({"premise":["The teacher gave his speech to an empty room."]})
print(embeddings_1.shape)
embeddings_2 = predictor.extract_embedding({"hypothesis":["There was almost nobody when the professor was talking."]})
print(embeddings_2.shape)
(1, 384)
(1, 384)

Other Examples

You may go to AutoMM Examples to explore other examples about AutoMM.

Customization

To learn how to customize AutoMM, please refer to Customize AutoMM.