Different Techniques for Sentence Semantic Similarity in NLP

Last Updated : 01 Feb, 2024

Semantic similarity is the similarity in meaning between two words, phrases, sentences, or longer pieces of text. It measures how close or how different two pieces of text are in terms of their meaning and context.

In this article, we will focus on how the semantic similarity between two sentences is derived. We will cover the following widely used models.

  1. Doc2Vec – An extension of Word2Vec to whole sentences and paragraphs.
  2. SBERT – A Transformer-based model in which the encoder part captures the meaning of words in a sentence.
  3. InferSent – It uses a bi-directional LSTM to encode sentences and infer semantics.
  4. USE (Universal Sentence Encoder) – A model trained by Google that generates fixed-size embeddings for sentences that can be used for any NLP task.

What is Semantic Similarity?

Semantic similarity refers to the degree to which two pieces of text convey the same meaning. Unlike lexical similarity, which focuses on the surface structure and word overlap of phrases, semantic similarity delves into the understanding and meaning of the content. The aim is to measure how closely related or analogous the concepts, ideas, or information conveyed in two texts are.

In NLP, semantic similarity is used in various tasks, such as:

  1. Question answering – Enhances QA systems by measuring the semantic similarity between user queries and document content.
  2. Recommendation systems – Matches user content against available content by semantic similarity.
  3. Summarization – Helps in identifying and condensing semantically similar content.
  4. Corpus clustering – Helps in grouping documents with similar content.

Common approaches for measuring semantic similarity in natural language processing (NLP) include word embeddings, sentence embeddings, and transformer-based models.

Word Embedding

To understand semantic relationships between sentences, one must first be aware of word embeddings. Word embeddings are vectorized representations of words. The simplest form of word embedding is a one-hot vector, but such vectors are sparse, very high dimensional, and do not capture meaning. More advanced approaches such as Word2Vec (skip-gram, CBOW), GloVe, and fastText capture semantic information in a dense, low-dimensional space.

Word2Vec

Word2Vec represents words as dense vectors so that semantically similar words end up close to each other in the vector space. There are two main architectures for Word2Vec, illustrated by the short gensim sketch after this list:

  • Continuous Bag of Words (CBOW): The objective is to predict the target word based on the context of surrounding words.
  • Skip-gram: The objective is to predict the surrounding context words given the target word.
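As a quick illustration, here is a minimal gensim sketch of both architectures; the toy corpus and all parameter values are placeholders, so treat this as a sketch rather than a tuned setup.

Python3
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens (illustrative only)
corpus = [["the", "movie", "was", "a", "good", "thriller"],
          ["the", "baby", "learned", "to", "walk"],
          ["we", "are", "learning", "nlp"]]

# sg=0 trains CBOW (predict the target word from its context),
# sg=1 trains skip-gram (predict the context words from the target)
cbow_model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0, epochs=100)
sg_model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1, epochs=100)

# Cosine similarity between two word vectors from the skip-gram model
print(sg_model.wv.similarity("movie", "thriller"))
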

Doc2Vec

Similar to Word2Vec, Doc2Vec has two types of models: PV-DM (Distributed Memory Model of Paragraph Vectors), which is analogous to CBOW, and PV-DBOW (Distributed Bag of Words version of Paragraph Vectors), which is analogous to skip-gram. We will look at PV-DM, as it generally performs better than PV-DBOW.

PV-DM model

PV-DM is an extension of Word2Vec in the sense that it consists of one paragraph vector in addition to the word vectors.

  • Paragraph Vector and Word Vectors: Suppose there are N paragraphs in the corpus, and M words in the vocabulary.
    • Along the lines of Word2Vec, we will have an M×Q embedding matrix for the words (Q is the dimension of the word embeddings). Additionally, we will have an N×P matrix for our paragraphs (P is the dimension of the paragraph embeddings).
    • The paragraph vector is shared across all contexts generated from the same paragraph but not across different paragraphs.
    • The word vector matrix W is shared across paragraphs.
  • Averaging or Concatenation: To predict the next word in a context, the paragraph vector and word vectors are combined using either averaging or concatenation.
  • Distributed Memory Model (PV-DM): The paragraph token acts as a memory that retains information about what is missing from the current context or the topic of the paragraph.
  • Training with Stochastic Gradient Descent: Stochastic gradient descent is used to train the paragraph vectors and word vectors. The gradient is obtained via backpropagation. During each step of stochastic gradient descent, a fixed-length context is sampled from a random paragraph, and the error gradient is computed to update the model parameters.
  • Inference at Prediction Time: Once the model is trained, the paragraph vectors are discarded and only the word vectors and softmax weights are retained.
    • To find the paragraph vector of a new text at prediction time, an inference step is performed, again via gradient descent. In this step, the word vectors (W) and the softmax weights are held fixed while the new paragraph vector is learned through backpropagation.
    • To compare two sentences, their paragraph vectors are inferred in this way, and the similarity between the sentences is computed from the similarity (e.g., cosine) between the two vectors.

In summary, the algorithm itself has two key stages:

  • Training to get word vectors W, softmax weights U, b, and paragraph vectors D on already seen paragraphs.
  • The inference stage gets paragraph vectors D for new paragraphs (never seen before) by adding more columns to D and running gradient descent on D while holding W, U, and b fixed.

We use the learned paragraph vectors to predict some particular labels using a standard classifier, e.g., logistic regression.
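A minimal sketch of this classification step, assuming a hypothetical labelled corpus; the documents, labels, and hyperparameters below are made up purely for illustration.

Python3
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

# Hypothetical labelled corpus (tokens + sentiment label), for illustration only
docs = [(["great", "movie", "loved", "it"], 1),
        (["boring", "plot", "bad", "acting"], 0),
        (["excellent", "thriller", "good", "pacing"], 1),
        (["terrible", "film", "fell", "asleep"], 0)]

tagged = [TaggedDocument(words=words, tags=[str(i)]) for i, (words, _) in enumerate(docs)]
d2v = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=200)

# Learned paragraph vectors become features for a standard classifier
X = [d2v.dv[str(i)] for i in range(len(docs))]
y = [label for _, label in docs]
clf = LogisticRegression().fit(X, y)

# Infer a paragraph vector for unseen text and classify it
new_vec = d2v.infer_vector(["good", "movie"])
print(clf.predict([new_vec]))
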

Python Implementation of Doc2Vec

Below is a simple implementation of Doc2Vec.

  1. We first tokenize the words in each document and convert them to lowercase.
  2. We then create the TaggedDocument objects required for training the Doc2Vec model. Each document is associated with a unique tag (document ID), which identifies its paragraph vector.
  3. The parameters (vector_size, window, min_count, workers, epochs) control the model’s dimensions, context window size, minimum word count, parallelization, and training epochs.
  4. We then infer a vector representation for a new document that was not part of the training data.
  5. We then calculate the similarity score.
Python3
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')

# Sample data
data = ["The movie is awesome. It was a good thriller",
        "We are learning NLP throughg GeeksforGeeks",
        "The baby learned to walk in the 5th month itself"]

# Tokenizing the data
tokenized_data = [word_tokenize(document.lower()) for document in data]

# Creating TaggedDocument objects
tagged_data = [TaggedDocument(words=words, tags=[str(idx)])
               for idx, words in enumerate(tokenized_data)]


# Training the Doc2Vec model
model = Doc2Vec(vector_size=100, window=2, min_count=1, workers=4, epochs=1000)
model.build_vocab(tagged_data)
model.train(tagged_data, total_examples=model.corpus_count,
            epochs=model.epochs)

# Infer vector for a new document
new_document = "The baby was laughing and palying"
print('Original Document:', new_document)

inferred_vector = model.infer_vector(word_tokenize(new_document.lower()))

# Find most similar documents
similar_documents = model.dv.most_similar(
    [inferred_vector], topn=len(model.dv))

# Print the most similar documents
for index, score in similar_documents:
    print(f"Document {index}: Similarity Score: {score}")
    print(f"Document Text: {data[int(index)]}")
    print()

Output:

Original Document: The baby was laughing and palying
Document 2: Similarity Score: 0.9838361740112305
Document Text: The baby learned to walk in the 5th month itself

Document 0: Similarity Score: 0.9455077648162842
Document Text: The movie is awesome. It was a good thriller

Document 1: Similarity Score: 0.8828089833259583
Document Text: We are learning NLP throughg GeeksforGeeks

SBERT

SBERT adds a pooling operation to the output of BERT to derive a fixed-size sentence embedding. The sentence is tokenized, converted into word embeddings, and passed through the BERT network to get contextualized token vectors. The SBERT authors experimented with different pooling options and found that mean pooling works best: the contextualized token vectors are averaged to obtain the sentence embedding.
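To make the pooling step concrete, here is a minimal sketch of mean pooling over BERT token embeddings using the Hugging Face transformers library; the model name and sentences are only examples, and in practice the sentence-transformers library (used later in this section) wraps this whole pipeline for you.

Python3
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["The movie is awesome.", "The baby learned to walk."]
encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    token_embeddings = bert(**encoded).last_hidden_state   # (batch, seq_len, hidden)

# Mean pooling: average the token vectors, ignoring padding via the attention mask
mask = encoded["attention_mask"].unsqueeze(-1).float()      # (batch, seq_len, 1)
sentence_embeddings = (token_embeddings * mask).sum(1) / mask.sum(1)
print(sentence_embeddings.shape)                            # torch.Size([2, 768])
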

SBERT uses three objective functions to fine-tune the weights of the BERT model. The network is structured differently depending on the type of training data that drives each objective function.

1. Classification Objective Function

  • This model architecture uses pairs of sentences along with labels as training data.
  • Here the BERT model is structured as a Siamese network. What is a Siamese network? It consists of two identical subnetworks, each of which is a BERT model. The two subnetworks share the same parameters/weights, so parameter updates are mirrored across both. On top of the pooling layer, we have a softmax classifier with as many output nodes as there are labels in the training data.
  • The two sentences are passed through the network to get sentence embeddings u and v, along with the element-wise difference |u-v|. These three vectors (u, v, |u-v|) are concatenated and multiplied with a weight matrix W of size (3n x k) to get a softmax classification.
    o = \text{softmax}(W^T(u, v, |u-v|))
    Here,
    • n is the dimension of the sentence embeddings
    • k is the number of labels.
  • The optimization is performed using cross-entropy loss; a short PyTorch sketch of this head follows the figure below.
SBERT with Classification Objective Function
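Below is a minimal PyTorch sketch of this classification head, assuming u and v have already been produced by the shared encoder; the batch, dimensions, and labels are random stand-ins.

Python3
import torch
import torch.nn as nn

n, k = 768, 3                          # sentence-embedding dimension and number of labels (illustrative)
W = nn.Linear(3 * n, k, bias=False)    # the weight matrix W of size (3n x k)
loss_fn = nn.CrossEntropyLoss()        # cross-entropy on the softmax output

# Random stand-ins for the sentence embeddings u and v from the siamese BERT encoder
u = torch.randn(8, n)
v = torch.randn(8, n)
labels = torch.randint(0, k, (8,))

features = torch.cat([u, v, torch.abs(u - v)], dim=1)   # (u, v, |u - v|)
logits = W(features)                                    # softmax(W^T(u, v, |u-v|)) via cross-entropy
loss = loss_fn(logits, labels)
loss.backward()   # in real training, gradients also flow back into the shared encoder
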

2. Regression Objective Function

This also uses pairs of sentences, now labelled with a similarity score, as training data. The network is again structured as a Siamese network. However, instead of a softmax layer, the outputs of the pooling layer are used to calculate the cosine similarity between the two sentences, and mean squared error is used as the objective function to train the BERT model weights.

SBERT with Regression Objective Function
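A corresponding sketch of the regression objective, again with random stand-ins for the pooled sentence embeddings and made-up gold similarity scores.

Python3
import torch
import torch.nn.functional as F

u = torch.randn(8, 768, requires_grad=True)   # stand-ins for pooled embeddings of sentence 1
v = torch.randn(8, 768, requires_grad=True)   # stand-ins for pooled embeddings of sentence 2
gold = torch.rand(8)                          # hypothetical gold similarity scores in [0, 1]

cosine = F.cosine_similarity(u, v, dim=1)     # cosine similarity of each pair
loss = F.mse_loss(cosine, gold)               # mean-squared-error objective
loss.backward()
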

3. Triplet Objective Function

Here the model is structured as a triplet network.

  • In a Triplet Network, three subnetworks process an anchor sentence, a positive (similar) sentence, and a negative (dissimilar) sentence. The model learns to minimize the distance between the anchor and positive sentences while maximizing the distance between the anchor and negative sentences.
  • To train the model we need a dataset containing an anchor sentence a, a positive sentence p, and a negative sentence n. An example of such a dataset is the Wikipedia sections triplets dataset.

Mathematically, we minimize the following loss function.

\max(\lVert s_a - s_p \rVert - \lVert s_a - s_n \rVert + \epsilon, 0)

  • where s_a, s_p, and s_n are the sentence embeddings of the anchor, positive, and negative sentences respectively,
  • || · || is a distance metric, and
  • the margin ϵ ensures that s_p is at least ϵ closer to s_a than s_n is (see the sketch below).
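This objective maps directly onto PyTorch's built-in triplet margin loss; the sketch below uses random stand-ins for the anchor, positive, and negative embeddings.

Python3
import torch
import torch.nn as nn

s_a = torch.randn(8, 768, requires_grad=True)   # anchor embeddings (stand-ins)
s_p = torch.randn(8, 768, requires_grad=True)   # positive embeddings
s_n = torch.randn(8, 768, requires_grad=True)   # negative embeddings

# max(||s_a - s_p|| - ||s_a - s_n|| + epsilon, 0) with Euclidean distance and margin epsilon = 1
triplet_loss = nn.TripletMarginLoss(margin=1.0, p=2)
loss = triplet_loss(s_a, s_p, s_n)
loss.backward()
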

Python Implementation

To implement it, we first need to install the sentence-transformers framework:

!pip install -U sentence-transformers
  • The SentenceTransformer class is used to load a pre-trained SBERT model ('all-MiniLM-L6-v2' here), and its encode method converts sentences into embeddings.
  • We use the SciPy cosine distance to calculate the distance between two vectors. To get the similarity, we subtract it from 1.
Python3
#!pip install -U sentence-transformers

from scipy.spatial import distance
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')

# Sample sentence
sentences = ["The movie is awesome. It was a good thriller",
             "We are learning NLP throughg GeeksforGeeks",
             "The baby learned to walk in the 5th month itself"]


test = "I liked the movie."
print('Test sentence:',test)
test_vec = model.encode([test])[0]


for sent in sentences:
    similarity_score = 1-distance.cosine(test_vec, model.encode([sent])[0])
    print(f'\nFor {sent}\nSimilarity Score = {similarity_score} ')

Output:

Test sentence: I liked the movie.

For The movie is awesome. It was a good thriller
Similarity Score = 0.682051956653595

For We are learning NLP throughg GeeksforGeeks
Similarity Score = 0.0878136083483696

For The baby learned to walk in the 5th month itself
Similarity Score = 0.04816452041268349

InferSent

The structure comprises two components:

Sentence Encoder

  • The first is a sentence encoder responsible for receiving word vectors and transforming sentences into encoded vectors.
  • InferSent starts with pre-trained word embeddings. Words are embedded into continuous vector representations.
  • These embeddings serve as the input to the bi-directional LSTM that is capable of capturing sequential information and dependencies in data.
  • After processing the input sentence through the bi-directional LSTM, a pooling layer is applied to obtain a fixed-size vector representation of the entire sentence. Common pooling techniques include max pooling, mean pooling, or concatenation of the final hidden states. A stripped-down sketch of such an encoder follows this list.
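Here is a stripped-down sketch of such an encoder (a bi-directional LSTM followed by max pooling over time); the full InferSent class used later in this section additionally handles padding, sorting, and vocabulary lookup, which this toy version omits.

Python3
import torch
import torch.nn as nn

class TinyBiLSTMEncoder(nn.Module):
    def __init__(self, word_dim=300, hidden_dim=2048):
        super().__init__()
        self.lstm = nn.LSTM(word_dim, hidden_dim, num_layers=1,
                            bidirectional=True, batch_first=True)

    def forward(self, word_vectors):             # (batch, seq_len, word_dim)
        outputs, _ = self.lstm(word_vectors)     # (batch, seq_len, 2 * hidden_dim)
        sentence_emb, _ = outputs.max(dim=1)     # max pooling over the time dimension
        return sentence_emb                      # (batch, 2 * hidden_dim)

encoder = TinyBiLSTMEncoder()
dummy_words = torch.randn(2, 7, 300)             # two sentences of 7 word vectors each
print(encoder(dummy_words).shape)                # torch.Size([2, 4096])
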

Classifier

  • The pooled sentence representation is fed through one or more fully connected layers.
  • These layers serve as the classifier, which takes the pooled encoded vectors as input and generates a classification, identifying whether the relationship between the sentences is entailment, contradiction, or neutral.
  • Three matching methods are applied to extract relations between a pair of sentence embeddings u and v:
    • concatenation of the two representations (u, v)
    • element-wise product u * v
    • element-wise difference |u - v|
  • The resulting vector is fed into a 3-class classifier (entailment, contradiction, and neutral) consisting of multiple fully connected layers followed by a softmax layer. A minimal sketch of this matching step follows the figure below.
Training Sentence Encoder for Classification
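A minimal sketch of this matching-and-classification step, assuming u and v are sentence embeddings from an encoder like the one sketched above; the layer sizes and random inputs are illustrative.

Python3
import torch
import torch.nn as nn

dim = 4096                                  # 2 * hidden_dim from the bi-directional LSTM encoder
classifier = nn.Sequential(
    nn.Linear(4 * dim, 512),                # input is [u, v, |u - v|, u * v]
    nn.ReLU(),
    nn.Linear(512, 3),                      # entailment / contradiction / neutral
)

u = torch.randn(8, dim)                     # premise embeddings (stand-ins)
v = torch.randn(8, dim)                     # hypothesis embeddings (stand-ins)

features = torch.cat([u, v, torch.abs(u - v), u * v], dim=1)
logits = classifier(features)
loss = nn.functional.cross_entropy(logits, torch.randint(0, 3, (8,)))
loss.backward()
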

Python Implementation

Implementing an InferSent model is a somewhat lengthy process, as there is no standard Hugging Face API available for it. InferSent comes pre-trained in two versions: version 1 is trained with GloVe and version 2 with fastText word embeddings. We will use version 2 as it takes less time to download and process.

First, we need to build the InferSent model. The class below does that; it has been sourced from the InferSent GitHub repository.

Python3
# InferSent model class, copied from the InferSent GitHub repository

import time

import numpy as np
import torch
import torch.nn as nn

class InferSent(nn.Module):

    def __init__(self, config):
        super(InferSent, self).__init__()
        self.bsize = config['bsize']
        self.word_emb_dim = config['word_emb_dim']
        self.enc_lstm_dim = config['enc_lstm_dim']
        self.pool_type = config['pool_type']
        self.dpout_model = config['dpout_model']
        self.version = 1 if 'version' not in config else config['version']

        self.enc_lstm = nn.LSTM(self.word_emb_dim, self.enc_lstm_dim, 1,
                                bidirectional=True, dropout=self.dpout_model)

        assert self.version in [1, 2]
        if self.version == 1:
            self.bos = '<s>'
            self.eos = '</s>'
            self.max_pad = True
            self.moses_tok = False
        elif self.version == 2:
            self.bos = '<p>'
            self.eos = '</p>'
            self.max_pad = False
            self.moses_tok = True

    def is_cuda(self):
        # either all weights are on cpu or they are on gpu
        return self.enc_lstm.bias_hh_l0.data.is_cuda

    def forward(self, sent_tuple):
        # sent_len: [max_len, ..., min_len] (bsize)
        # sent: (seqlen x bsize x worddim)
        sent, sent_len = sent_tuple

        # Sort by length (keep idx)
        sent_len_sorted, idx_sort = np.sort(sent_len)[::-1], np.argsort(-sent_len)
        sent_len_sorted = sent_len_sorted.copy()
        idx_unsort = np.argsort(idx_sort)

        idx_sort = torch.from_numpy(idx_sort).cuda() if self.is_cuda() \
            else torch.from_numpy(idx_sort)
        sent = sent.index_select(1, idx_sort)

        # Handling padding in Recurrent Networks
        sent_packed = nn.utils.rnn.pack_padded_sequence(sent, sent_len_sorted)
        sent_output = self.enc_lstm(sent_packed)[0]  # seqlen x batch x 2*nhid
        sent_output = nn.utils.rnn.pad_packed_sequence(sent_output)[0]

        # Un-sort by length
        idx_unsort = torch.from_numpy(idx_unsort).cuda() if self.is_cuda() \
            else torch.from_numpy(idx_unsort)
        sent_output = sent_output.index_select(1, idx_unsort)

        # Pooling
        if self.pool_type == "mean":
            sent_len = torch.FloatTensor(sent_len.copy()).unsqueeze(1).cuda()
            emb = torch.sum(sent_output, 0).squeeze(0)
            emb = emb / sent_len.expand_as(emb)
        elif self.pool_type == "max":
            if not self.max_pad:
                sent_output[sent_output == 0] = -1e9
            emb = torch.max(sent_output, 0)[0]
            if emb.ndimension() == 3:
                emb = emb.squeeze(0)
                assert emb.ndimension() == 2

        return emb

    def set_w2v_path(self, w2v_path):
        self.w2v_path = w2v_path

    def get_word_dict(self, sentences, tokenize=True):
        # create vocab of words
        word_dict = {}
        sentences = [s.split() if not tokenize else self.tokenize(s) for s in sentences]
        for sent in sentences:
            for word in sent:
                if word not in word_dict:
                    word_dict[word] = ''
        word_dict[self.bos] = ''
        word_dict[self.eos] = ''
        return word_dict

    def get_w2v(self, word_dict):
        assert hasattr(self, 'w2v_path'), 'w2v path not set'
        # create word_vec with w2v vectors
        word_vec = {}
        with open(self.w2v_path) as f:
            for line in f:
                word, vec = line.split(' ', 1)
                if word in word_dict:
                    word_vec[word] = np.fromstring(vec, sep=' ')
        print('Found %s(/%s) words with w2v vectors' % (len(word_vec), len(word_dict)))
        return word_vec

    def get_w2v_k(self, K):
        assert hasattr(self, 'w2v_path'), 'w2v path not set'
        # create word_vec with k first w2v vectors
        k = 0
        word_vec = {}
        with open(self.w2v_path) as f:
            for line in f:
                word, vec = line.split(' ', 1)
                if k <= K:
                    word_vec[word] = np.fromstring(vec, sep=' ')
                    k += 1
                if k > K:
                    if word in [self.bos, self.eos]:
                        word_vec[word] = np.fromstring(vec, sep=' ')

                if k > K and all([w in word_vec for w in [self.bos, self.eos]]):
                    break
        return word_vec

    def build_vocab(self, sentences, tokenize=True):
        assert hasattr(self, 'w2v_path'), 'w2v path not set'
        word_dict = self.get_word_dict(sentences, tokenize)
        self.word_vec = self.get_w2v(word_dict)
        print('Vocab size : %s' % (len(self.word_vec)))

    # build w2v vocab with k most frequent words
    def build_vocab_k_words(self, K):
        assert hasattr(self, 'w2v_path'), 'w2v path not set'
        self.word_vec = self.get_w2v_k(K)
        print('Vocab size : %s' % (K))

    def update_vocab(self, sentences, tokenize=True):
        assert hasattr(self, 'w2v_path'), 'warning : w2v path not set'
        assert hasattr(self, 'word_vec'), 'build_vocab before updating it'
        word_dict = self.get_word_dict(sentences, tokenize)

        # keep only new words
        for word in self.word_vec:
            if word in word_dict:
                del word_dict[word]

        # update vocabulary
        if word_dict:
            new_word_vec = self.get_w2v(word_dict)
            self.word_vec.update(new_word_vec)
        else:
            new_word_vec = []
        print('New vocab size : %s (added %s words)'% (len(self.word_vec), len(new_word_vec)))

    def get_batch(self, batch):
        # sent in batch in decreasing order of lengths
        # batch: (bsize, max_len, word_dim)
        embed = np.zeros((len(batch[0]), len(batch), self.word_emb_dim))

        for i in range(len(batch)):
            for j in range(len(batch[i])):
                embed[j, i, :] = self.word_vec[batch[i][j]]

        return torch.FloatTensor(embed)

    def tokenize(self, s):
        from nltk.tokenize import word_tokenize
        if self.moses_tok:
            s = ' '.join(word_tokenize(s))
            s = s.replace(" n't ", "n 't ")  # HACK to get ~MOSES tokenization
            return s.split()
        else:
            return word_tokenize(s)

    def prepare_samples(self, sentences, bsize, tokenize, verbose):
        sentences = [[self.bos] + s.split() + [self.eos] if not tokenize else
                     [self.bos] + self.tokenize(s) + [self.eos] for s in sentences]
        n_w = np.sum([len(x) for x in sentences])

        # filters words without w2v vectors
        for i in range(len(sentences)):
            s_f = [word for word in sentences[i] if word in self.word_vec]
            if not s_f:
                import warnings
                warnings.warn('No words in "%s" (idx=%s) have w2v vectors. \
                               Replacing by "</s>"..' % (sentences[i], i))
                s_f = [self.eos]
            sentences[i] = s_f

        lengths = np.array([len(s) for s in sentences])
        n_wk = np.sum(lengths)
        if verbose:
            print('Nb words kept : %s/%s (%.1f%s)' % (
                        n_wk, n_w, 100.0 * n_wk / n_w, '%'))

        # sort by decreasing length
        lengths, idx_sort = np.sort(lengths)[::-1], np.argsort(-lengths)
        sentences = np.array(sentences)[idx_sort]

        return sentences, lengths, idx_sort

    def encode(self, sentences, bsize=64, tokenize=True, verbose=False):
        tic = time.time()
        sentences, lengths, idx_sort = self.prepare_samples(
                        sentences, bsize, tokenize, verbose)

        embeddings = []
        for stidx in range(0, len(sentences), bsize):
            batch = self.get_batch(sentences[stidx:stidx + bsize])
            if self.is_cuda():
                batch = batch.cuda()
            with torch.no_grad():
                batch = self.forward((batch, lengths[stidx:stidx + bsize])).data.cpu().numpy()
            embeddings.append(batch)
        embeddings = np.vstack(embeddings)

        # unsort
        idx_unsort = np.argsort(idx_sort)
        embeddings = embeddings[idx_unsort]

        if verbose:
            print('Speed : %.1f sentences/s (%s mode, bsize=%s)' % (
                    len(embeddings)/(time.time()-tic),
                    'gpu' if self.is_cuda() else 'cpu', bsize))
        return embeddings

    def visualize(self, sent, tokenize=True):

        sent = sent.split() if not tokenize else self.tokenize(sent)
        sent = [[self.bos] + [word for word in sent if word in self.word_vec] + [self.eos]]

        if ' '.join(sent[0]) == '%s %s' % (self.bos, self.eos):
            import warnings
            warnings.warn('No words in "%s" have w2v vectors. Replacing \
                           by "%s %s"..' % (sent, self.bos, self.eos))
        batch = self.get_batch(sent)

        if self.is_cuda():
            batch = batch.cuda()
        output = self.enc_lstm(batch)[0]
        output, idxs = torch.max(output, 0)
        # output, idxs = output.squeeze(), idxs.squeeze()
        idxs = idxs.data.cpu().numpy()
        argmaxs = [np.sum((idxs == k)) for k in range(len(sent[0]))]

        # visualize model
        import matplotlib.pyplot as plt
        plt.figure(figsize=(12,12))
        x = range(len(sent[0]))
        y = [100.0 * n / np.sum(argmaxs) for n in argmaxs]
        plt.xticks(x, sent[0], rotation=45)
        plt.bar(x, y)
        plt.ylabel('%')
        plt.title('Visualisation of words importance')
        plt.show()

        return output, idxs, argmaxs

Next, we download the fastText embeddings and the pre-trained InferSent model.

Python3
!mkdir fastText
!curl -Lo fastText/crawl-300d-2M.vec.zip https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip
!unzip fastText/crawl-300d-2M.vec.zip -d fastText/

!mkdir encoder
!curl -Lo encoder/infersent2.pkl https://dl.fbaipublicfiles.com/infersent/infersent2.pkl


  • The code downloads the Punkt tokenizer model from NLTK, which is used for tokenization.
  • We then set the path (MODEL_PATH) to the pre-trained InferSent model file and load its weights into an instance of the InferSent class.
  • We specify the path to our word embeddings (W2V_PATH).
  • The build_vocab_k_words method is called to load the embeddings for the K most frequent words (in this case, the top 100,000 words).
Python3
import nltk
nltk.download('punkt')


MODEL_PATH = 'encoder/infersent2.pkl' 
params_model = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
                'pool_type': 'max', 'dpout_model': 0.0}
model = InferSent(params_model)
model.load_state_dict(torch.load(MODEL_PATH))


W2V_PATH = 'fastText/crawl-300d-2M.vec'
model.set_w2v_path(W2V_PATH)

# Load embeddings of K most frequent words
model.build_vocab_k_words(K=100000)


We then use the model for inference.

Python3
# Sample sentence
from scipy.spatial import distance
sentences = ["The movie is awesome. It was a good thriller",
             "We are learning NLP throughg GeeksforGeeks",
             "The baby learned to walk in the 5th month itself"]


test = "I liked the movie"
print('Test Sentence:', test)
test_vec = model.encode([test])[0]

for sent in sentences:
    similarity_score = 1-distance.cosine(test_vec, model.encode([sent])[0])
    print(f'\nFor {sent}\nSimilarity Score = {similarity_score} ')

Output:

Test Sentence: I liked the movie

For The movie is awesome. It was a good thriller
Similarity Score = 0.5299297571182251

For We are learning NLP throughg GeeksforGeeks
Similarity Score = 0.33156681060791016

For The baby learned to walk in the 5th month itself
Similarity Score = 0.20128820836544037

USE – Universal Sentence Encoder

At a high level, it consists of an encoder that summarizes any sentence to give a sentence embedding which can be used for any NLP task.

The encoder part comes in two forms, and either of them can be used:

  • Transformer – Here the encoder part of the original Transformer architecture is used. The architecture consists of 6 stacked transformer layers. Each layer has a self-attention module followed by a feed-forward network. The output context-aware word embeddings are added element-wise and divided by the square root of the sentence length to account for differences in sentence length. We get a 512-dimensional vector as the output sentence embedding.
  • Deep averaging network (DAN) – The embeddings for the words and bi-grams present in a sentence are averaged together. They are then passed through a 4-layer feed-forward network to get a 512-dimensional sentence embedding as output. The embeddings for the words and bi-grams are learned during training. A toy sketch of this averaging idea follows the list.
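A toy sketch of the deep-averaging idea (not Google's actual implementation): average the token embeddings of a sentence and push the average through a small feed-forward stack; all sizes and inputs here are illustrative.

Python3
import torch
import torch.nn as nn

class TinyDAN(nn.Module):
    def __init__(self, emb_dim=300, out_dim=512):
        super().__init__()
        layers, in_dim = [], emb_dim
        for _ in range(4):                         # 4 feed-forward layers
            layers += [nn.Linear(in_dim, out_dim), nn.ReLU()]
            in_dim = out_dim
        self.ffn = nn.Sequential(*layers)

    def forward(self, token_embeddings):           # (batch, seq_len, emb_dim)
        averaged = token_embeddings.mean(dim=1)    # average the word/bi-gram embeddings
        return self.ffn(averaged)                  # (batch, 512) sentence embedding

dan = TinyDAN()
print(dan(torch.randn(2, 10, 300)).shape)          # torch.Size([2, 512])
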

Training of the USE

USE is trained on a variety of unsupervised and supervised tasks, such as skip-thought-style prediction, natural language inference (NLI), and more, using the principles below.

  • Tokenize the sentences after converting them to lowercase
  • Depending on the type of encoder, the sentence gets converted to a 512-dimensional vector
  • The resulting sentence embeddings are used to compute the task losses, which update the model parameters.
  • The trained encoder is then applied to produce a 512-dimensional embedding for any new sentence.
Training of the Encoder

Python Implementation

We load the Universal Sentence Encoder’s TF Hub module.

  • module_url contains the URL to load the Universal Sentence Encoder (version 4) from TensorFlow Hub.
  • The hub.load function is used to load the Universal Sentence Encoder model from the specified URL (module_url).
  • We define a function named embed that takes an input text and returns the embeddings using the loaded Universal Sentence Encoder model.
Python3
import tensorflow as tf
import tensorflow_hub as hub

module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model = hub.load(module_url)
print("module %s loaded" % module_url)


def embed(input):
    return model(input)


We compute the similarity scores on our sample data below.

Python3
from scipy.spatial import distance


test = ["I liked the movie very much"]
print('Test Sentence:',test)
test_vec = embed(test)
# Sample sentence
sentences = [["The movie is awesome and It was a good thriller"],
        ["We are learning NLP throughg GeeksforGeeks"],
        ["The baby learned to walk in the 5th month itself"]]

for sent in sentences:
    similarity_score = 1-distance.cosine(test_vec[0,:],embed(sent)[0,:])
    print(f'\nFor {sent}\nSimilarity Score = {similarity_score} ')

Output:

Test Sentence: ['I liked the movie very much']

For ['The movie is awesome and It was a good thriller']
Similarity Score = 0.6519516706466675

For ['We are learning NLP throughg GeeksforGeeks']
Similarity Score = 0.06988027691841125

For ['The baby learned to walk in the 5th month itself']
Similarity Score = -0.01121298223733902

Conclusion

In this article, we looked at semantic similarity and its applications. We covered the architectures of four widely used sentence embedding models for computing semantic similarity between sentences, along with their implementations in Python.


