Article

DSCEH: Dual-Stream Correlation-Enhanced Deep Hashing for Image Retrieval

1 School of Computer Science and Engineering, Central South University, Changsha 410083, China
2 China Telecom, Changsha 410083, China
3 School of Electronic Information, Central South University, Changsha 410083, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Mathematics 2024, 12(14), 2221; https://doi.org/10.3390/math12142221
Submission received: 4 June 2024 / Revised: 4 July 2024 / Accepted: 9 July 2024 / Published: 16 July 2024

Abstract

Deep hashing is widely used in large-scale image-retrieval tasks to speed up the retrieval process. Current deep hashing methods are mainly based on the Convolutional Neural Network (CNN) or Vision Transformer (VIT). They use only the local or global features for low-dimensional mapping and only a similarity loss function to optimize the correlation between pairwise or triplet images. Therefore, the effectiveness of deep hashing methods is limited. In this paper, we propose a dual-stream correlation-enhanced deep hashing framework (DSCEH), which uses both the local and global features of the image for low-dimensional mapping and optimizes the correlation of images at the model architecture level. DSCEH consists of two main steps: model training and deep-hash-based retrieval. During the training phase, a dual-network structure comprising CNN and VIT is employed for feature extraction. Subsequently, feature fusion is achieved through a concatenation operation, followed by similarity evaluation based on the class token acquired from VIT to establish edge relationships. The Graph Convolutional Network is then utilized to enhance correlation optimization between images, resulting in the generation of high-quality hash codes. This stage produces an optimized hash model for image retrieval. In the retrieval stage, all images within the database and the to-be-retrieved images are first mapped to hash codes using the aforementioned hash model. The retrieval results are then determined based on the Hamming distance between the hash codes. We conduct experiments on three datasets: CIFAR-10, MSCOCO, and NUSWIDE. Experimental results show the superior performance of DSCEH, which enables fast and accurate image retrieval.

1. Introduction

In the era of big data, the explosive growth of image data shared on social networks has generated significant interest in large-scale image retrieval [1]. To address the challenge of accurate retrieval while ensuring efficient computation, there has been a surge in interest towards approximate nearest neighbor (ANN) search [2]. By leveraging the clustered distribution of large-scale data in space, ANN aims to identify potential neighboring data items instead of focusing solely on returning the most probable matches. This approach provides enhanced retrieval efficiency, albeit with a tolerable loss in accuracy. Among various ANN methods, hash learning [3,4,5] has emerged as a prominent research area due to its ability to achieve low storage requirements and facilitate high-speed retrieval.
The idea of hash learning revolves around the process of mapping high-dimensional image data to low-dimensional hash codes while preserving both the semantic information of the image itself and the similarity between images. Existing hashing methods can be divided into shallow methods and deep-learning-based methods. Shallow methods [6,7,8] learn hash functions by manually extracting image features. However, these methods are constrained by the limitations of the semantic description capability inherent in the features themselves, hindering their practical application effectiveness. In contrast, deep-learning-based methods have demonstrated notable advancements in image hashing. Deep-learning-based methods [3,9] leverage Convolutional Neural Networks (CNNs) to extract image features, followed by low-dimensional mapping using Fully Connected  (FC) layers and the Sigmoid function. The overall model is then optimized through a loss function. Deep-learning-based methods can learn more discriminative hash codes through end-to-end training, thereby enhancing the performance of image retrieval. Figure 1 shows the workflow of deep-learning-based methods. As shown in Figure 1, deep hashing involves the conversion of all images into hash codes. The retrieved items’ hash codes are compared with the hash codes stored in the database and subsequently sorted based on the Hamming distance. The retrieval results are then returned accordingly.
Currently, deep hashing methods predominantly rely on the Convolutional Neural Network (CNN) as the image feature extractor, employing it to extract local features for low-dimensional mapping. With the emergence of the Vision Transformer (VIT) [10,11,12], which outperforms CNN in various computer vision tasks (e.g., image classification and object detection), some studies [1,13,14,15] have replaced CNN with VIT for feature extraction and achieved better results. However, existing approaches typically rely on a single type of feature (local or global) for low-dimensional mapping. When CNN is used to extract local features, the size of the receptive field limits the global representation ability of the generated hash codes. On the other hand, VIT models global dependencies through a self-attention mechanism and can effectively capture the global features of images, giving it significant advantages over CNN in processing complex image structures and long-distance dependencies. However, this approach overlooks the local details of the image, diminishing the distinguishability between foreground and background. Consequently, using only one form of image feature negatively impacts the quality of the generated hash codes. Furthermore, existing models [16,17] primarily perform correlation optimization through manually designed loss functions, without reinforcing correlation at the model architecture level, which may leave the potential of architecture-level enhancements untapped. Because it is impossible to account for all scenarios when designing a loss function, certain situations are inevitably overlooked. Moreover, the loss function only constrains the similarity of image pairs or triplets and thus only optimizes the local data distribution. Consequently, the distribution of the generated hash codes may deviate significantly from the data distribution in the high-dimensional image space, leading to a decline in retrieval performance.
Regarding these issues, we propose a novel deep hashing framework, Dual-Stream Correlation-Enhanced Deep Hashing (DSCEH), for large-scale image retrieval tasks. The core idea of this framework is to leverage both local and global features of images through a dual-network structure, followed by correlation optimization using a Graph Convolutional Network (GCN) [18]. Specifically, our approach involves constructing a parallel dual-network architecture that integrates both CNN and VIT as complementary image feature extractors. CNN excels in capturing local details, while VIT is proficient in modeling global dependencies across images. First, we extract local features using CNN and global features using VIT separately from the input images. These features are then fused through a concatenation operation, which enhances the representation capability by leveraging both local discriminative details and global context. This integration enriches the semantic information encoded in the low-dimensional binary hash codes. Subsequently, GCN is employed to optimize the correlation among images. Here, the fused features are treated as nodes, and edge relationships are established based on similarity evaluations using global representations derived from VIT. This process enhances feature propagation and facilitates effective information transfer between images, aiming to maximize the alignment of data distributions between the low-dimensional hash space and the high-dimensional image space. By combining the strengths of CNN and VIT in feature extraction and leveraging GCN for correlation optimization, our method not only enhances the discriminative power of generated hash codes but also effectively addresses the challenge of aligning data distributions across different dimensional spaces.
The rest of this paper is organized as follows. We summarize previous research on deep hashing and review the application of Graph Convolutional Networks in the field of data retrieval in Section 2. In Section 3, we give an overview of our framework through formal definitions. Section 4 reports the datasets, experiment settings, compared methods, evaluation criteria, results, and discussions. Finally, we conclude this paper in Section 5.

2. Related Work

In this section, we introduce the related work of image retrieval based on deep hashing, and the application of GCN in the field of data retrieval.

2.1. Deep Hashing for Image Retrieval

As the Convolutional Neural Network (CNN) continues to advance, and with the rise of the Vision Transformer (VIT), most deep hashing methods [1,3,19,20,21] adopt one of these two backbones for feature extraction. On the one hand, recent works [9,22,23,24,25] have extensively shown that CNN obtains a compact vector representation of local image neighborhoods through hierarchical convolution operations. Song et al. [22] designed a deep adversarial network with CNN as the backbone to embed images into binary hash codes in an unsupervised manner. Zhang et al. [23] combined CNN and KL divergence (KLD) to improve feature robustness by gradually reducing intra-class KLD. Qiao et al. [9] deployed an end-to-end CNN-based Siamese network for the specific application of retrieving face images from related videos. Guérin et al. [24] proposed a multi-input neural network architecture that uses different CNN architectures to extract image features and thereby enrich the information they contain. Ng et al. [25] used Vgg19 [26] as the backbone and enhanced the complementarity between features by training hash codes at different levels of Vgg19. On the other hand, VIT aggregates compressed image patches and obtains a global representation through the self-attention module. A number of studies [1,21] have explored feature extraction models with VIT as the backbone network and achieved impressive retrieval results compared with CNN-based networks. As mentioned above, both CNN-based and VIT-based models typically utilize a single form of image feature for low-dimensional mapping. While these models have achieved promising experimental results, there is still room to improve image retrieval performance. To address this, we introduce a novel approach that leverages both feature forms to produce hash codes with enhanced semantic representation.

2.2. Graph Convolutional Network for Data Retrieval

The Graph Convolutional Network (GCN) utilizes the adjacency matrix of nodes in a graph to update the features of similar nodes. By leveraging the relationships of edges, GCN iteratively propagates information among nodes to preserve the similarity between input data instances. When combined with hash learning [27], GCN effectively integrates similar information into image features, enabling the generation of hash codes that preserve the inherent relationships present in the original images. This integration enhances the performance of hash retrieval, contributing to improved retrieval accuracy and efficiency.
At present, GCN has been widely used in the field of multi-modal data retrieval. Graph Convolution Hashing (GCH) [28] uses GCN to mine the structural and semantic similarity between text and image data and designs a semantic encoder to guide the feature-encoding process, obtaining better hash codes and better retrieval performance. Graph Convolution Network Hashing (GCNH) [29] improves on GCH by introducing asymmetric graph convolution (AGC), which simultaneously convolves the input data, the anchor graph, and the convolutional filters to address the scalability problem of hashing over affinity graphs. Flexible Graph Convolutional Multi-modal Hashing (FGCMH) [30] uses intra-modality and fused-modality structural similarities to learn fusion features, aggregates the feature information of each modality, optimizes the correlation with graphs in an adaptively weighted manner, alleviates the gap between modalities, and mitigates the loss of modality-specific features when querying with single-modality data.
Benefiting from the influence of the above multi-modal hash retrieval methods, we propose the incorporation of GCN into our framework. In contrast to prior studies [17,31,32] that primarily rely on loss functions to constrain hash code similarity, our approach aims to enhance correlation optimization at the model architecture level. This innovative strategy leverages GCN’s capabilities in facilitating information propagation and reinforcing the relationships among image features. By doing so, we strive to ensure that the low-dimensional hash space maintains a data distribution consistent with the high-dimensional image space.

3. DSCEH Framework

In this section, we give an overview and the details of the proposed DSCEH framework.
We first give a formal description of deep-hash-based image retrieval. Given the training image dataset $X = [x_1, x_2, \ldots, x_n]$ containing $n$ samples, where $x_i \in \mathbb{R}^{W \times H \times C}$ is the $i$-th sample with width $W$, height $H$, and $C$ channels, we denote the corresponding label set as $Y = \{y_1, y_2, \ldots, y_n\}$, where $y_i \in \mathbb{R}^{c}$ and $c$ is the number of categories. For any two images in the dataset, we can generate a similarity matrix $S$, where $s_{ij} = 1$ if $x_i$ and $x_j$ are similar and $s_{ij} = 0$ otherwise. The framework aims to learn a hash function that maps images from the high-dimensional space to a low-dimensional space to generate $k$-bit hash codes $H = [h_1, h_2, \ldots, h_n] \in \{-1, 1\}^{n \times k}$. We formulate the process as

$$h = \operatorname{sgn}(F(x))$$

where $F$ is the hash function to be optimized by the model and $\operatorname{sgn}(\cdot)$ denotes the sign function used to discretize the continuous hash codes. The important notations and definitions are summarized in Table 1.
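To make the mapping concrete, the following minimal PyTorch sketch discretizes the output of a hypothetical hash head with the sign function; the layer sizes and the tanh relaxation are illustrative assumptions, not the exact architecture of DSCEH.

```python
import torch
import torch.nn as nn

# A minimal sketch of h = sgn(F(x)): a hypothetical hash head F that maps an image
# feature vector to k continuous values, then discretizes them into a {-1, +1} code.
# The 9984-dimensional input is an assumed fused-feature size, used only for illustration.
k = 64
hash_head = nn.Sequential(nn.Linear(9984, k), nn.Tanh())

feature = torch.randn(1, 9984)            # placeholder image feature
h = torch.sign(hash_head(feature))        # k-bit hash code in {-1, +1}^k
print(h.shape)                            # torch.Size([1, 64])
```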

3.1. Overview

Figure 2 shows the overview of our proposed DSCEH, which consists of two steps: training and retrieval. In the training step, we first employ the parallel CNN and VIT to extract the input image's local and global features simultaneously and then realize feature fusion through a concatenation operation. We use the class token from VIT for similarity estimation and construct a similarity adjacency matrix. Then, we introduce GCN to mine the similarity of the images by using the fused features and the similarity adjacency matrix. Finally, through model optimization, we obtain a hash function that can generate efficient and compact hash codes. In the retrieval step, we use the model optimized in the training stage to quantize all the images in the test set and the database, perform retrieval for the test set according to the Hamming distance, and finally report the retrieval performance. In the following, we provide a detailed description of our proposed framework, including feature extraction, graph-based correlation optimization, and learning to preserve the similarity hash.

3.2. Feature Extraction

In this stage, we focus on learning an end-to-end network that effectively produces compact representations of image features. Both the local features and global representations we consider have been extensively studied, and we aim to fuse them to enhance the representation learning of the network. We design a parallel dual-network structure that consists of a CNN branch and a transformer branch, constructed from Alexnet [33] and VTS16 [11], respectively. Finally, we fuse the features of the two branches through the feature fusion unit, which greatly enhances both the global perception ability and the local expressive detail of the network.

3.2.1. CNN Branch

The CNN branch has a pyramidal structure and achieves local feature extraction by progressively reducing the resolution of the feature maps with network depth. We deploy the classic Alexnet [33] model, using its first five layers with well-trained parameters for local feature extraction. As defined in Alexnet, the network flow can be summarized as Convolution, Relu, and Maxpool. The kernel sizes and strides of the convolution kernels are [(11,4), (5,1), (3,1)], the kernel size of Maxpool is 3, and its stride is 2. Finally, we obtain the local features $F_{local} \in \mathbb{R}^{256 \times 6 \times 6}$ through the feature mapping formed by the stacked convolution kernels and Maxpool layers.
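A minimal way to reproduce this branch, assuming the standard torchvision implementation of Alexnet, is to reuse its pretrained convolutional stack; the sketch below shows the resulting 256 × 6 × 6 local feature map for a 224 × 224 input.

```python
import torch
from torchvision import models

# Sketch of the CNN branch: the convolutional stack of a pretrained torchvision AlexNet
# (its `features` module) maps a 224 x 224 input to the 256 x 6 x 6 local feature F_local.
alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
cnn_branch = alexnet.features             # Conv/ReLU/MaxPool stack described above

x = torch.randn(1, 3, 224, 224)           # dummy input image
f_local = cnn_branch(x)
print(f_local.shape)                      # torch.Size([1, 256, 6, 6])
```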

3.2.2. Transformer Branch

This branch extracts the global feature of the input image and consists of the following modules: patch embedding, position embedding, and the self-attention encoder. Specifically, the input image $x$ is divided into $b$ patches, i.e., $b = (H \times W)/P^2$, where $P$ is the side length of a patch; in this experiment, we divide the $224 \times 224$ image into $14 \times 14$ patches, so the side length of each patch is 16. Each image patch $x_p \in \mathbb{R}^{P \times P \times C}$ is flattened into a 1D vector, and then the class token $x_{cls}$ is added to obtain the embedded patch representation. The embedding vector $x_{emd}$ can be formulated as

$$x_{emd} = x_{cls} + [x_p^1, \ldots, x_p^b]$$

where $x_{emd} \in \mathbb{R}^{(b+1) \times (P \times P \times C)}$. Following the standard VIT, we add a position embedding $\mathrm{Pos}$ with consistent dimensions to each embedding vector to obtain the output vector $x_{pos}$, formulated as

$$x_{pos} = x_{emd} + \mathrm{Pos}$$

The transformer encoder generally consists of $M$ blocks, each of which sequentially applies layer normalization (LN), a multi-headed self-attention layer (MSA), and a multilayer perceptron (MLP). The computation of the $m$-th transformer block can be formulated as

$$z'_m = \mathrm{MSA}(\mathrm{LN}(z_{m-1})) + z_{m-1}, \quad z_m = \mathrm{MLP}(\mathrm{LN}(z'_m)) + z'_m$$

where $m = 1, 2, \ldots, M$. Finally, we obtain the class token $x_{cls}$ and the global representation $F_{global}$ from this branch.
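The sketch below illustrates this branch with standard PyTorch modules: patch embedding via a strided convolution, a learnable class token and position embedding, and a stack of pre-norm encoder blocks. The hyperparameters mirror the settings reported later (patch size 16, hidden size 768, 12 heads, 6 blocks), but the module is an illustrative stand-in rather than the authors' VTS16 code.

```python
import torch
import torch.nn as nn

class MiniViTBranch(nn.Module):
    # Minimal sketch of the transformer branch: patch embedding, class token,
    # position embedding, and M pre-norm encoder blocks (MSA + MLP with residuals).
    def __init__(self, img_size=224, patch=16, dim=768, heads=12, depth=6):
        super().__init__()
        num_patches = (img_size // patch) ** 2                  # b = (H x W) / P^2 = 196
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))   # x_cls
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # Pos
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=depth)

    def forward(self, x):                                       # x: (B, 3, 224, 224)
        p = self.patch_embed(x).flatten(2).transpose(1, 2)      # (B, 196, dim) patch tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        z = torch.cat([cls, p], dim=1) + self.pos_embed         # x_pos = x_emd + Pos
        z = self.encoder(z)                                     # M transformer blocks
        return z[:, 0], z[:, 1:]                                # class token x_cls, global tokens

x_cls, f_global = MiniViTBranch()(torch.randn(2, 3, 224, 224))
print(x_cls.shape, f_global.shape)        # torch.Size([2, 768]) torch.Size([2, 196, 768])
```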

3.2.3. Feature Fusion Unit

This unit fuses $F_{local}$ and $F_{global}$ into $F_{fused}$ through a concatenation operation:

$$F_{fused} = [F_{global}, F_{local}]$$

In this way, for any image $x_i$, we can extract the corresponding fused feature $F_{fused}^i$ through the designed dual network, together with the corresponding transformer class token $x_{cls}^i$, which is used for the subsequent correlation optimization.
The fused feature $F_{fused}$ enhances the retrieval process by integrating both fine-grained local information and broad global context, which improves the discriminative power of the feature representation. This is particularly beneficial in complex retrieval tasks where local and global information is crucial for accurate matching.
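A toy illustration of the fusion unit is given below; the branch output shapes are assumptions carried over from the sketches above, and the 4096 × 2 fused dimension reported in the experimental settings suggests an additional per-branch projection that is omitted here for brevity.

```python
import torch

# Illustration of the fusion unit: flatten the CNN map and concatenate it with the
# ViT global vector, giving F_fused = [F_global, F_local].
f_local = torch.randn(8, 256, 6, 6)                           # F_local from the CNN branch
f_global = torch.randn(8, 768)                                # e.g., the ViT class/global vector
f_fused = torch.cat([f_global, f_local.flatten(1)], dim=1)    # concatenation along the feature axis
print(f_fused.shape)                                          # torch.Size([8, 9984])
```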

3.3. Graph-Based Correlation Optimization

To make full use of the obtained fused features $F_{fused}$ and improve the hash-retrieval accuracy under supervised learning, we exploit GCN to mine the correlation information of the image data through the class token $x_{cls}$ and the fused feature $F_{fused}$ introduced above. Hence, given an image pair $(x_i, x_j)$, we can easily obtain the corresponding class token pair $(x_{cls}^i, x_{cls}^j)$ via the feature extraction process above.
Following the general procedure introduced in [28], we use the class tokens $x_{cls}$ to calculate the cosine similarity between image pairs and construct the similarity adjacency matrix $S_{adj}$, which provides the edge relationships between nodes. The correlation constraints between nodes are strengthened based on these edge relationships. The specific calculation is as follows:

$$S_{adj}(i, j) = \frac{x_{cls}^i \cdot x_{cls}^j}{\|x_{cls}^i\| \times \|x_{cls}^j\|}$$

where $S_{adj} \in \mathbb{R}^{n \times n}$. With the similarity adjacency matrix $S_{adj}$ computed, we adopt a multi-layer GCN to exploit the fused features with strong semantic correlation, updating them and generating interactions that improve semantic preservation. Specifically, we regard each image as a node, the fused features $F_{fused}$ as node features, and the similarity adjacency matrix $S_{adj}$ as the edge relationships. We continuously optimize the node features based on the edge relationships to promote feature flow between nodes. As introduced in [28], we formulate the layer-wise propagation rule as follows:
$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{S}_{adj} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$$

where $H^{(l)}$ denotes the output of the $l$-th GCN layer, $\sigma(\cdot)$ represents the activation function (e.g., the Relu function), $\tilde{S}_{adj} = S_{adj} + I_n$ with $I_n$ the identity matrix, $\tilde{D}$ is the degree matrix of $\tilde{S}_{adj}$, and $W^{(l)}$ represents the weight parameters of the $l$-th GCN layer. We dynamically update the GCN parameters during training for similarity enhancement.
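The following sketch puts the two formulas together: it builds the cosine-similarity adjacency from class tokens, adds self-loops, applies the symmetric degree normalization, and propagates the fused node features through one layer; the batch size and feature dimensions are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def gcn_layer(h, cls_tokens, weight):
    # One propagation step of the correlation module (a sketch): S_adj from cosine
    # similarity of class tokens, then sigma(D^-1/2 (S_adj + I) D^-1/2 H W).
    x = F.normalize(cls_tokens, dim=1)
    s_adj = x @ x.t()                                    # S_adj in R^{n x n}
    s_tilde = s_adj + torch.eye(s_adj.size(0))           # add self-loops: S_adj + I_n
    d_inv_sqrt = s_tilde.sum(dim=1).clamp(min=1e-12).pow(-0.5)
    norm_adj = d_inv_sqrt.unsqueeze(1) * s_tilde * d_inv_sqrt.unsqueeze(0)
    return torch.relu(norm_adj @ h @ weight)             # propagate node (fused) features

# Assumed sizes for illustration: n = 100 images per batch, 8192-d fused node features,
# and a first GCN layer mapping to 2048 dimensions.
h = torch.randn(100, 8192)
cls = torch.randn(100, 768)
w = torch.randn(8192, 2048) * 0.01
print(gcn_layer(h, cls, w).shape)                        # torch.Size([100, 2048])
```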

3.4. Learning to Preserve the Similarity Hash

To further ensure that the hash codes generated by the hash model are used for efficient image retrieval, we set two optimization goals as follows:
  • Similarity loss $L_{sim}$ aims to keep the similarity information in the low-dimensional hash space consistent with that in the high-dimensional image space.
  • Semantic loss $L_{sem}$ aims to ensure that the hash codes finally generated by the model still effectively preserve the semantic information of the original images.
We jointly optimize the above two loss functions with the Adam optimizer [34] to obtain the best network parameters; the final loss function $L_{CoDe}$ is

$$L_{CoDe} = \arg\min_{W}\; L_{sim} + \sigma L_{sem} + \alpha \|W\|_2$$

where $\sigma$ and $\alpha$ are balance constants, $W$ denotes the model parameters, and $\|\cdot\|_2$ is the $L_2$ norm. We describe the definitions of the loss functions in detail below.

3.4.1. Similarity Loss $L_{sim}$

As described in [21,35], the goal of hash learning is to maximize the similarity between the high-dimensional feature distribution of the original images (denoted as $\mathcal{A}$) and the Hamming-space distribution of the hash codes (denoted as $\mathcal{D}$). Let $a_{ij}$ denote the similarity of the original image pair $(x_i, x_j)$ in $\mathcal{A}$ and $d_{ij}$ denote the similarity in $\mathcal{D}$. Here, we use $s_{ij}$ in place of $a_{ij}$ and calculate $d_{ij}$ from the cosine similarity between the binary hash codes $h_i$ and $h_j$:

$$d_{ij} = \frac{\cos(h_i, h_j) + 1}{2}$$

Similar to previous work [36,37], we adopt the JS-divergence-based similarity loss function $L_{sim}$ as follows:

$$L_{sim} = \sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\, s_{ij} \log \frac{2 s_{ij}}{s_{ij} + d_{ij}} + \sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\, d_{ij} \log \frac{2 d_{ij}}{s_{ij} + d_{ij}}$$

where $w_{ij}$ is the weight of each image pair $(x_i, x_j)$, calculated with the weighted likelihood method:

$$w_{ij} = \begin{cases} |S|/|S_1|, & s_{ij} = 1 \\ |S|/|S_0|, & s_{ij} = 0 \end{cases}$$

where $S_1 = \{s_{ij} \in S : s_{ij} = 1\}$ denotes the similar pairs in the dataset and $S_0 = \{s_{ij} \in S : s_{ij} = 0\}$ denotes the dissimilar pairs.
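A hedged sketch of this loss is shown below; it treats the batch similarity matrix as $S$, relaxes the binary codes to continuous network outputs, and adds a small epsilon for numerical stability, which is an implementation assumption rather than part of the formulation above.

```python
import torch
import torch.nn.functional as F

def similarity_loss(h, s, eps=1e-6):
    # Sketch of L_sim: d_ij = (cos(h_i, h_j) + 1) / 2 from the (relaxed) codes, pairs
    # weighted by |S|/|S_1| (similar) or |S|/|S_0| (dissimilar), then the JS-style terms.
    hn = F.normalize(h, dim=1)
    d = (hn @ hn.t() + 1.0) / 2.0                         # d_ij in [0, 1]
    n_pairs = float(s.numel())
    n_sim = s.sum().clamp(min=1.0)
    w = torch.where(s > 0.5, n_pairs / n_sim, n_pairs / (n_pairs - s.sum()).clamp(min=1.0))
    loss = w * s * torch.log((2 * s + eps) / (s + d + eps)) \
         + w * d * torch.log((2 * d + eps) / (s + d + eps))
    return loss.sum()

# Toy usage: 16 relaxed codes (before sign) and a random 0/1 similarity matrix.
codes = torch.randn(16, 64)
sim = (torch.rand(16, 16) > 0.5).float()
print(similarity_loss(codes, sim))
```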

3.4.2. Semantic Loss $L_{sem}$

We make full use of the given label pair information $(y_i, y_j)$ to preserve the semantic information of the image. We encode the labels in one-hot form and classify the hash codes to obtain classification vectors $L = \{l_i\}_{i=1}^{n}$, where each $l_i$ is a $c$-dimensional vector, $l_i = \{l_{i1}, l_{i2}, \ldots, l_{ic}\}$. We then deploy the cross-entropy function to construct the loss $L_{sem}$:

$$L_{sem} = -\sum_{i=1}^{n} p_i \sum_{j=1}^{c} y_{ij} \log l_{ij}$$

where $p_i$ is a weight parameter that enhances the penalty for misclassified data, calculated with the precision-based method:
$$p_i = \frac{c_t - c_{tp}}{c_t}$$

where $c_t$ is the number of images predicted to be the category of the $i$-th image, and $c_{tp}$ is the number of images in the training batch that are correctly classified into the category of the $i$-th image. After training, for any input image $x_i$ we can obtain the $k$-bit hash code $h_i$.
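The sketch below is one possible reading of this loss; in particular, the per-sample weight follows the reconstruction $p_i = (c_t - c_{tp})/c_t$ above, which is an assumption about the precision-based weighting, and the softmax classifier head is likewise illustrative.

```python
import torch
import torch.nn.functional as F

def semantic_loss(logits, one_hot_labels):
    # Sketch of L_sem: cross-entropy between one-hot labels y_i and the classification
    # vectors l_i predicted from the hash codes, weighted per sample by p_i.
    # p_i = (c_t - c_tp) / c_t is an assumed reading, where c_t counts batch samples
    # predicted as image i's class and c_tp those correctly classified into that class.
    pred = logits.argmax(dim=1)
    true = one_hot_labels.argmax(dim=1)
    log_l = F.log_softmax(logits, dim=1)
    weights = []
    for i in range(logits.size(0)):
        c = pred[i]
        c_t = (pred == c).sum().float().clamp(min=1.0)
        c_tp = ((pred == c) & (true == c)).sum().float()
        weights.append((c_t - c_tp) / c_t)
    p = torch.stack(weights)
    ce = -(one_hot_labels * log_l).sum(dim=1)             # -sum_j y_ij log l_ij
    return (p * ce).sum()

# Toy usage: a c = 10-way classifier head applied to 16 hash-code outputs.
logits = torch.randn(16, 10)
labels = F.one_hot(torch.randint(0, 10, (16,)), num_classes=10).float()
print(semantic_loss(logits, labels))
```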

3.4.3. Model Optimization

In our optimization process, the model generates low-dimensional binary hash codes in a feedforward pass. We compute the discrepancy between the outputs and the labels using the two loss functions above and then employ the Adam optimizer to update the model parameters via backpropagation. Once the variation of the loss function stabilizes, we consider the optimization complete. To enhance optimization effectiveness, we employ the hyperparameter optimization library Optuna to tune the hyperparameters $\sigma$ and $\alpha$ in the loss function.
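As an illustration of this tuning step, a minimal Optuna loop might look as follows; the search ranges, the objective, and the train_and_evaluate placeholder are assumptions rather than the actual configuration used in the experiments.

```python
import optuna

def train_and_evaluate(sigma, alpha):
    # Placeholder for a short training run of the hash model returning a validation metric
    # (e.g., mAP); the surrogate below only makes the sketch runnable.
    return 1.0 / (1.0 + abs(sigma - 1.0) + 100.0 * alpha)

def objective(trial):
    # Sample the loss-balancing hyperparameters sigma and alpha (illustrative ranges).
    sigma = trial.suggest_float("sigma", 1e-3, 10.0, log=True)
    alpha = trial.suggest_float("alpha", 1e-6, 1e-2, log=True)
    return train_and_evaluate(sigma, alpha)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```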

3.5. Hash Model Application for Image Retrieval

Through the training phase, we obtain the well-trained hash model. We first binarize all images in the database with the hash model. Next, we input the images in the test set to evaluate the hash model and report the retrieval performance based on the top items of the returned list. The model returns the top-K similar images ranked by Hamming distance, i.e., the number of bit positions at which two hash codes of the same length differ. Here, we use the inner product of the hash codes to measure the Hamming distance $D_{ij}$, which can be calculated as follows:
$$D_{ij} = \frac{1}{2}\left(k - h_i \cdot h_j\right)$$
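A short sketch of the retrieval step under this formulation is given below, assuming {−1, +1} codes so that the Hamming distance can be read off the inner product.

```python
import torch

def retrieve_topk(query_codes, db_codes, topk=100):
    # Rank database items for each query by Hamming distance, computed from the inner
    # product of {-1, +1} codes as D_ij = (k - <h_i, h_j>) / 2 (sketch of the retrieval step).
    k = query_codes.size(1)
    dist = (k - query_codes @ db_codes.t()) / 2.0          # (num_queries, num_db)
    return dist.topk(topk, dim=1, largest=False).indices   # indices of the topk nearest codes

# Toy usage with random 64-bit codes.
queries = torch.randn(5, 64).sign()
database = torch.randn(1000, 64).sign()
print(retrieve_topk(queries, database, topk=10).shape)     # torch.Size([5, 10])
```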

4. Experimentation

This section describes the experimental details, including the dataset description, experiment settings, compared methods, and evaluation criteria. Finally, we report the experimental results and discuss them.

4.1. Datasets

We conduct our experiments on three popular datasets: CIFAR-10 [16], MSCOCO [38], and NUSWIDE [39]. CIFAR-10 has 60,000 images in 10 categories. MSCOCO contains 132,218 images in 80 categories. NUSWIDE has 206,334 images in 21 categories. For experimental comparison, we follow [16] to split each dataset into a training set, a query set, and a retrieval set. The details of the three dataset settings are shown in Table 2. In the training phase, we use the training set and form image pairs (i.e., $x_i$, $x_j$) for model training. In the evaluation phase, we build the database from the retrieval set and measure the performance of the hashing model by querying the database with the query set. Example images are shown in Figure 3.

4.2. Experiment Settings

We implement the proposed framework and the compared methods on the Pytorch platform. In our experiments, we use random cropping for data augmentation, and the input images are cropped to (224, 224). For the feature extraction module, we set Alexnet [33] as the backbone of the CNN branch. Moreover, we use VTS16 [11] for the VIT branch with the following settings: the patch size is 16, the hidden size of the encoder is 768, the number of heads for multi-head attention is 12, and the number of transformer blocks is 6. The GCN we use consists of only two layers: the first layer maps the $4096 \times 2$-dimensional fused feature $F_{fused}$ to 2048 dimensions, and the second layer maps 2048 dimensions to 512 dimensions. For all experiments, we use Adam as the optimizer, the learning rate is set to $1 \times 10^{-5}$, and the batch size is 100.

4.3. Compared Methods

We compare our DSCEH framework with the following deep hashing methods: DSH [40], HashNet [17], DCH [31], IDHN [20], and QSMIH [41]. In these comparative experiments, Alexnet is used as the feature extractor. DSH, HashNet, and DCH utilize the similarity loss and quantization loss to optimize the hash model. IDHN also optimizes its model using the corresponding loss function, but it has been specifically designed for multi-label scenarios. QSMIH employs Quadratic Mutual Information (QMI) to learn a compact code. The comparative experiments conducted primarily focus on optimizing image similarity by employing suitable loss functions, without explicitly considering the impact of network architecture on retrieval performance. To demonstrate the effectiveness of our framework, we compare it with the open-source code of the aforementioned methods.

4.4. Evaluation Criteria

We adopt the popular criterion mean Average Precision (mAP), as introduced in [42]. Specifically, mAP is the mean of the Average Precision (AP), which can be calculated as follows:

$$AP = \frac{1}{R_{rel}} \times \sum_{i=1}^{K} \frac{R_{rel}^{i}}{i}\, rel(i)$$

where $K$ is the number of top items in the returned retrieval list, $R_{rel}$ denotes the number of similar images among the top $K$ items, $R_{rel}^{i}$ is the number of similar images among the first $i$ items, and $rel(i)$ indicates whether the $i$-th image is similar. Moreover, we use the Precision–Recall curve (PR curve) [43] for experimental comparison.
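The following sketch computes AP@K for one ranked return list under this definition; mAP is then the mean of AP over all queries. The helper name and toy relevance list are illustrative.

```python
import numpy as np

def average_precision_at_k(relevance, k):
    # Sketch of AP@K: `relevance` is a 0/1 list over the ranked return list (rel(i) above).
    rel = np.asarray(relevance[:k], dtype=float)
    if rel.sum() == 0:
        return 0.0
    cum_rel = np.cumsum(rel)                              # R_rel^i: similar images in the first i items
    precision_at_i = cum_rel / np.arange(1, len(rel) + 1)
    return float((precision_at_i * rel).sum() / rel.sum())   # divide by R_rel (similar in top K)

# Toy usage: mAP is the mean of AP over all queries.
print(average_precision_at_k([1, 0, 1, 1, 0, 0, 1, 0, 0, 0], k=10))   # ~0.747
```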

4.5. Experimental Results and Discussion

Table 3 presents the results of mean Average Precision (mAP) for each hash bit on the three datasets. Additionally, Figure 4 displays the PR curves for each hash bit on these datasets. Based on the experimental results, we make the following observations. Table 3 confirms that DSCEH consistently achieves better retrieval performance across all experimental settings compared to the other methods. Moreover, Figure 4 demonstrates that the curve of DSCEH consistently outperforms the curves of other methods. This indicates that our proposed framework achieves superior performance in terms of precision and recall compared to the alternative methods.
For example, DSCEH achieves the highest mAP value of 0.9824 on CIFAR-10 with 32-bit hash code compared to other hash methods, where the improvement in the proposed method is 0.1990 compared with QSMIH, 0.1904 compared with IDHN, 0.2266 compared with DCH, 0.2138 compared with HashNet, and 0.5569 compared with DSH. On MSCOCO, DSCEH with a 64-bit hash code has an absolute improvement of 0.1130 compared with QSMIH, 0.1205 compared with IDHN, 0.1424 compared with DCH, 0.0914 compared with HashNet, and 0.3506 compared with DSH. On NUSWIDE, DSCEH with a 64-bit hash code has an improvement of 0.0351 compared with QSMIH, 0.0572 compared with IDHN, 0.0896 compared with DCH, 0.0295 compared with HashNet, and 0.3006 compared with DSH.
The better performance of the proposed DSCEH framework can be attributed to the following reasons. Unlike the state-of-the-art methods, DSCEH performs hash mapping by simultaneously extracting local features and global representations of images. More specifically, we design a dual-stream feature extractor based on VIT and CNN, which considers both the global and local features of an image when conducting image retrieval tasks. As a result, it preserves richer semantic information of the original image and generates higher-quality hash codes than methods that use only a single CNN or VIT as the feature extractor, such as DSH and HashNet. In addition, we introduce GCN in the correlation optimization module to explore the potential relationships between images, especially when these relationships are complex, and to enhance the correlation between similar image features. Compared with methods such as QSMIH and IDHN, which optimize only through a similarity loss function, our correlation optimization module better preserves the relevant information of the high-dimensional space.
Moreover, we derive the following two observations. First, as the number of hash bits increases, the retrieval performance of DSCEH improves, which shows that more hash bits can better preserve the semantic information of the original high-dimensional images. Second, the mAP of DSCEH on MSCOCO and NUSWIDE is lower than on CIFAR-10 because the properties of the three datasets differ: MSCOCO and NUSWIDE have much larger data volumes and more categories than CIFAR-10. Specifically, they contain a higher degree of intra-class variability and a more complex set of inter-class relationships, which challenges the model's ability to learn discriminative features effectively. Additionally, the increased data volume requires more robust computational resources and optimization techniques to manage the higher dimensionality and variability, which might not be fully leveraged in our current model configuration.

4.6. Ablation Study

4.6.1. On Feature Extraction

To verify the effectiveness of the feature extractor in DSCEH, we conduct comparative experiments on CIFAR-10 in which we keep only the CNN branch (denoted as CNN+GCN) or only the VIT branch (denoted as VIT+GCN) to extract image features. When only CNN is used as the feature extractor, the features extracted by CNN serve as the node information of the GCN, and the similarity adjacency matrix obtained by similarity evaluation of the image features serves as the edge information. When only VIT is used as the feature extractor, the features extracted by VIT serve as the node information of the GCN, and the similarity adjacency matrix obtained by similarity evaluation of the class tokens serves as the edge information.
We report the mAP and the corresponding PR curves of DSCEH with different feature extraction settings in Table 4 and Figure 5, respectively. From the experimental results, we make the following observations. First, for the setting without CNN on the 32-bit hash code, the mAP drops from 0.9824 to 0.9147 compared to the original setting, while in the setting without VIT it drops from 0.9824 to 0.8185, a larger decrease than in the setting without CNN. Second, for the setting without VIT, the mAP at 48 bits is 0.0526 higher than at 64 bits, which shows the instability of this setting. These results demonstrate the effectiveness of our dual-network feature extractor.

4.6.2. On Correlation Optimization

In addition, to test the effectiveness of GCN in the correlation optimization process, we conduct an ablation study without the GCN module (denoted as CNN+VIT) on CIFAR-10. We report the experimental results with different correlation optimization settings in Table 4. The mAP decreases significantly for every hash bit length, and the mAP at 32 bits is noticeably higher than at the longer lengths; that the shortest code length yields the best result indicates that the model is unstable without GCN-based correlation optimization. We also present the corresponding PR curves in Figure 6. These results demonstrate the effectiveness of our designed network.

4.7. Impact of the Number of VIT Encoders

Since the running time of VIT is much longer than that of CNN, we also explore the effect of the number of encoders in VIT on the performance of our model to test the efficiency of our feature extractor. As shown in Table 5, on CIFAR-10 the retrieval performance is already excellent with 6 encoders, where the mAP value reaches 0.9877. Increasing the number of encoders does not significantly improve the retrieval effect but does increase the running time of the model. This demonstrates that our method can deliver excellent retrieval results without an excessive number of encoders and operates efficiently in practical use, avoiding lengthy running times.

5. Conclusions

This paper proposes a novel dual-stream correlation-enhanced deep hashing framework (DSCEH) for large-scale image retrieval tasks. Specifically, we use CNN and VIT to extract the local features and global representations of images and perform feature fusion so that the generated hash codes contain richer semantic information. In addition, we introduce GCN to enhance the correlation between similar images so that the hash codes better maintain the similarity relationships of the original space. The entire framework is trained and optimized in an end-to-end manner. We conducted extensive experiments on three datasets: CIFAR-10, MSCOCO, and NUSWIDE. The experimental results demonstrate that the retrieval performance of our framework is remarkable compared with other deep hashing methods. Looking ahead, future research will concentrate on refining our model architecture to reduce computational complexity while enhancing retrieval effectiveness. Specifically, we aim to explore adaptive learning techniques for better handling of intra-class variability in large datasets and investigate ways to incorporate domain-specific priors or constraints to further refine retrieval accuracy.

Author Contributions

Methodology, H.C. and Y.Y.; investigation, R.L.; data curation, S.L.; writing—original draft preparation, H.C. and Y.Z.; writing—review and editing, C.H. and R.S.; project administration, C.H.; funding acquisition, C.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was sponsored in part by the National Natural Science Foundation of China (No:62177046), Hunan Provincial Educational Science Research Base Project (No:XJK23AJD021), Philosophy and Social Sciences Foundation of Hunan Province (No:22YBA012), Hunan Province Science and Technology Innovation Project (No:S2021GCZDYF1405) and High Performance Computing Center of Central South University.

Data Availability Statement

The original data presented in the study are openly available in CIFAR-10 at https://www.cs.toronto.edu/~kriz/cifar.html, NUSWIDE at https://www.kaggle.com/datasets/xinleili/nuswide, MSCOCO at https://cocodataset.org/.

Acknowledgments

The authors would like to thank the anonymous reviewers for their insightful comments.

Conflicts of Interest

Author Yu Zhan was employed by China Telecom. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Chen, Y.; Zhang, S.; Liu, F.; Chang, Z.; Ye, M.; Qi, Z. TransHash: Transformer-based Hamming Hashing for Efficient Image Retrieval. In Proceedings of the ICMR ’22: International Conference on Multimedia Retrieval, Newark, NJ, USA, 27–30 June 2022; Oria, V., Sapino, M.L., Satoh, S., Kerhervé, B., Cheng, W., Ide, I., Singh, V.K., Eds.; ACM: New York, NY, USA, 2022; pp. 127–136. [Google Scholar] [CrossRef]
  2. Jang, J.; Choi, H.; Bae, H.; Lee, S.; Kwon, M.; Jung, M. CXL-ANNS: Software-Hardware Collaborative Memory Disaggregation and Computation for Billion-Scale Approximate Nearest Neighbor Search. In Proceedings of the USENIX Annual Technical Conference, Boston, MA, USA, 10–12 July 2023; pp. 585–600. [Google Scholar]
  3. Zhang, J.; Peng, Y. Query-Adaptive Image Retrieval by Deep-Weighted Hashing. IEEE Trans. Multim. 2018, 20, 2400–2414. [Google Scholar] [CrossRef]
  4. Teng, S.; Li, J.; Teng, L.; Fei, L.; Wu, N.; Zhang, W. Scalable Discrete and Asymmetric Unequal Length Hashing Learning for Cross-Modal Retrieval. IEEE Trans. Multim. 2024, 26, 7917–7932. [Google Scholar] [CrossRef]
  5. Zhou, H.; Qin, Q.; Hou, J.; Dai, J.; Huang, L.; Zhang, W. Deep global semantic structure-preserving hashing via corrective triplet loss for remote sensing image retrieval. Expert Syst. Appl. 2024, 238, 122105. [Google Scholar] [CrossRef]
  6. Charikar, M. Similarity estimation techniques from rounding algorithms. In Proceedings of the 34th Annual ACM Symposium on Theory of Computing, Montréal, QC, Canada, 19–21 May 2002; Reif, J.H., Ed.; ACM: New York, NY, USA, 2002; pp. 380–388. [Google Scholar] [CrossRef]
  7. Indyk, P.; Motwani, R.; Raghavan, P.; Vempala, S. Locality-preserving hashing in multidimensional spaces. In Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, El Paso, TX, USA, 4–6 May 1997; pp. 618–625. [Google Scholar]
  8. Weiss, Y.; Torralba, A.; Fergus, R. Spectral hashing. Adv. Neural Inf. Process. Syst. 2008, 21. [Google Scholar]
  9. Qiao, S.; Wang, R.; Shan, S.; Chen, X. Deep Heterogeneous Hashing for Face Video Retrieval. IEEE Trans. Image Process. 2020, 29, 1299–1312. [Google Scholar] [CrossRef] [PubMed]
  10. Chen, C.R.; Fan, Q.; Panda, R. CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, 10–17 October 2021; pp. 347–356. [Google Scholar] [CrossRef]
  11. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, 3–7 May 2021. [Google Scholar]
  12. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002. [Google Scholar] [CrossRef]
  13. Dubey, S.R.; Singh, S.K.; Chu, W. Vision Transformer Hashing for Image Retrieval. In Proceedings of the IEEE International Conference on Multimedia and Expo, ICME 2022, Taipei, Taiwan, 18–22 July 2022; pp. 1–6. [Google Scholar] [CrossRef]
  14. Song, Y.; He, Z.; Qian, H.; Du, X. Vision Transformers for Single Image Dehazing. IEEE Trans. Image Process. 2023, 32, 1927–1941. [Google Scholar] [CrossRef] [PubMed]
  15. Wang, D.; Zhang, J.; Du, B.; Zhang, L.; Tao, D. DCN-T: Dual Context Network With Transformer for Hyperspectral Image Classification. IEEE Trans. Image Process. 2023, 32, 2536–2551. [Google Scholar] [CrossRef] [PubMed]
  16. Zhu, H.; Long, M.; Wang, J.; Cao, Y. Deep Hashing Network for Efficient Similarity Retrieval. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Schuurmans, D., Wellman, M.P., Eds.; AAAI Press: Washington, DC, USA, 2016; pp. 2415–2421. [Google Scholar]
  17. Cao, Z.; Long, M.; Wang, J.; Yu, P.S. HashNet: Deep Learning to Hash by Continuation. In Proceedings of the IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, 22–29 October 2017; pp. 5609–5618. [Google Scholar] [CrossRef]
  18. Zhang, S.; Tong, H.; Xu, J.; Maciejewski, R. Graph convolutional networks: A comprehensive review. Comput. Soc. Netw. 2019, 6, 1–23. [Google Scholar] [CrossRef] [PubMed]
  19. Jing, C.; Dong, Z.; Pei, M.; Jia, Y. Heterogeneous Hashing Network for Face Retrieval Across Image and Video Domains. IEEE Trans. Multim. 2019, 21, 782–794. [Google Scholar] [CrossRef]
  20. Zhang, Z.; Zou, Q.; Lin, Y.; Chen, L.; Wang, S. Improved Deep Hashing With Soft Pairwise Similarity for Multi-Label Image Retrieval. IEEE Trans. Multim. 2020, 22, 540–553. [Google Scholar] [CrossRef]
  21. Li, T.; Zhang, Z.; Pei, L.; Gan, Y. HashFormer: Vision Transformer Based Deep Hashing for Image Retrieval. IEEE Signal Process. Lett. 2022, 29, 827–831. [Google Scholar] [CrossRef]
  22. Song, J.; He, T.; Gao, L.; Xu, X.; Hanjalic, A.; Shen, H.T. Binary Generative Adversarial Networks for Image Retrieval. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, LA, USA, 2–7 February 2018; McIlraith, S.A., Weinberger, K.Q., Eds.; AAAI Press: Washington, DC, USA, 2018; pp. 394–401. [Google Scholar]
  23. Zhang, J.; Zhang, J. An Analysis of CNN Feature Extractor Based on KL Divergence. Int. J. Image Graph. 2018, 18, 1850017:1–1850017:20. [Google Scholar] [CrossRef]
  24. Guérin, J.; Thiery, S.; Nyiri, E.; Gibaru, O.; Boots, B. Combining pretrained CNN feature extractors to enhance clustering of complex natural images. Neurocomputing 2021, 423, 551–571. [Google Scholar] [CrossRef]
  25. Ng, W.W.Y.; Li, J.; Tian, X.; Wang, H. Bit-wise attention deep complementary supervised hashing for image retrieval. Multim. Tools Appl. 2022, 81, 927–951. [Google Scholar] [CrossRef]
  26. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015; Conference Track Proceedings. Bengio, Y., LeCun, Y., Eds.; ICLR: San Diego, CA, USA, 2015. [Google Scholar]
  27. Liu, H.; Wei, Y.; Yin, J.; Nie, L. HS-GCN: Hamming Spatial Graph Convolutional Networks for Recommendation. IEEE Trans. Knowl. Data Eng. 2022, 36, 5977–5990. [Google Scholar] [CrossRef]
  28. Xu, R.; Li, C.; Yan, J.; Deng, C.; Liu, X. Graph Convolutional Network Hashing for Cross-Modal Retrieval. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, 10–16 August 2019; Kraus, S., Ed.; IJCAI: Macao, China, 2019; pp. 982–988. [Google Scholar] [CrossRef]
  29. Zhou, X.; Shen, F.; Liu, L.; Liu, W.; Nie, L.; Yang, Y.; Shen, H.T. Graph Convolutional Network Hashing. IEEE Trans. Cybern. 2020, 50, 1460–1472. [Google Scholar] [CrossRef] [PubMed]
  30. Lu, X.; Zhu, L.; Liu, L.; Nie, L.; Zhang, H. Graph Convolutional Multi-modal Hashing for Flexible Multimedia Retrieval. In Proceedings of the MM ’21: ACM Multimedia Conference, Virtual Event, China, 20–24 October 2021; Shen, H.T., Zhuang, Y., Smith, J.R., Yang, Y., Cesar, P., Metze, F., Prabhakaran, B., Eds.; ACM: New York, NY, USA, 2021; pp. 1414–1422. [Google Scholar] [CrossRef]
  31. Cao, Y.; Long, M.; Liu, B.; Wang, J. Deep Cauchy Hashing for Hamming Space Retrieval. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1229–1237. [Google Scholar] [CrossRef]
  32. Liu, L.; Shao, L.; Shen, F.; Yu, M. Discretely Coding Semantic Rank Orders for Supervised Image Hashing. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 5140–5149. [Google Scholar] [CrossRef]
  33. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  34. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015; Conference Track Proceedings. Bengio, Y., LeCun, Y., Eds.; ICLR: San Diego, CA, USA, 2015. [Google Scholar]
  35. Lu, H.; Zhang, M.; Xu, X.; Li, Y.; Shen, H.T. Deep Fuzzy Hashing Network for Efficient Image Retrieval. IEEE Trans. Fuzzy Syst. 2021, 29, 166–176. [Google Scholar] [CrossRef]
  36. Liong, V.E.; Lu, J.; Tan, Y.; Zhou, J. Cross-Modal Deep Variational Hashing. In Proceedings of the IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, 22–29 October 2017; pp. 4097–4105. [Google Scholar] [CrossRef]
  37. Tu, R.; Mao, X.; Kong, C.; Shao, Z.; Li, Z.; Wei, W.; Huang, H. Weighted Gaussian Loss based Hamming Hashing. In Proceedings of the MM ’21: ACM Multimedia Conference, Virtual Event, China, 20–24 October 2021; Shen, H.T., Zhuang, Y., Smith, J.R., Yang, Y., Cesar, P., Metze, F., Prabhakaran, B., Eds.; ACM: New York, NY, USA, 2021; pp. 3409–3417. [Google Scholar] [CrossRef]
  38. Lin, T.; Maire, M.; Belongie, S.J.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the Computer Vision—ECCV 2014—13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V; Lecture Notes in Computer Science. Fleet, D.J., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer: Berlin/Heidelberg, Germany, 2014; Volume 8693, pp. 740–755. [Google Scholar] [CrossRef]
  39. Chua, T.; Tang, J.; Hong, R.; Li, H.; Luo, Z.; Zheng, Y. NUS-WIDE: A real-world web image database from National University of Singapore. In Proceedings of the 8th ACM International Conference on Image and Video Retrieval, CIVR 2009, Santorini Island, Greece, 8–10 July 2009; Marchand-Maillet, S., Kompatsiaris, Y., Eds.; ACM: New York, NY, USA, 2009. [Google Scholar] [CrossRef]
  40. Liu, H.; Wang, R.; Shan, S.; Chen, X. Deep Supervised Hashing for Fast Image Retrieval. Int. J. Comput. Vis. 2019, 127, 1217–1234. [Google Scholar] [CrossRef]
  41. Passalis, N.; Tefas, A. Deep supervised hashing using quadratic spherical mutual information for efficient image retrieval. Signal Process. Image Commun. 2021, 93, 116146. [Google Scholar] [CrossRef]
  42. Yuan, L.; Wang, T.; Zhang, X.; Tay, F.E.H.; Jie, Z.; Liu, W.; Feng, J. Central Similarity Quantization for Efficient Image and Video Retrieval. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, 13–19 June 2020; pp. 3080–3089. [Google Scholar] [CrossRef]
  43. Davis, J.; Goadrich, M. The relationship between Precision-Recall and ROC curves. In Proceedings of the Machine Learning, Twenty-Third International Conference (ICML 2006), Pittsburgh, PA, USA, 25–29 June 2006; ACM International Conference Proceeding Series. Cohen, W.W., Moore, A.W., Eds.; ACM: New York, NY, USA, 2006; Volume 148, pp. 233–240. [Google Scholar] [CrossRef]
Figure 1. The illustration of image retrieval.
Figure 2. The illustration of DSCEH. The first step, training, is to train the hash model, and the second step, index and retrieval, is to use the trained hash model for image retrieval.
Figure 3. Example views of the to-be-retrieved images on (a) CIFAR-10, (b) MSCOCO, and (c) NUSWIDE.
Figure 4. The PR curve for image retrieval using deep hash codes of 32, 48, and 64 bits on three datasets.
Figure 5. The PR curve comparison with different feature extraction settings on CIFAR-10.
Figure 6. The PR curve comparison with different correlation optimization settings on CIFAR-10.
Table 1. Symbols and definitions.

Symbol | Definition
$n$ | The number of input images.
$W$, $H$, $C$ | The width, height, and channels of an image.
$X$ | The matrix of input data, $\mathbb{R}^{n \times W \times H \times C}$.
$c$ | The number of categories.
$Y$ | The labels of the input data, $\mathbb{R}^{n \times c}$.
$L$ | The predicted label vectors, $\mathbb{R}^{n \times c}$.
$S$ | The similarity matrix, $\mathbb{R}^{n \times n}$.
$F$ | The hash function.
$k$ | The length of the hash codes in bits.
$H$ | The hash codes, $\mathbb{R}^{n \times k}$.
$P$ | The side length of a patch.
$b$ | The number of patches.
$x_p$ | The patch vector, $\mathbb{R}^{P \times P \times C}$.
$x_{cls}$ | The class token, $\mathbb{R}^{P \times P \times C}$.
$x_{emd}$ | The patch embedding vector, $\mathbb{R}^{(b+1) \times (P \times P \times C)}$.
$\mathrm{Pos}$ | The image's position information, $\mathbb{R}^{(b+1) \times (P \times P \times C)}$.
$x_{pos}$ | The position embedding vector, $\mathbb{R}^{(b+1) \times (P \times P \times C)}$.
$S_{adj}$ | The similarity adjacency matrix, $\mathbb{R}^{n \times n}$.
$F_{local}$ | The local feature of an image.
$F_{global}$ | The global representation of an image.
$F_{fused}$ | The fused feature of an image.
$L_{sim}$ | The similarity loss.
$L_{sem}$ | The semantic loss.
$D_{ij}$ | The Hamming distance.
Table 2. Data statistics of three datasets.

Datasets | #Label | #Train | #Query | #Retrieval
CIFAR-10 | 10 | 5000 | 1000 | 54,000
MSCOCO | 80 | 10,000 | 5000 | 117,218
NUSWIDE | 21 | 10,500 | 2100 | 197,374
Table 3. Mean Average Precision (mAP) of deep-hash-based retrieval on three datasets. The best results are marked in bold. Our method achieves the best mAP on all three datasets.

Methods | CIFAR-10@54000 (32 / 48 / 64 bits) | MSCOCO@5000 (32 / 48 / 64 bits) | NUSWIDE@5000 (32 / 48 / 64 bits)
HashNet | 0.7686 / 0.8031 / 0.7914 | 0.6801 / 0.7108 / 0.7093 | 0.8237 / 0.8453 / 0.8466
DCH | 0.7558 / 0.7455 / 0.7781 | 0.6536 / 0.6540 / 0.6583 | 0.7931 / 0.7942 / 0.7865
DSH | 0.4255 / 0.4049 / 0.4469 | 0.4314 / 0.4383 / 0.4501 | 0.5627 / 0.5885 / 0.5755
IDHN | 0.7920 / 0.7851 / 0.7909 | 0.6733 / 0.6798 / 0.6802 | 0.8097 / 0.8146 / 0.8189
QSMIH | 0.7834 / 0.8127 / 0.8157 | 0.6670 / 0.6772 / 0.6877 | 0.8256 / 0.8348 / 0.8410
DSCEH | 0.9824 / 0.9863 / 0.9946 | 0.7851 / 0.7974 / 0.8007 | 0.8642 / 0.8678 / 0.8761
Table 4. The mAP comparison with different feature extraction and correlation optimization on CIFAR-10.

Discussion | Settings | 32 bits | 48 bits | 64 bits
Origin | VIT+CNN+GCN | 0.9824 | 0.9863 | 0.9946
Feature | VIT+GCN | 0.9147 | 0.9334 | 0.9628
Feature | CNN+GCN | 0.8185 | 0.9262 | 0.8736
Correlation | CNN+VIT | 0.8341 | 0.7121 | 0.7331
Table 5. The mAP comparison with different numbers of VIT encoders on CIFAR-10.

#VIT Encoders | 32 bits | 48 bits | 64 bits
6 | 0.9824 | 0.9863 | 0.9946
9 | 0.9892 | 0.9953 | 0.9971
12 | 0.9889 | 0.9962 | 0.9981
