1. Introduction
In recent years, with the rapid development of multimedia Internet of Things technologies, the amount of multimedia network data has grown explosively. Consequently, current unimodal search methods can no longer meet the multimedia data retrieval requirements of the complex environment of the new information era. Therefore, cross-modal retrieval methods [1,2,3] have received increasing attention from the information retrieval community and have become a hot research topic in both academia and industry. Specifically, given a query in one modality (such as text), users expect the system to return semantically related results in the same modality (text) or in different modalities (such as images or videos). For decades, as a branch of nearest neighbor search (NNS), the hashing technique has been an active research field in the information retrieval community due to the following advantages: (1) lower storage cost and (2) faster retrieval through hardware-friendly bit-wise XOR and bit-count operations [4]. In the hash code learning process, the learned hash codes should satisfy the condition that similar instances have similar hash codes in the Hamming space, and vice versa. Practical applications include image retrieval [5,6] and person re-identification [7,8].
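To make the second advantage concrete, the following minimal sketch (our own illustration, not part of any cited method) computes the Hamming distance between two toy 8-bit hash codes with a bit-wise XOR followed by a bit count:

```python
import numpy as np

# Two illustrative 8-bit hash codes packed into unsigned integers.
code_a = np.uint8(0b10110010)
code_b = np.uint8(0b10011010)

# XOR marks the bit positions where the codes disagree;
# the bit count (population count) of the result is the Hamming distance.
diff = np.bitwise_xor(code_a, code_b)
hamming_distance = bin(int(diff)).count("1")
print(hamming_distance)  # -> 2
```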
According to the learning paradigm, existing cross-modal hashing methods can be divided into two main categories:
Unsupervised cross-modal hashing methods [9,10,11,12,13,14,15]: Unsupervised hashing methods focus on discovering the intra- and inter-modality relations of multiple heterogeneous modalities without label information in order to learn the hash codes and the corresponding hash functions. However, due to the lack of label information, the retrieval performance of unsupervised cross-modal hashing methods is limited accordingly. This paradigm can be formulated as
$$\min_{f,g}\; \mathrm{dist}\big(\mathbf{X},\, g(f(\mathbf{X}))\big),$$
where $f(\cdot)$ can be considered as either a hash mapping function or an encoder and $g(\cdot)$ can be regarded as a decoder. During the unsupervised learning process, the key design factor of this learning paradigm is the choice of a suitable metric $\mathrm{dist}(\cdot,\cdot)$ that measures the distance between $\mathbf{X}$ and $g(f(\mathbf{X}))$, i.e., the distance between the data and its reconstruction should be minimized, where typically $\mathrm{dist}(\mathbf{X}, g(f(\mathbf{X}))) = \|\mathbf{X} - g(f(\mathbf{X}))\|_F^2$. Then, the hash codes can be computed by $\mathbf{B} = \mathrm{sign}(f(\mathbf{X}))$.
Supervised cross-modal hashing methods [16,17,18,19,20,21,22]: Supervised hashing methods have obtained satisfactory retrieval results by exploiting label information and have been extensively studied. This paradigm can be formulated as
$$\min\; \mathrm{dist}\big(\mathbf{L},\, C(\mathbf{V})\big) \quad \mathrm{s.t.}\ \mathbf{B} = \mathrm{sign}(\mathbf{V}),$$
where $P(\cdot)$ denotes a hash mapping function that selects a certain latent representation $\mathbf{V} = P(\mathbf{X})$, $C(\cdot)$ denotes a classifier, $\mathrm{sign}(\cdot)$ denotes the element-wise sign operation, $\mathbf{L}$ denotes the labels, and $\mathbf{B}$ denotes the learned hash codes. Then, the hash codes can be computed by $\mathbf{B} = \mathrm{sign}(P(\mathbf{X}))$.
Although supervised cross-modal hashing methods have achieved significant success, they still have the following limitations: (1) Limited semantic utilization. Directly converting the label matrix into a similarity matrix leads to semantic loss and thus a large semantic gap, especially on multi-label datasets. (2) Inefficient learning strategy. Some cross-modal hashing methods are based on symmetric learning strategies, which yield worse retrieval performance than asymmetric ones. (3) Flawed optimization strategy. Since the optimization of the hash codes is discrete, existing optimization strategies mainly fall into two kinds: one relaxes the discrete constraint, which leads to a large quantization error; the other optimizes the codes bit by bit, such as Discrete Cyclic Coordinate (DCC) descent [23]. Although the latter avoids the quantization error, optimizing the entire hash code requires k iterations, where k is the hash code length, so the optimization process is very time-consuming.
To address the above limitations, in this paper we propose a simple yet effective method, named Discrete Semantics-Guided Asymmetric Hashing (DSAH). Specifically, DSAH handles the nonlinear relations within different modalities with a kernelization technique, and an asymmetric learning scheme is proposed to effectively perform the hash function learning and hash code learning processes. Meanwhile, DSAH considers the following aspects. First, we leverage both the label information and the similarity matrix to enhance the semantic information of the learned hash codes. Second, we address the problems of matrix sparsity and outlier handling. In addition, a discrete optimization algorithm is proposed to solve the discrete problem. Our major contributions can be summarized as follows:
- 1.
A novel supervised cross-modal hashing method, i.e., DSAH, is proposed to learn discriminative, compact hash codes for large-scale retrieval tasks. DSAH takes both the label information and the similarity matrix into consideration, which improves the discriminative capability of the learned hash codes, and it addresses the problems of matrix sparsity and outlier handling.
- 2.
An asymmetric learning scheme with real-valued embeddings is proposed to effectively learn the hash function and the hash codes.
- 3.
Comprehensive experiments are conducted on two widely used benchmark datasets. The experimental results demonstrate that DSAH outperforms several state-of-the-art baselines.
The remainder of this paper is organized as follows. Section 2 briefly reviews the related work. Section 3 introduces the details of the DSAH model and presents the alternating optimization algorithm. Section 4 reports the experimental results. Finally, Section 5 concludes the paper.
3. The Proposed DSAH Framework
In this section, we introduce our proposed DSAH model. The framework of DSAH is shown in Figure 1; it consists of three main parts: hash function learning, a label alignment scheme, and an asymmetric learning framework. We describe each part in detail in the following subsections.
3.1. Definitions
Suppose that the multimedia training data contain M modalities, represented by $\{\mathbf{X}^{(m)}\}_{m=1}^{M}$, where $\mathbf{X}^{(m)}$ is the feature matrix of the m-th modality, n is the number of training instances, and $d_m$ is the dimension of modality m. In this paper, we focus on the supervised hashing paradigm; therefore, label information $\mathbf{L} \in \{0,1\}^{c \times n}$ is available, where c denotes the number of categories, $L_{ij} = 1$ indicates that the j-th instance belongs to category i, and $L_{ij} = 0$ otherwise. $\mathbf{B} \in \{-1,+1\}^{k \times n}$ denotes the hash codes, where k is the length of the hash codes. $H(\cdot)$ denotes the hash function. The main notations used in this paper are listed in Table 1.
3.2. Hash Function Learning
3.2.1. Kernelization
Kernelization is a widely used technique to handle the nonlinear relations in different modalities. Therefore, in this paper, we use the Radial Basis Function (RBF) kernel to express the nonlinear correlations among the original high-dimensional features [49,50,51]. Specifically, we define the RBF mapping $\phi(\mathbf{x})$ as follows:
$$\phi(\mathbf{x}) = \left[\exp\!\left(-\frac{\|\mathbf{x}-\mathbf{a}_1\|^2}{2\sigma^2}\right), \ldots, \exp\!\left(-\frac{\|\mathbf{x}-\mathbf{a}_q\|^2}{2\sigma^2}\right)\right]^{\top},$$
where $\{\mathbf{a}_i\}_{i=1}^{q}$ denotes the q anchors randomly chosen from the database and $\sigma$ is the free kernel-width parameter. Therefore, the complex original feature $\mathbf{x}$ can be converted into a nonlinear relation representation $\phi(\mathbf{x}) \in \mathbb{R}^{q}$.
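For illustration, the following sketch (our own, with placeholder values for the anchor count q and the kernel width sigma, which are not taken from the paper) computes such anchor-based RBF features with NumPy:

```python
import numpy as np

def rbf_kernel_features(X, anchors, sigma):
    """Map raw features X (n x d) to nonlinear representations (n x q)
    using q anchor points (q x d) and an RBF kernel."""
    # Squared Euclidean distance between every sample and every anchor.
    sq_dists = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 64))                       # toy image features (n = 500, d = 64)
anchors = X[rng.choice(len(X), 100, replace=False)]  # q = 100 anchors chosen from the database
phi_X = rbf_kernel_features(X, anchors, sigma=1.0)   # n x q kernelized features
```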
3.2.2. Feature Mapping
The aim of DSAH is to project the original features into compact hash codes. In this paper, we adopt two linear projections as the hash mapping functions for the image modality $\mathbf{X}^{(1)}$ and the text modality $\mathbf{X}^{(2)}$, respectively:
$$H^{(1)}\big(\mathbf{X}^{(1)}\big) = \mathrm{sign}\big(\mathbf{W}_1^{\top}\phi(\mathbf{X}^{(1)})\big), \qquad H^{(2)}\big(\mathbf{X}^{(2)}\big) = \mathrm{sign}\big(\mathbf{W}_2^{\top}\phi(\mathbf{X}^{(2)})\big),$$
where $\mathbf{W}_1$ and $\mathbf{W}_2$ are the hash mapping matrices, which map the modality-specific kernel features into the Hamming subspace, and $H^{(1)}(\cdot)$ and $H^{(2)}(\cdot)$ are the hash functions for the image modality and the text modality, respectively.
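Continuing in the same spirit, the sketch below (again an illustration; the projection matrix here is random rather than learned, and the shapes are assumptions) shows how a linear projection turns kernelized features into binary codes:

```python
import numpy as np

def hash_codes(phi_X, W):
    """Linear hash function: project kernelized features (n x q) with W (q x k)
    and binarize element-wise into {-1, +1}."""
    return np.where(phi_X @ W >= 0, 1, -1).astype(np.int8)

rng = np.random.default_rng(0)
phi_X = rng.normal(size=(500, 100))   # stand-in for the kernelized image features
W1 = rng.normal(size=(100, 32))       # placeholder for a learned image projection (k = 32 bits)
B_img = hash_codes(phi_X, W1)         # 500 x 32 binary codes for the image modality
```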
3.3. Label Alignment Scheme
As described above, labels contain rich semantic information; directly converting the complex label vectors into a binary semantic matrix will cause a loss of semantic information. The results of Gui's work [52] demonstrate that ordinary least squares regression is sensitive to the boundary contour. Inspired by the work in [53], we adopt the $\ell_{2,p}$ norm instead of the Frobenius norm to handle this problem. The $\ell_{2,p}$ norm can be formulated as
$$\|\mathbf{G}\|_{2,p} = \left(\sum_{i} \big\|\mathbf{g}^{i}\big\|_2^{p}\right)^{1/p},$$
where $\mathbf{g}^{i}$ is the i-th row of the semantic projection matrix $\mathbf{G}$. It is easy to see that when $p = 2$, Equation (3) is a standard Frobenius norm. However, in order to improve the robustness of the model to outliers and the sparsity of the label alignment matrix, we need to redefine the value of p. In general, the sparsity of the model can be guaranteed when the constraint $0 < p \le 1$ is satisfied. Therefore, in this paper, we set $p = 1$, since when $p < 1$ the problem is not convex. Then, we can rewrite Equation (3) as the $\ell_{2,1}$ norm. After some algebraic manipulations, minimizing the $\ell_{2,1}$ norm is equivalent to iteratively minimizing
$$\mathrm{Tr}\big(\mathbf{G}^{\top}\mathbf{D}\mathbf{G}\big),$$
where $\mathbf{D}$ is a diagonal matrix whose i-th diagonal element is defined as $d_{ii} = \frac{1}{2\|\mathbf{g}^{i}\|_2}$, with $\mathbf{g}^{i}$ the i-th row of $\mathbf{G}$.
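As an illustration of how this reweighting can be realized in practice, the sketch below (our own; the small epsilon guard is an added assumption for numerical stability) builds the diagonal matrix from the row norms of a toy semantic projection matrix:

```python
import numpy as np

def l21_reweight_diag(G, eps=1e-8):
    """Diagonal reweighting matrix for the l_{2,1} norm:
    d_ii = 1 / (2 * ||g^i||_2), with g^i the i-th row of G.
    eps prevents division by zero for all-zero rows."""
    row_norms = np.linalg.norm(G, axis=1)
    return np.diag(1.0 / (2.0 * np.maximum(row_norms, eps)))

rng = np.random.default_rng(0)
G = rng.normal(size=(100, 32))   # toy semantic projection matrix
D = l21_reweight_diag(G)         # recomputed from the current G in each alternating iteration
```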
3.4. Asymmetric Learning Framework
We briefly review the related work Supervised Hashing with Kernels (KSH) [51], whose symmetric learning framework can be formulated as
$$\min_{\mathbf{B}}\; \big\| k\mathbf{S} - \mathbf{B}^{\top}\mathbf{B} \big\|_F^2 \quad \mathrm{s.t.}\ \mathbf{B} \in \{-1,+1\}^{k \times n},$$
where $\mathbf{B}$ is the learned hash codes. However, Equation (5) has two major problems: (1) It is very time-consuming to compute it directly, as the similarity information $\mathbf{S}$ is an $n \times n$ matrix. (2) Some works [54,55] show that an asymmetric learning framework not only avoids this high time consumption but also improves the retrieval accuracy, because the value range of the asymmetric learning framework is wider than that of symmetric learning. Therefore, in this paper, we construct an asymmetric learning framework to learn the compact hash codes, that is,
The advantages of Equation (6) are as follows:
- 1.
The learning mode uses an efficient asymmetric learning architecture instead of a time-consuming symmetric one.
- 2.
The use of real-valued embeddings instead of binary embeddings yields a closer approximation of the semantic similarity, and the value of the objective function is smaller.
- 3.
The last term is used to reduce the quantization errors, which leads to a better retrieval performance.
However, Equation (6) has a limitation: for cross-modal retrieval tasks, we need to obtain unified hash codes. Therefore, we need to consider another discrete constraint, i.e., $\mathbf{B}^{(1)} = \mathbf{B}^{(2)}$. To make the optimization easier, we adopt a unified hash code $\mathbf{B}$ instead of explicitly minimizing this constraint; then, Equation (6) can be rewritten accordingly, with a balance parameter controlling the relative contribution of each modality.
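To give a feel for why the asymmetric formulation can attain a lower objective value, the toy sketch below (our own illustration; it uses a generic inner-product fit in the spirit of Equations (5) and (6), not the exact DSAH objective) compares the symmetric fit with binary codes against an asymmetric fit in which one side is real-valued:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 16
labels = rng.integers(0, 2, size=(n, 5))               # toy multi-hot labels
S = np.where(labels @ labels.T > 0, 1.0, -1.0)         # toy pairwise similarity in {-1, +1}
B = np.where(rng.normal(size=(n, k)) >= 0, 1.0, -1.0)  # fixed binary codes (n x k)

# Symmetric fit (Equation (5) style): both sides restricted to +/-1 codes.
sym_residual = np.linalg.norm(k * S - B @ B.T)

# Asymmetric fit (Equation (6) style): one side is a free real-valued embedding,
# here chosen by least squares, so its value range is wider and the residual cannot be larger.
V = (np.linalg.pinv(B) @ (k * S)).T                    # n x k real-valued embedding
asym_residual = np.linalg.norm(k * S - B @ V.T)

print(sym_residual, asym_residual)                     # asym_residual <= sym_residual
```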
3.5. The Joint Framework
Combining the above constraints and the individual objective functions, we obtain the overall objective in Equation (8), where the trade-off parameters balance the individual terms and the Frobenius norm regularization term is used to prevent overfitting.
3.6. Optimization
In this part, we use an alternating strategy to solve for the four variables in Equation (8), as the four variables are coupled with each other. The problem is split into four sub-problems, as follows.
Fix the other variables, update $\mathbf{B}$. The sub-problem of Equation (8) related to $\mathbf{B}$ is obtained by keeping only the terms that involve $\mathbf{B}$. In the next step, we need to solve this problem, in which the remaining quantities are constant with respect to $\mathbf{B}$. As $\mathbf{B}$ takes discrete values, it is challenging to solve for $\mathbf{B}$ directly. Instead, we adopt the augmented Lagrangian multiplier (ALM) method [39] to compute $\mathbf{B}$. Specifically, we introduce an auxiliary variable to replace $\mathbf{B}$ in the second term, together with a multiplier term that measures the gap between $\mathbf{B}$ and the auxiliary variable.
Then, the value of $\mathbf{B}$ can be solved with the closed-form solution given in Equation (12).
However, the computational complexity of evaluating this closed-form solution directly is quadratic in n, which is not suitable for large-scale retrieval tasks. To address this problem, we use the label matrix $\mathbf{L}$ to replace the similarity matrix $\mathbf{S}$. Specifically, we define the element at the i-th row and the j-th column of $\mathbf{S}$ through the inner product of the normalized label vectors of the i-th and j-th instances. Then, the similarity matrix $\mathbf{S}$ can be rewritten in Equation (13) as a low-rank expression in the normalized label matrix and the all-ones vector, where $\mathbf{1}$ is a vector with all elements being 1. Therefore, we can rewrite the term involving $\mathbf{S}$ as in Equation (14), so that $\mathbf{S}$ is never formed explicitly, which consumes time that is only linear in n.
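The following sketch illustrates the factorization trick numerically. It assumes the common construction $\mathbf{S} = 2\hat{\mathbf{L}}^{\top}\hat{\mathbf{L}} - \mathbf{1}\mathbf{1}^{\top}$ from column-normalized labels, which matches the description above but whose exact constants are our assumption; the point is that the n-by-n similarity matrix never needs to be formed:

```python
import numpy as np

rng = np.random.default_rng(0)
n, c, k = 2000, 10, 32
L = rng.integers(0, 2, size=(c, n)).astype(float)     # c x n label matrix
L[0, L.sum(axis=0) == 0] = 1.0                         # ensure every instance has a label
L_hat = L / np.linalg.norm(L, axis=0)                  # column-normalized labels
B = np.where(rng.normal(size=(k, n)) >= 0, 1.0, -1.0)  # k x n hash codes
ones = np.ones((n, 1))

# Naive route: build S explicitly (n x n), then multiply -> O(n^2 k) time, O(n^2) memory.
S = 2.0 * (L_hat.T @ L_hat) - ones @ ones.T
naive = B @ S

# Factorized route: reassociate the products -> O(n k c) time, no n x n matrix.
fast = 2.0 * (B @ L_hat.T) @ L_hat - (B @ ones) @ ones.T

print(np.allclose(naive, fast))  # True
```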
Fix the other variables, update the auxiliary variable. The sub-problem related to the auxiliary variable introduced by the ALM scheme is obtained by collecting the terms that involve it. Then, its value can be solved with the closed-form solution given in Equation (16).
Update the multiplier. The Lagrange multiplier introduced by the ALM scheme is updated according to Equation (17), where a step-size parameter is used to control the convergence speed.
Fix the other variables, update $\mathbf{G}$. The sub-problem of Equation (8) related to the semantic projection matrix $\mathbf{G}$ is obtained by keeping only the terms that involve $\mathbf{G}$. In the next step, we need to solve the resulting problem. Setting the derivative of Equation (20) w.r.t. $\mathbf{G}$ to 0 and rearranging, we transform Equation (20) into Equation (21). It can then be seen that Equation (21) is a Sylvester equation; therefore, the value of $\mathbf{G}$ can be easily solved. Due to space limitations, the details of the solution are not given here.
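Since the derivation details are omitted, here is a minimal sketch of how a Sylvester equation of the form A X + X C = Q, the form into which Equation (21) can be cast, can be solved with SciPy; the coefficient matrices below are random placeholders rather than the ones actually assembled from the fixed DSAH variables:

```python
import numpy as np
from scipy.linalg import solve_sylvester

rng = np.random.default_rng(0)
# Placeholder coefficients standing in for the matrices assembled from the fixed variables.
A = rng.normal(size=(32, 32))
C = rng.normal(size=(10, 10))
Q = rng.normal(size=(32, 10))

X = solve_sylvester(A, C, Q)           # solves A @ X + X @ C = Q
print(np.allclose(A @ X + X @ C, Q))   # True up to numerical tolerance
```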
Fix the other variables, update $\mathbf{W}_1$. The sub-problem of Equation (8) related to the image-modality projection $\mathbf{W}_1$ is obtained by keeping only the terms that involve $\mathbf{W}_1$. Setting the derivative of Equation (22) w.r.t. $\mathbf{W}_1$ to 0, the value of $\mathbf{W}_1$ can be solved with a closed-form solution, in which the term involving the similarity matrix is also transformed using Equation (13); the resulting update in Equation (24) consumes time that is only linear in n.
Fix the other variables, update $\mathbf{W}_2$. The sub-problem of Equation (8) related to the text-modality projection $\mathbf{W}_2$ has the same form as above. It is easy to see that the optimization of $\mathbf{W}_2$ is almost identical to the $\mathbf{W}_1$-subproblem. Then, the value of $\mathbf{W}_2$ can be solved with the closed-form solution given in Equation (27).
Moreover, several of the above terms are constant and can be computed once before the iterative optimization.
The objective function is solved by iteratively updating the four variables until the objective function converges or the preset maximum number of iterations is reached. The iterative optimization procedure for solving Equation (8) is summarized in Algorithm 1.
Algorithm 1: DSAH
Input: Training modalities, labels, hash code length k, trade-off parameters, maximum iteration number T.
Output: Hash mapping functions for the image and text modalities.
Procedure:
1. Centralize the training features by their means.
2. Compute the kernelized features.
3. Initialize the variables randomly.
4. Repeat:
   B-Step: Update $\mathbf{B}$ according to Equation (12).
   Auxiliary-Step: Update the auxiliary variable according to Equation (16).
   Multiplier-Step: Update the multiplier according to Equation (17).
   G-Step: Update $\mathbf{G}$ according to Equation (21).
   W1-Step: Update $\mathbf{W}_1$ according to Equation (24).
   W2-Step: Update $\mathbf{W}_2$ according to Equation (27).
   Until the maximum iteration number T is reached.
3.7. Out-of-Sample Extension
In the query phase, the proposed DSAH can easily map an original high-dimensional instance into compact hash codes. Specifically, given a new query $\mathbf{x}_q$ from modality m, DSAH generates its corresponding hash code by
$$\mathbf{b}_q = \mathrm{sign}\big(\mathbf{W}_m^{\top}\phi(\mathbf{x}_q)\big),$$
where $\phi(\mathbf{x}_q)$ is the nonlinear kernelized embedding of $\mathbf{x}_q$.
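A minimal end-to-end sketch of this query-time encoding might look as follows (our own illustration; the anchors, kernel width, and projection matrix are random placeholders standing in for the quantities obtained during training):

```python
import numpy as np

def encode_query(x_q, anchors, sigma, W):
    """Out-of-sample extension: kernelize a new query against the training anchors
    and binarize its linear projection into a k-bit hash code."""
    phi = np.exp(-((anchors - x_q) ** 2).sum(axis=1) / (2.0 * sigma ** 2))  # q-dim RBF features
    return np.where(phi @ W >= 0, 1, -1).astype(np.int8)

rng = np.random.default_rng(0)
anchors = rng.normal(size=(100, 64))   # q = 100 anchors (would come from training)
W_img = rng.normal(size=(100, 32))     # placeholder learned image projection (k = 32)
x_q = rng.normal(size=64)              # a new image query feature vector
b_q = encode_query(x_q, anchors, sigma=1.0, W=W_img)
```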
3.8. Complexity Analysis
For each iteration, the time complexity is analyzed as follows. The updates of the individual variables all have computational costs that are linear in the number of training instances n, since the hash code length, the number of anchors, and the number of categories are all very small compared with n. Given T iterations, the overall training complexity of DSAH therefore remains linear in the training set size. Hence, DSAH is highly scalable for large-scale cross-modal retrieval tasks.
5. Conclusions
In this paper, we present a novel cross-modal hashing method, named DSAH, for large-scale cross-modal retrieval. In detail, to enhance the feature representation in the linear model, we handle the nonlinear relations with a kernelization technique. Meanwhile, DSAH incorporates the label information and the semantic similarity matrix into the learning process; therefore, DSAH can obtain more semantic information to improve the discriminative capability of the learned hash codes. Moreover, since the labels of large-scale datasets inevitably contain noise and subjective factors, the $\ell_{2,p}$ norm is used to impose sparsity and to handle outliers effectively. In addition, a discrete optimization algorithm is proposed to reduce the quantization errors and improve the optimization efficiency. Extensive experiments on two datasets demonstrate the superiority of DSAH on cross-modal retrieval tasks.