
Information Fusion 54 (2020) 99–118


IFCNN: A general image fusion framework based on convolutional neural network
Yu Zhang a,∗, Yu Liu b, Peng Sun c, Han Yan a, Xiaolin Zhao d, Li Zhang a,∗
a Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
b Department of Biomedical Engineering, Hefei University of Technology, Hefei 230009, China
c Beijing Aerospace Automatic Control Institute, Beijing 100854, China
d School of Aeronautics and Astronautics Engineering, Airforce Engineering University, Xi'an 710038, China

Keywords: General image fusion framework; Convolutional neural network; Large-scale multi-focus image dataset; Better generalization ability

Abstract: In this paper, we propose a general image fusion framework based on the convolutional neural network, named as IFCNN. Inspired by the transform-domain image fusion algorithms, we firstly utilize two convolutional layers to extract the salient image features from multiple input images. Afterwards, the convolutional features of multiple input images are fused by an appropriate fusion rule (elementwise-max, elementwise-min or elementwise-mean), which is selected according to the type of input images. Finally, the fused features are reconstructed by two convolutional layers to produce the informative fusion image. The proposed model is fully convolutional, so it could be trained in the end-to-end manner without any post-processing procedures. In order to fully train the model, we have generated a large-scale multi-focus image dataset based on the large-scale RGB-D dataset (i.e., NYU-D2), which owns ground-truth fusion images and contains more diverse and larger images than the existing datasets for image fusion. Without finetuning on other types of image datasets, the experimental results show that the proposed model demonstrates better generalization ability than the existing image fusion models for fusing various types of images, such as multi-focus, infrared-visual, multi-modal medical and multi-exposure images. Moreover, the results also verify that our model has achieved comparable or even better results compared to the state-of-the-art image fusion algorithms on four types of image datasets.

1. Introduction

The target of image fusion is to integrate the salient features of multiple input images into one comprehensive image [1–4]. Nowadays, image fusion has become more closely related to our daily lives and plays an increasingly important role in the industrial and military fields. For instance, mobile phones are often integrated with HDR (High Dynamic Range) [5–7] or refocusing algorithms [8–10] to enable us to capture satisfactory and informative pictures, where HDR and refocusing are essentially image fusion algorithms. In hospitals, surgeons diagnose diseases by inspecting multiple modalities of medical images (such as computed tomography (CT) and magnetic resonance (MR) images), and in particular they determine the precise boundaries of bone tumors according to the fused CT and MR images [11,12]. In military or civil surveillance systems, fusion of infrared and visual images brings the observers great convenience in fully learning about the supervised environment [2,13–16].

In general, image fusion algorithms can be divided into two categories [2,17], i.e., spatial-domain algorithms and transform-domain algorithms. The spatial-domain image fusion algorithms [9,17–19] firstly parse the input images into small blocks or regions according to some criterion, then measure the saliency of the corresponding regions, and finally combine the most salient regions to form the fusion image. This kind of algorithm is mainly suitable for fusing images of the same modality (such as multi-focus images), and probably suffers from block or region artifacts around the stitching positions.


Corresponding authors.
E-mail addresses: uzeful@163.com (Y. Zhang), chinazhangli@mail.tsinghua.edu.cn (L. Zhang).
URL: https://uzeful.github.io (Y. Zhang)

https://doi.org/10.1016/j.inffus.2019.07.011
Received 6 August 2018; Received in revised form 3 March 2019; Accepted 24 July 2019
Available online 24 July 2019
1566-2535/© 2019 Elsevier B.V. All rights reserved.

On the other hand, the transform-domain image fusion algorithms [3,20–25] firstly transform the source images into some feature domain through multi-scale geometric decomposition (such as multi-scale pyramids and multi-scale morphological operators), and then perform weighted fusion on the features of multiple input images. Afterwards, the fused features are inversely transformed to produce the fusion image. Since, in the feature domain, even images of different modalities share similar properties, the transform-domain image fusion algorithms can generally be used to fuse more types of images, such as infrared-visual images and CT-MR images. Moreover, this kind of algorithm has achieved great success in the past two decades. However, the fusion strategy or weight coefficients of the transform-domain algorithms are often hard to optimize for the fusion purpose, and thus they probably cannot achieve the optimal fusion results and may suffer from low-contrast or blurring effects.

In recent years, machine learning algorithms have been widely used in completing different kinds of image fusion tasks, and have achieved great success in the image fusion field. At the beginning, Yang et al. [26] employed the sparse representation technique to fuse multi-focus images, in which the image patches were represented with an overcomplete dictionary and corresponding sparse coefficients, and then the input images were fused through fusing the sparse coefficients of each pair or set of image patches. In the following, a sequence of sparse representation based algorithms [27–29] appeared to further improve the algorithms' performance and extend them to fuse more types of images (such as multi-modal medical images and infrared-visual images).

More recently, deep learning techniques, especially the convolutional neural network (CNN), have brought new evolution into the field of image fusion [30]. Firstly, Liu et al. [31] introduced CNN to fuse multi-focus images. They formulated multi-focus image fusion as a classification task and used CNN to predict the focus map, as each pair of image patches could be classified into two categories: (1) the first patch is focused and the second blurred, and (2) the first patch is blurred and the second focused. In [32], Tang et al. proposed a CNN model to learn an effective focus-measure (i.e., a metric for quantifying the sharpness degree of an image or image patch) and then compared the focus-measures of local image patch pairs of the input images to determine the focus map. Afterwards, the above two algorithms both post-processed the focus maps and reconstructed the fusion images according to the refined focus maps. In [33], Song et al. applied two CNNs to fuse spatiotemporal satellite images, i.e., large-resolution MODIS and low-resolution Landsat images. Specifically, they respectively used two CNNs to perform super-resolution on the low-resolution Landsat images and extract image features, and then adopted high-pass modulation and a weighting strategy to reconstruct the fusion image from the extracted features, similar to the transform-domain image fusion algorithms [15]. However, the above three algorithms were not designed in the end-to-end manner and all required post-processing procedures to produce fusion images, thus their models might not have been fully optimized for the image fusion task. In [34], Prabhakar et al. proposed an end-to-end multi-exposure fusion model. Specifically, they firstly used CNN to fuse the intensity channel (Y channel in the YCbCr color space) of the multiple input images, then leveraged a contrast-enhancement method to adjust the fused intensity channel, and afterwards employed the weighted-average strategy to respectively fuse the Cb and Cr channels. Finally, the fused channels (Y, Cb and Cr) were stacked together to produce the fusion image. Their model could be trained end-to-end and could be applied to fuse other types of images, such as multi-focus images. However, their results on the multi-focus image dataset appear to suffer from a low-contrast effect.

Even though the CNN models have achieved some success in the field of image fusion, the current models lack generalization ability and can only perform well on one specific type of images. This problem brings great difficulty in developing CNN based algorithms for fusing images without ground-truth images (such as infrared-visual images and CT-MR images). Moreover, most of the proposed CNN models are not designed in the end-to-end manner, and thus require additional procedures to complete the image fusion task. Overall, the CNN based models have not been fully exploited for the image fusion task, thus there is still much space to improve the architectures of the CNN based image fusion models, so as to increase their performance and generalization ability.

Through comparing the transform-domain image fusion algorithms and CNN based image generation models, we find there are several similar characteristics between these two kinds of algorithms. Firstly, the transform-domain algorithms usually extract the image features using several filters (such as Gaussian filters or morphological filters) at the beginning, and the CNN models also extract extensive features using a large number of convolutional filters. Secondly, the transform-domain fusion algorithms usually fuse the features through the weighted-average strategy, and the CNN models also utilize the weighted-average strategy (weighted sum of the convolutional features) to generate the target image. Compared to the transform-domain image fusion algorithms, the CNN models have three advantages: (1) the number of convolutional filters is usually much greater than that of the filters in the conventional transform-domain algorithms, and thus the convolutional filters can extract more informative image features; (2) the proper parameters of the convolutional filters can be learned to fit the image fusion task; (3) the parameters of the CNN models can be jointly optimized through training them in the end-to-end manner.

Inspired by the transform-domain algorithms, we propose a general image fusion framework based on the convolutional neural network, the architecture of which in the training phase is shown in Fig. 1. Firstly, we use two convolutional layers to extract the informative low-level features from multiple input images. Secondly, the extracted convolutional features of each input image are elementwisely fused by an appropriate fusion strategy, such as elementwise-maximum or elementwise-mean. Finally, the integrated features are reconstructed by two convolutional layers to produce the fusion image. As the proposed model is fully convolutional, it can be trained in the end-to-end manner without any post-processing procedures, which is one superior advantage compared to most of the existing image fusion models. Furthermore, in order to fully train the proposed model, we have created a large-scale multi-focus image dataset by blurring portions of images from our prebuilt NYU-D2 dataset [35] according to random depth ranges, which is more reasonable than blurring whole or certain portions of image patches as in [31,32]. Compared to the previous datasets for training image fusion models, the resolution (224 × 224) of our dataset is much larger than those (16 × 16, 32 × 32 and 64 × 64) of the datasets in [31,32,34], and the source RGB images in the NYU-D2 dataset can be taken as the ground-truth fusion images of our dataset, which is much better than the lack of ground-truth fusion images in the datasets of [31,32,34]. Thanks to the above merits, our high-resolution large-scale multi-focus image dataset can be used to finely train the image fusion models. During the training phase, we firstly adopt the mean square error (MSE) of the fusion image and the ground-truth fusion image to pretrain the model's parameters, and then combine the perceptual loss (mean square error of the deep convolutional features of the predicted fusion image and ground-truth fusion image) with MSE to jointly optimize the model's parameters. The appropriate model architecture, well generated multi-focus image dataset and superior loss function together guarantee that our algorithm achieves good performance on the image fusion task. Moreover, the extensive experimental results show that the proposed model can well fuse multiple types of images without any finetuning procedure, and meanwhile achieve comparable or even better performance than the state-of-the-art image fusion algorithms.

To sum up, the contributions of this paper are fourfold:

• In this paper, the image fusion task is formulated as a fully convolutional neural network, thus the proposed image fusion model can be trained in the end-to-end fashion so that all parameters of the proposed model can be jointly optimized for the image fusion task without any post-processing procedures. Based on the proposed CNN based image fusion framework, researchers can conveniently develop their own image fusion models for fusing various types of images.
• To fully train the model's parameters, we have generated a large-scale multi-focus image dataset. Instead of creating low-resolution pairs of fully focused and fully blurred image patches, we have generated high-resolution pairs of partially-focused images by blurring image portions of random depth ranges from the RGB and depth images in our prebuilt RGB-D dataset.


Fig. 1. The proposed general image fusion framework based on convolutional neural network. The above part illustrates the architecture of our image fusion model,
and the below part shows a demonstration example for fusing multi-focus images. Please note that the spatial sizes marked in the figure just indicate the ones used
in our training phase, and the inputs can be extended to more than two images.

Compared to the existing multi-focus image generation methods, our method is closer to the imaging principle of optical lenses, therefore the multi-focus images generated by our method are more natural and diverse than the pairs of fully focused and fully blurred image patches. Moreover, the RGB source images can be naturally taken as the ground-truth fusion images of the generated multi-focus image dataset, which is of great importance for supervising the image fusion models (i.e., regression models) to transfer the salient details from multiple inputs into one fusion image. Owing to these merits, our multi-focus image dataset can be used to fully and finely train the image fusion models.
• Owing to its similar structure to the transform-domain image fusion algorithms, our model owns better generalization ability than the existing CNN models for fusing various types of images. Although the proposed model has been trained only on the multi-focus image dataset, it has well learned the ability to fuse the convolutional features of multiple images of the same type or even different types. Therefore, our model can be directly applied to fuse other types of images (such as infrared-visual, CT-MR and multi-exposure images) without any finetuning procedures, and still achieve state-of-the-art results.
• To the best of our knowledge, it is the first time that perceptual loss has been introduced in training a CNN based image fusion model. The chief reason is that computation of the perceptual loss requires ground-truth fusion images, which however have not been generated in the existing image datasets for training image fusion models. Through introducing the perceptual loss, the trained image fusion model can produce fusion images with more textural information than those without incorporating perceptual loss.

In our opinion, there are two major novelties in this paper. Firstly, our model's characteristics of fully convolutional neural network and good generalization ability together compose the first major novelty of this paper. Secondly, our high-resolution large-scale multi-focus image dataset (with ground-truth fusion images) is another major novelty of this paper. The reasons are as follows: (1) To the best of our knowledge, there is still no fully convolutional neural network based image fusion model that can achieve state-of-the-art fusion images on multiple types of images without any finetuning procedure, as our model does; and (2) in the field of deep learning, the quality of the training dataset often directly determines the upper limit of the model's performance, thus our high-resolution large-scale multi-focus image dataset (with ground-truth fusion images) is better suited for fully training the image fusion models than the existing low-resolution large-scale image datasets (without ground-truth fusion images). Therefore, either of the two major novelties can make the proposed image fusion model stand out from the existing CNN based image fusion models. The rest of this paper is organized as follows. In Section 2, the proposed method, including our image fusion model and training dataset, is introduced in detail. The extensive experimental results and discussions are described in Section 3. Finally, the conclusions are drawn in Section 4.

2. Proposed method

2.1. Overview

In the field of computer vision, the convolutional layer plays the role of feature extraction, and usually can extract more extensive and informative features than the traditional handcrafted feature extractors [35,36]. In addition, the convolutional layer also plays the role of weighted averaging for producing the output image. These characteristics of the convolutional layer are quite similar to the transform-domain image fusion algorithms, thus the convolutional neural network has great potential to achieve success in the field of image fusion.

Inspired by the framework of the transform-domain image fusion algorithms, we have designed a general image fusion framework based on the convolutional neural network, which is abbreviated as IFCNN hereinafter. IFCNN consists of three modules: the feature extraction module, the feature fusion module and the image reconstruction module, as shown in Fig. 1.


Firstly, we adopt two convolutional layers to extract the informative image features. Secondly, the convolutional features of multiple input images are fused via the feature fusion module. Finally, the fused features are reconstructed by two convolutional layers to produce the fusion image.

In order to fully train our image fusion model, we have online generated a large-scale multi-focus image dataset based on our prebuilt NYU-D2 dataset [35,37], which consists of about 100,000 pairs of RGB and depth images. The focused and blurred portions of our multi-focus image pairs are determined according to their depth ranges, thus the generation of our multi-focus image dataset is intuitive and reasonable. In addition, the source RGB images can be directly taken as the ground-truth fusion images, which are important for fully training the regression models for image fusion. Moreover, in order to more effectively train the proposed model, we have introduced the perceptual loss to regularize the proposed model to generate fusion images with more similarity to the ground-truth fusion image. The details of the proposed image fusion model are introduced in the following subsections.

2.2. Image fusion model

In order to conveniently describe the proposed modules, we assume that there are N (N ≥ 2) input images to fuse, denoted by Ik (1 ≤ k ≤ N). Then, the three modules of the proposed image fusion model can be respectively detailed as follows.

2.2.1. Feature extraction module
Firstly, we adopt two convolutional layers to extract the extensive low-level features from the input images. Feature extraction is the crucial procedure in the transform-domain image fusion algorithms, and is usually conducted by processing images with multi-scale DOG (Difference of Gaussian) [15], multi-scale morphological filters [3] and so on. As for CNN, training a regression model (image-to-image) from randomly initialized convolutional kernels is usually hard and unstable, and thus a practical way is to transfer the parameters of a well-trained classification model to the regression model [35]. Thereby, in this paper, we adopt the first convolutional layer of the superior ResNet101 pretrained on ImageNet as our first convolutional layer (CONV1). As is known, CONV1 contains 64 convolutional kernels of size 7 × 7, which are sufficient to extract extensive image features, and CONV1 has been trained on the largest natural image dataset (i.e., ImageNet). Therefore, CONV1 can be used to extract effective image features, and thus we have fixed the parameters of CONV1 during training the proposed model. However, the features extracted by CONV1 are originally intended for the classification task, thus directly feeding them into the feature fusion module might not be appropriate for the image fusion task. Hence, we add a second convolutional layer (CONV2) to tune the convolutional features of CONV1 to suit feature fusion.

2.2.2. Feature fusion module
In this paper, our target is to propose a general image fusion model based on CNN, which can fuse various types of input images and also various numbers of input images. In general, there are two methods to fuse the convolutional features of multiple inputs: (1) the convolutional features of multiple inputs are firstly concatenated along the channel dimension, and then the concatenated features are fused by the following convolutional layer; (2) the convolutional features of multiple inputs are directly fused by elementwise fusion rules (such as elementwise-maximum, elementwise-sum and elementwise-mean). As the concatenation method requires the parameter number of the feature fusion module to vary with the number of inputs, models with this fusion method can only fuse a specific number of images once the model architecture is fixed. In contrast, the feature fusion module with the elementwise fusion method does not contain any parameters, can fuse various numbers of input images, and has been introduced in previous image fusion models [34].

Therefore, in our feature fusion module, the elementwise fusion rules are utilized to fuse the convolutional features of multiple inputs, which can be mathematically expressed as Eq. (1). As described above, there are three commonly used elementwise fusion rules, i.e., elementwise-maximum, elementwise-sum and elementwise-mean. In practice, the fusion rule should be selected according to the characteristics of the image dataset. For instance, the sharp features (maximum values) indicate the salient objects of the supervised scene, thus the elementwise-maximum fusion rule has often been used in the transform-domain image fusion algorithms to fuse multi-focus images, infrared and visual images, and medical images. However, multi-exposure image fusion aims to integrate the visually-pleasant middle-exposure portions of each input image, which most probably correspond to the mean features of multiple inputs. Thus, in this case, the elementwise-mean fusion rule might be more suitable for fusing the multi-exposure images than the elementwise-maximum fusion rule.

$\hat{f}^{j}(x, y) = \mathrm{fuse}\big( f^{j}_{i,C_2}(x, y) \big), \quad 1 \le i \le N, \qquad (1)$

where $f^{j}_{i,C_2}$ denotes the jth feature map of the ith input image extracted by CONV2, $\hat{f}^{j}$ denotes the jth channel of the fused feature maps produced by our feature fusion module, and fuse denotes the elementwise fusion rule (such as elementwise-maximum, elementwise-sum and elementwise-mean). Hence, in this paper, we have used the elementwise-mean fusion rule to fuse the multi-exposure images, and used the elementwise-maximum fusion rule to fuse other types of images, including multi-focus images, infrared and visual images, and multi-modal medical images.
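As a concrete illustration of Eq. (1), the short PyTorch sketch below applies an elementwise fusion rule to the CONV2 feature maps of N inputs. It is our own illustrative reading of the equation, not the authors' released code; the function name and tensor shapes are assumptions.

```python
import torch

def fuse_features(features, rule="max"):
    """Elementwise fusion of Eq. (1): `features` is a list of N tensors of shape
    (B, 64, H, W), each produced by CONV2 for one input image."""
    stacked = torch.stack(features, dim=0)      # (N, B, 64, H, W)
    if rule == "max":                           # multi-focus, infrared-visual, medical images
        return stacked.max(dim=0).values
    if rule == "mean":                          # multi-exposure images
        return stacked.mean(dim=0)
    if rule == "sum":
        return stacked.sum(dim=0)
    raise ValueError(f"unknown fusion rule: {rule}")
```

Because the rule contains no learnable parameters, the same trained network can fuse any number of inputs at test time.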
2.2.3. Image reconstruction module
Because our feature extraction module only contains two convolutional layers, the abstraction level of the extracted convolutional features is not high. Therefore, in the final stage of our proposed model, we also adopt two convolutional layers (CONV3 and CONV4) to reconstruct the fusion image from the fused convolutional features $\hat{f}$.

2.2.4. Model details
Down-sampling feature maps inevitably loses source information of the input images, which might affect the fusion image's quality. Therefore, in our image fusion model, we have not down-sampled the feature maps in any layer, and thus the size of the feature maps is kept the same as that of the input images throughout the model. To satisfy the above condition while generating good fusion images, the parameters of our image fusion model are set as follows.

Firstly, as the kernel size of CONV1 is 7 × 7, the stride and padding parameters of CONV1 are respectively set as 1 and 3. Secondly, because CONV2 is used to tune the convolutional features of CONV1, the number of feature maps of CONV2 should be the same as CONV1. Therefore, the kernel number and kernel size of CONV2 are respectively set as 64 and 3 × 3, and both the stride and padding parameters of CONV2 are set as 1. Thirdly, CONV3 also plays the role of tuning the fused convolutional features after the feature fusion module, thus its parameter settings are the same as CONV2, i.e., the kernel number and kernel size of CONV3 are respectively set as 64 and 3 × 3, and both the stride and padding parameters of CONV3 equal 1. Finally, CONV4 plays the role of reconstructing the feature maps into the 3-channel output, which is often implemented by elementwise weighted averaging [38]. Thereby, the kernel number and kernel size of CONV4 are respectively set as 3 and 1 × 1, and the stride and padding parameters of CONV4 are respectively set as 1 and 0.

Moreover, in order to overcome the overfitting problem and boost the training process, both middle convolutional layers (CONV2 and CONV3) are equipped with a ReLU activation layer [39] and a batch-normalization layer [40]. Because CONV1 has been well trained on ImageNet and thus does not need retraining, and the last convolutional layer (CONV4) is usually not equipped with an activation layer or batch-normalization layer, we have not appended ReLU or batch-normalization layers after CONV1 and CONV4.
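Putting the four layers together, a minimal PyTorch sketch of the architecture described in Sections 2.2.1–2.2.4 might look as follows. This is our reading of the text rather than the released implementation; the class name is hypothetical, and the CONV1 weights are copied from torchvision's pretrained ResNet101 (older torchvision versions use the `pretrained` flag, newer ones `weights=`).

```python
import torch
import torch.nn as nn
from torchvision.models import resnet101

class IFCNN(nn.Module):
    """Sketch of the IFCNN layers as described in Section 2.2 (an assumption, not the official code)."""
    def __init__(self, fuse_rule="max"):
        super().__init__()
        # CONV1: 64 kernels of 7x7, stride 1, padding 3; weights taken from ResNet101's first layer, frozen.
        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=1, padding=3, bias=False)
        self.conv1.weight.data.copy_(resnet101(pretrained=True).conv1.weight.data)
        self.conv1.weight.requires_grad = False
        # CONV2 and CONV3: 64 kernels of 3x3, stride 1, padding 1, each followed by BN and ReLU.
        self.conv2 = nn.Sequential(nn.Conv2d(64, 64, 3, 1, 1), nn.BatchNorm2d(64), nn.ReLU(inplace=True))
        self.conv3 = nn.Sequential(nn.Conv2d(64, 64, 3, 1, 1), nn.BatchNorm2d(64), nn.ReLU(inplace=True))
        # CONV4: 3 kernels of 1x1, stride 1, padding 0, no BN or activation.
        self.conv4 = nn.Conv2d(64, 3, kernel_size=1, stride=1, padding=0)
        self.fuse_rule = fuse_rule

    def forward(self, *inputs):
        # Shared feature extraction for each input, elementwise fusion, then reconstruction.
        feats = torch.stack([self.conv2(self.conv1(x)) for x in inputs], dim=0)
        if self.fuse_rule == "mean":            # multi-exposure images
            fused = feats.mean(dim=0)
        elif self.fuse_rule == "sum":
            fused = feats.sum(dim=0)
        else:                                   # elementwise-maximum for the other image types
            fused = feats.max(dim=0).values
        return self.conv4(self.conv3(fused))
```

Since no layer downsamples, the output keeps the spatial size of the inputs, which is what allows the model to operate on images of arbitrary resolution and on an arbitrary number of inputs.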


Overall, our model is designed to fuse multiple RGB images and produce one RGB fusion image. Nevertheless, the proposed model can be conveniently extended to fuse single-channel images by stacking three identical channels. Specifically, the RGB multi-focus images can be directly fused by the proposed model, while the infrared and visual images or multi-modal medical images should first be extended to three channels and then fused by our model. Finally, fusion of the RGB multi-exposure images is performed with reference to [34]: (1) converting the RGB input images to the YCbCr color space; (2) for each input image, separating the YCbCr channels and stacking three Y channels as the input of our image fusion model; (3) using our model to fuse the three-channel Y images of all source images and converting the three-channel output to the single-channel Y′ according to Eq. (2); (4) fusing the Cb and Cr channels of all source images by the same weighting strategy as Prabhakar et al.; (5) stacking Y′, fused Cb and fused Cr together and converting the result back to the RGB color space to produce the fusion image. Note that the input and output of Prabhakar et al.'s method and those of ours are a little different, i.e., both the input and output of their model are single-channel, while both the input and output of our model are three-channel. Therefore, during fusing the multi-exposure images, the Y channel of each source image is extended to three Y channels before being input into our image fusion model.

Y′ = 0.299 × R + 0.587 × G + 0.114 × B,  (2)

where R, G and B respectively correspond to the three channels of the produced fusion image.
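For reference, the Y′ reduction of Eq. (2) and its place in the multi-exposure pipeline can be sketched as below; `model` stands for a trained IFCNN-MEAN instance and is an assumption, as is the tensor layout.

```python
import torch

def rgb_to_y(rgb):
    """Eq. (2): collapse a 3-channel tensor of shape (B, 3, H, W) to the single-channel Y'."""
    r, g, b = rgb[:, 0:1], rgb[:, 1:2], rgb[:, 2:3]
    return 0.299 * r + 0.587 * g + 0.114 * b

# Steps (2)-(3) for one pair of multi-exposure inputs (y1, y2 are the Y channels already
# separated from the YCbCr representation of each source image; `model` is hypothetical):
#   fused_rgb = model(y1.repeat(1, 3, 1, 1), y2.repeat(1, 3, 1, 1))   # stack Y into 3 channels
#   fused_y   = rgb_to_y(fused_rgb)                                   # Eq. (2)
# The Cb/Cr channels are fused separately by the weighting of [34] and restacked with fused_y.
```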
At the end of this subsection, we demonstrate the performance of the feature extraction module and feature fusion module on one pair of multi-focus images, as shown in Fig. 2. We can see from Figs. 2(d) and (e) that the feature extraction module has extracted extensive feature maps from Figs. 2(a) and (b), ranging from informative edge details to flat basic elements. Figs. 2(d)–(f) show that the sharp features extracted from Figs. 2(a) and (b) have been successfully integrated into Fig. 2(f) by the feature fusion module. For clear observation, a set of feature maps bounded by red boxes in Figs. 2(d)–(f) are separately shown in Figs. 2(g)–(i) and projected to HSV color space for better visualization. We can find that the edge details of Fig. 2(g) are concentrated on the near focused boy, the edge details of Fig. 2(h) are concentrated on the far focused background, and the edge details in both Figs. 2(g) and (h) have been successfully integrated by our feature fusion module into the feature map in Fig. 2(i). Finally, Fig. 2(c) demonstrates that the fusion image reconstructed from Fig. 2(i) by our feature reconstruction module has well combined the clear portions of Figs. 2(a) and (b). The results in Fig. 2 validate the effectiveness of our feature extraction module and feature fusion module.

2.3. Training dataset

As is known, CNN models are data-driven, and a well-generated large-scale image dataset is the foundation of this kind of algorithm. In [31], Liu et al. assumed that each pair of multi-focus image patches mainly had two classes: (1) the first patch was focused and the second one blurred, and (2) the first patch was blurred and the second one focused. Therefore, they generated a large image dataset consisting of 2,000,000 pairs of image patches of size 16 × 16, by randomly cropping the focused patches from the ImageNet dataset and obtaining blurred ones by blurring the focused patches with a random scale of Gaussian kernel. Tang et al. proposed p-CNN to learn an effective focus-measure, where their p-CNN was formulated as a simple image classification task (three categories: each image patch is focused, blurred, or unknown). Based on this assumption, they generated a large-scale image dataset containing about 1,450,000 image patches of size 32 × 32. In this image dataset, there are 650,000 focused patches, 700,000 blurred patches and 100,000 unknown-type patches, which were rendered with 12 handcrafted blurring masks. In [34], since there is also no large-scale multi-exposure image dataset, Prabhakar et al. generated their multi-exposure image dataset by randomly cropping patches of size 64 × 64 from a small-scale multi-exposure image set. Furthermore, due to the lack of ground-truth fusion images, they trained the multi-exposure image fusion model through an unsupervised method, i.e., using the structural similarity loss (SSIM [41]).

As detailed above, the current datasets for training image fusion models mainly consist of small image patches (16 × 16, 32 × 32 and 64 × 64). Among the three existing datasets, the resolution of Liu et al.'s multi-focus image dataset is the lowest, and it only contains one type of image patch pair (i.e., one patch is focused and the other one blurred), which is not appropriate for fully training end-to-end image fusion models. The second multi-focus image dataset only consists of single blurred images and is designed to train the model's ability to identify the focus type (i.e., focused, defocused, or unknown) of image patches, which is not suitable for training an end-to-end image fusion network either. Finally, Prabhakar et al. took randomly cropped patches of size 64 × 64 from a small-scale set of multi-exposure images as their training dataset, and trained their model on the multi-exposure image dataset with the unsupervised SSIM loss. However, the low resolution of the image dataset and the absence of ground-truth fusion images will absolutely limit the performance of the trained image fusion models.

As reported in the previous literature, the multi-focus image dataset can be generated more easily than other types of image datasets, and more importantly, the ground-truth fusion images of the multi-focus images can be obtained simultaneously while generating the dataset. Due to the characteristics of optical lenses, the focused and blurred portions of naturally captured images are generally related to the scene depth. Thus, a reasonable way of generating the multi-focus image dataset is to create partially-focused image pairs from RGB-D image sets. As is known, NYU-D2 is a famous indoor dataset for depth estimation, which consists of thousands of RGB and depth image pairs. In our previous work [35], we built a large-scale NYU-D2 training set by uniformly sampling the training sequences of the NYU-D2 raw dataset, and the training dataset consists of about 100,000 pairs of RGB and depth images, all of which have been resized to 422 × 321. Therefore, in this paper, we have online generated a large-scale multi-focus image dataset based on our previously built NYU-D2 training dataset in a more reasonable and intuitive way.

To be specific, during training the model, our online multi-focus image dataset is simultaneously generated from pairs of RGB and depth images of the NYU-D2 dataset by the following procedures:

(1) a completely blurry image Ib is generated by randomly blurring the source RGB image Is with a Gaussian filter, which can be expressed as

I_b = G ∗ I_s,  (3)

where ∗ denotes the convolution operation, and G denotes the Gaussian kernel, which is generated with a random kernel radius kr ranging from 1 pixel to 15 pixels according to Eq. (4):

$G(x, y) = \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{x^2 + y^2}{2\sigma^2}}, \qquad (4)$

where σ denotes the standard deviation of the Gaussian filter and can be calculated as σ = 0.3 × (kr − 1) + 0.8.¹

(2) the focus map Im for determining the focused and blurred portions of the multi-focus images is generated according to the random depth range. Specifically, the near portions and far portions of the scene are separated by a random depth threshold dth ranging from 0.3 to 0.7 of the maximum scene depth. Then, the near portions (where depth is less than or equal to dth) of Im are set as 1, and the far portions (where depth is greater than dth) of Im are set as 0.

¹ https://docs.opencv.org/3.4.2/d4/d86/group__imgproc__filter.html#gac05a120c1ae92a6060dd0db190a61afa


Fig. 2. Demonstration of feature extraction and feature fusion. (a) and (b) are a pair of multi-focus images. (c) is the fusion image of (a) and (b) produced by our image fusion model. (d) and (e) are respectively the 64 feature maps of (a) and (b) extracted by our feature extraction module (after CONV2). (f) shows the fused 64 feature maps of (a) and (b) by the feature fusion module (after FUSE). (g)–(i) indicate the closeups of feature maps bounded by red boxes in (d)–(f), which are projected to HSV color space for the clear observation of feature details. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

(3) a pair of multi-focus images is generated according to the RGB image Is, blurry image Ib and focus map Im: the near focused image I1 and far focused image I2 can be generated according to Eq. (5). Naturally, I1 and I2 form a pair of multi-focus images, and Is is their ground-truth fusion image Ig.

$I_1 = I_s \odot I_m + I_b \odot (\mathbf{1} - I_m), \qquad I_2 = I_s \odot (\mathbf{1} - I_m) + I_b \odot I_m, \qquad (5)$

where 1 denotes the all-one matrix of the same size as Is, and ⊙ denotes the elementwise product.

(4) in the training phase, the randomly generated multi-focus images and the corresponding source RGB images are further augmented by random resized-crop and random flips. Specifically, the final inputs and ground truth of our image fusion model are generated from each set of images (i.e., a pair of multi-focus images and one ground-truth fusion image) as follows. For each set of images, the two multi-focus images and the ground-truth fusion image are simultaneously processed by three procedures: firstly cropped at a random scale ranging from 0.5 to 1, then resized to 224 × 224, and finally randomly flipped in both vertical and horizontal directions. In the end, the processed multi-focus images are taken as the final inputs of the image fusion model, and the corresponding ground-truth fusion image is taken as the ground truth.

In this way, our multi-focus image dataset can be generated more naturally than the previous synthetic datasets, and the number of our multi-focus image pairs is effectively unlimited owing to our random generation method. Moreover, the focus maps generated by our method vary with the random depth threshold, and thus our method can produce more diverse multi-focus images than the previous methods. Finally, the source RGB image can be taken as the ground-truth fusion image of the corresponding pair of multi-focus images, which is a great advantage over the previous dataset generation methods. Fig. 3 shows a demonstration example of our dataset generation method. It can be seen that the focus map in Fig. 3(d) has been reasonably generated by step (2) according to the depth range, and the multi-focus images in Figs. 3(e) and (f), generated by step (3) according to the RGB image, blurred image and focus map, look just like naturally captured ones. As described in step (4), we further augment the multi-focus images and ground-truth images by random resized-crop, vertical flip and horizontal flip; four sets of online generated multi-focus images and ground-truth fusion images used for training our models are shown in Fig. 4, from which we can see that the blurring styles of the generated multi-focus images are related to the depth range and look natural.
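A compact sketch of steps (1)–(3) is given below. It is illustrative only and rests on our own assumptions: images are float arrays in [0, 1], and OpenCV's GaussianBlur is called with sigma = 0 so that it derives σ from the kernel size by the same σ = 0.3(kr − 1) + 0.8 rule quoted above.

```python
import numpy as np
import cv2

def make_multifocus_pair(rgb, depth, rng=np.random):
    """Generate (I1, I2, Ig) from a float RGB image in [0, 1] and its depth map, following steps (1)-(3)."""
    # Step (1): fully blurred copy I_b = G * I_s with a random kernel radius kr in [1, 15], Eqs. (3)-(4).
    kr = rng.randint(1, 16)
    ksize = 2 * kr + 1                                  # with sigma=0, OpenCV uses 0.3*(kr - 1) + 0.8
    blurred = cv2.GaussianBlur(rgb, (ksize, ksize), 0)

    # Step (2): binary focus map I_m from a random depth threshold d_th in [0.3, 0.7] of the maximum depth.
    d_th = rng.uniform(0.3, 0.7) * depth.max()
    focus_map = (depth <= d_th).astype(rgb.dtype)[..., None]    # 1 = near portion, 0 = far portion

    # Step (3): composite the near- and far-focused images according to Eq. (5).
    i1 = rgb * focus_map + blurred * (1.0 - focus_map)          # near in focus
    i2 = rgb * (1.0 - focus_map) + blurred * focus_map          # far in focus
    return i1, i2, rgb                                          # the source RGB image is the ground truth
```

The random resized-crop to 224 × 224 and the random flips of step (4) would then be applied identically to all three returned images.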


Fig. 3. Demonstration of generating multi-focus image dataset according to Section 2.3. (a) and (b) are a pair of RGB image and depth image in the NYU-D2 dataset.
(c) is the randomly blurred image according to step (1). (d) is the randomly generated focus map according to step (2). (e) and (f) are a pair of multi-focus images
generated from (a), (c) and (d) according to step (3).

Overall, compared to the previous datasets, our online generated multi-focus image dataset has four advantages: (1) having a much larger image resolution, i.e., 224 × 224 compared to 16 × 16, 32 × 32 and 64 × 64; (2) consisting of more multi-focus image pairs; (3) having more diverse blurring styles compared to the fully focused-blurred style [31] and the 12 handcrafted blurring styles [32]; and (4) owning ground-truth fusion images. Owing to these advantages, our multi-focus image dataset can be used to fully and finely train the CNN based image fusion models.

2.4. Loss function

Prior to using learning based algorithms, the model's parameters should be optimized with an appropriate loss function so as to obtain predictions similar to the ground truth. In this paper, the target of our image fusion model is to regress one informative fusion image from multiple input images. Mean square error (MSE) is a basic but often used loss function for regularizing the prediction of the model to be close to the ground-truth output. However, due to the characteristics of the L2-norm, only regularizing the model with the MSE loss probably yields overly smooth fusion images. To solve this problem, researchers usually introduce perceptual loss functions [42,43] to regularize the network to produce images having more structural similarity with the ground-truth fusion images.

The perceptual loss functions are usually formulated as the mean square error of the high-level (deep) convolutional features of the output image and ground-truth image. Since the high-level features are usually extracted by CNNs pretrained for image classification, the difference of the high-level features of the output image and ground-truth image can be used to discriminate whether they belong to the same category or are the same object. In [43], the authors adopted the features of the fourth convolutional layer of VGG16 [44] pretrained on ImageNet to abstract the high-level representations of the input images. As is known, ResNet101 [45] achieves better performance and extracts deeper convolutional features than VGG16, and ideally the deeper convolutional features of ResNet101 can obtain a better abstraction of images. Therefore, in this paper, we use the features of the last convolutional layer of ResNet101 pretrained on ImageNet to construct our perceptual loss. To be specific, the proposed perceptual loss is formulated as the mean square error of the feature maps of the predicted fusion image and ground-truth fusion image extracted by the last convolutional layer of ResNet101, as shown in Eq. (6):

$P_{loss} = \frac{1}{C_f H_f W_f} \sum_{i,x,y} \big[ f_p^i(x, y) - f_g^i(x, y) \big]^2, \qquad (6)$

where f_p and f_g respectively denote the feature maps of the predicted fusion image and ground-truth fusion image, i denotes the channel index of the feature maps, and C_f, H_f and W_f denote the channel number, height and width of the feature maps.

$B_{loss} = \frac{1}{3 H_g W_g} \sum_{i,x,y} \big[ I_p^i(x, y) - I_g^i(x, y) \big]^2, \qquad (7)$

where I_p and I_g respectively denote the predicted fusion image and ground-truth fusion image, i denotes the channel index of the RGB images, and H_g and W_g denote the height and width of the ground-truth fusion image.

During training the model, we firstly choose the mean square error (MSE) of the predicted fusion image and the ground-truth image as the basic loss (calculated as Eq. (7)) to pretrain the proposed model. Afterwards, we add the proposed perceptual loss to the basic loss to finely train the model, the calculation of which can be expressed as Eq. (8):

T_loss = w1 · B_loss + w2 · P_loss,  (8)

where w1 and w2 respectively denote the weight coefficients of the basic loss and perceptual loss. In this paper, w1 and w2 are both set to 1, which is verified to be effective by the extensive experimental results.

Since all components of the proposed model and loss function are differentiable, the parameters of the proposed image fusion model can be learned through the error back-propagation method. Specifically, in this paper, the parameters of our models are all updated through stochastic gradient descent (SGD) back-propagation. More training details can be found in Section 2.5.

2.5. Training details

As reported in much of the literature, training details are important for reproducing the training process of CNN models. Therefore, we describe our training methods in detail as follows.


Fig. 4. Four sets of the generated multi-focus images and ground-truth fusion images. Each row shows a set of images, which respectively are the near focused
image, far focused image and their ground-truth fusion image from left to right.

First of all, the proposed model was pretrained on our online generated multi-focus image dataset with a minibatch of 64, under the regularization of the basic loss (Bloss), for 5000 iterations, during which the momentum of the batch-normalization (BN) layers was linearly decreased from 0.99 to 0. Afterwards, we froze the parameters of the BN layers and adopted our integrated loss (Tloss) to finely train the parameters of the convolutional layers for 60,000 iterations. Because Tloss requires relatively large computational resources when fine-training the model, the minibatch was decreased from 64 to 32 in the fine-training procedure.

As for the learning rates, the basic learning rates during pretraining and fine-training are both set to 0.01, and are gradually decreased to 0 according to the 'poly' learning rate policy. That is, the learning rate lri in the ith iteration equals the basic learning rate lr0 multiplied by (1 − i/maxI)^power, which can be expressed as Eq. (9). During the fine-training procedure, the online generated multi-focus dataset was further augmented by randomly tuning the HSV channels of the input images and ground-truth image, multiplying each channel by a random ratio ranging from 0.8 to 1.2. Then, the color space of the input images and ground-truth image was randomly transformed to grayscale with a probability of 0.5, which helped the image fusion model prevent the color-shifting problem.

lr_i = lr_0 × (1 − i/maxI)^power,  (9)


where maxI denotes the maximum permitted iteration number and power is used to tune the decreasing rate of lri. In this paper, power is set to 0.9 in all experiments.
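To make the loss functions of Eqs. (6)–(8) and the two-stage schedule with the 'poly' decay of Eq. (9) concrete, a hedged PyTorch sketch is given below. It reflects our reading of the text, not the released training script: the perceptual feature extractor is built by truncating torchvision's pretrained ResNet101 (the `pretrained` flag becomes `weights=` in newer versions), and the usage comments refer to a hypothetical `model` and data `loader`.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet101

# Perceptual feature extractor: ResNet101 up to its last convolutional block (avgpool and fc dropped).
_backbone = resnet101(pretrained=True).eval()
perceptual_net = nn.Sequential(*list(_backbone.children())[:-2])
for p in perceptual_net.parameters():
    p.requires_grad = False

_mse = nn.MSELoss()   # mean over all elements, matching the normalizations of Eqs. (6)-(7)

def basic_loss(pred, gt):
    return _mse(pred, gt)                                        # Eq. (7)

def total_loss(pred, gt, w1=1.0, w2=1.0):
    p_loss = _mse(perceptual_net(pred), perceptual_net(gt))      # Eq. (6)
    return w1 * basic_loss(pred, gt) + w2 * p_loss               # Eq. (8), w1 = w2 = 1 in the paper

def poly_scheduler(optimizer, max_iter, power=0.9):
    """'Poly' policy of Eq. (9): lr_i = lr_0 * (1 - i / maxI) ** power."""
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lambda i: (1.0 - i / max_iter) ** power)

# Intended usage (sketch only; `model` and `loader` are placeholders):
#   Stage 1: pretrain with basic_loss, minibatch 64, 5000 iterations, lr_0 = 0.01.
#   Stage 2: freeze the BN layers, switch to total_loss, minibatch 32, 60,000 iterations, lr_0 = 0.01.
#   optimizer = torch.optim.SGD((p for p in model.parameters() if p.requires_grad), lr=0.01)
#   scheduler = poly_scheduler(optimizer, max_iter=60000)
#   for step, (img1, img2, gt) in enumerate(loader):
#       optimizer.zero_grad()
#       total_loss(model(img1, img2), gt).backward()
#       optimizer.step(); scheduler.step()
```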
Finally, all the proposed models are implemented in the PyTorch framework,² and trained and tested on a platform with an Intel Core i7-3770k CPU and an NVIDIA TITAN X GPU. The consumption of computational resources during the pretraining, fine-training, and inference procedures is as follows. Pretraining our model occupies about 10 GB of GPU memory and takes about 8.5 h, fine-training the model occupies about 9 GB of GPU memory and takes about 44 h, and inferring one fusion image of size 520 × 520 from two input images occupies about 785 MB of GPU memory and takes about 0.02 s. Therefore, our proposed IFCNN can be conveniently deployed in realtime applications without consuming much computational resources. The implementation code of our image fusion model will be available on the project page.³

² https://pytorch.org.
³ https://github.com/uzeful/IFCNN.

3. Experimental results and discussions

In this section, we have done extensive experiments to validate the performance of the proposed image fusion model. Firstly, the experimental settings are briefly described, then the qualitative and quantitative results are illustrated and discussed, and the conclusions of this section are drawn at the end.

3.1. Experimental settings

In order to verify the advantages of our proposed model, we have compared it with four representative image fusion algorithms on four types of image datasets, and evaluated the algorithms in both qualitative and quantitative ways. The comparison algorithms, image datasets, and evaluation methods are respectively introduced below.

3.1.1. Comparison algorithms
Since the proposal of our image fusion model is inspired by the framework of transform-domain image fusion algorithms, we compare our proposed model with one representative transform-domain algorithm to validate its effectiveness, i.e., the multi-scale transform and sparse representation based image fusion algorithm (LPSR) [24]. Besides, we compare our model with the state-of-the-art guided filtering based image fusion algorithm (GFF) [4], which is also a general image fusion algorithm. Even though our model is only trained on the multi-focus image dataset, it is designed as a general image fusion model for fusing various types of images. Therefore, we further compare our model with two existing image fusion models (i.e., the multi-focus image fusion model (MFCNN) [31] and the multi-exposure image fusion model (MECNN) [34]), to validate that the proposed model can achieve comparable or even better performance than the current state-of-the-art image fusion models.

As for our own algorithms, we have compared the above four algorithms with three models implemented with the elementwise-maximum, elementwise-mean and elementwise-sum fusion rules, respectively named IFCNN-MAX, IFCNN-MEAN and IFCNN-SUM. As discussed in Section 2.2, we choose IFCNN-MAX as our chief model to fuse the multi-focus images, infrared and visual images, and multimodal medical images, and choose IFCNN-MAX trained only with Bloss (abbreviated as BASELINE-MAX) as the baseline model for fusing these three types of images. In addition, we choose IFCNN-MEAN as our chief model to fuse the multi-exposure images, and accordingly choose IFCNN-MEAN trained only with Bloss (abbreviated as BASELINE-MEAN) as the baseline model for fusing the multi-exposure images.

3.1.2. Image datasets
In order to fully demonstrate the effectiveness of the proposed image fusion model, we have evaluated the compared algorithms on four types of image datasets, including the multi-focus image dataset [46], infrared and visual image dataset [15], multi-modal medical image dataset [47], and multi-exposure image dataset [48]. The four image datasets are respectively shown in Figs. 5–8.

3.1.3. Evaluation methods
When evaluating the image fusion algorithms, we have adopted both qualitative and quantitative methods to discriminate the performance of different image fusion algorithms. Firstly, qualitative evaluation is performed by judging the visual effects of the fusion images with respect to the source images. To be specific, whether the visual effects of the fusion images are satisfying for each type of image dataset can be judged by the following criteria: (1) multi-focus image fusion should integrate as many clear and sharp features as possible from each source image into the fusion image, (2) infrared and visual image fusion should preserve as much visible appearance information as possible from the visual image and inject as many salient bright features as possible from the infrared image into the fusion image, (3) multi-modal medical image fusion should combine as many typical features as possible from source images of different modalities into the fusion image, and (4) multi-exposure image fusion should inject as many clear middle-exposure features as possible from each source image into the fusion image. Besides the above criteria, another important criterion is that the fusion images should look as natural as possible so that human eyes can easily and accurately obtain the comprehensive information from the fusion image.

Since comparing only the visual quality probably cannot objectively and fairly discriminate the performance of different image fusion algorithms, we further utilize five often used metrics to evaluate the quantitative performance of the algorithms on the multi-focus, infrared-visual and multi-modal medical image datasets. The five metrics are the visual information fidelity (VIFF) [49], improved structural similarity (ISSIM) [50], normalized mutual information (NMI) [51], spatial frequency (SF) [52], and average gradient (AG) [53]. Among the five metrics, VIFF measures the visual information fidelity of the fusion image with respect to the source images, ISSIM measures the structural similarity between the fusion image and source images, NMI measures the amount of information of the fusion image that has been preserved from the two source images, and SF and AG measure the amount of textural information of the fusion image from two different statistical views. Evaluating fusion results with these five metrics can effectively reflect the algorithms' abilities at integrating the visual information and structural information, and at merging the image details, therefore the selection of these five metrics is appropriate.

Especially, there are more than two source images in each set of multi-exposure images (see Fig. 8), thus the metrics (VIFF, ISSIM and NMI) designed for two input images are not valid when evaluating algorithms on the multi-exposure image dataset. Therefore, we select SF, AG and another structural similarity metric, MESSIM [7] (especially designed to evaluate multi-exposure image fusion algorithms), to quantify the performance of the algorithms on fusing multi-exposure images. Finally, note that greater values of VIFF, ISSIM, NMI, SF, AG and MESSIM indicate better performance of the algorithms. In the next four subsections, the evaluation results on the four types of image datasets are respectively described and discussed.


Fig. 5. The multi-focus image dataset. This dataset includes 20 pairs of near and far focused images. In each pair of images, the left one is the near focused image
and the right one is the far focused image.

Fig. 6. The infrared and visual image dataset. This dataset includes 14 pairs of infrared and visual images. In each pair of images, the left one is the visual image
and the right one is the infrared image.
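As a side note to the metrics listed in Section 3.1.3, one common formulation of the two gradient-based measures (SF and AG) is sketched below. The exact definitions and normalizations used in [52,53] may differ, so this is an illustrative assumption rather than the evaluation code used in the paper.

```python
import numpy as np

def spatial_frequency(img):
    """SF = sqrt(RF^2 + CF^2) for a grayscale image (one common formulation)."""
    img = img.astype(np.float64)
    rf = np.sqrt(np.mean((img[:, 1:] - img[:, :-1]) ** 2))   # row frequency: horizontal differences
    cf = np.sqrt(np.mean((img[1:, :] - img[:-1, :]) ** 2))   # column frequency: vertical differences
    return np.sqrt(rf ** 2 + cf ** 2)

def average_gradient(img):
    """AG: mean magnitude of the local intensity gradients (one common formulation)."""
    img = img.astype(np.float64)
    gx = img[:-1, 1:] - img[:-1, :-1]
    gy = img[1:, :-1] - img[:-1, :-1]
    return np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2.0))
```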

3.2. Multi-focus image fusion

Since our model has been trained only on the multi-focus image dataset, we firstly want to investigate the performance of the proposed model on fusing multi-focus images. Additionally, we also want to test the effectiveness of the perceptual loss and fusion rules on this dataset. In order to achieve the above two purposes, we have evaluated the algorithms on the multi-focus image dataset (see Fig. 5), and show two comparison examples on this dataset in Figs. 9 and 10.

Fig. 9(c)–(j) show the fusion results of Fig. 9(a) and (b), which captured a volleyball court in front of a chain-link fence. The ideal fusion of Fig. 9(a) and (b) should directly combine the clear chain-link fence of the near focused Fig. 9(a) and the clear volleyball court of the far focused Fig. 9(b) together into the fusion image. Fig. 9(c) shows that the fusion image of GFF exhibits a little more blurring around the fence (see the closeups) than the other algorithms, which might be caused by its smoothing operation on the weight map.


Fig. 7. The multi-modal medical image dataset. This dataset includes eight pairs of multi-modal medical images. In the first two pairs of images on the top row, the
left one is CT image and the right one is MR image. In other pairs of images, the two images are MR images of different modalities.

Fig. 8. The multi-exposure image dataset. This dataset includes six sets of multiple images with different exposure degrees. Each row shows one set of multi-exposure
images, in which exposure degrees of images from left to right are gradually ranging from under-exposure to over-exposure.

It can be seen from Fig. 9(e) that MFCNN fails to fuse the clear court behind the fence corner (as pointed by the red arrow in the closeup) into the fusion image due to its inaccurate focus map, while the other algorithms have all integrated this clear portion into their fusion images. Fig. 9(f) shows that MECNN has fused the salient features into the fusion image, the contrast of which, however, has been degraded by a large margin compared to the source images. Finally, as shown in Figs. 9(d) and (g)–(j), the fusion images of LPSR and our IFCNN-SUM, IFCNN-MEAN, BASELINE-MAX and IFCNN-MAX all have well integrated the clear features of both source images, and show better visual effects compared to those of the other algorithms.

Fig. 10(c)–(j) show the fusion results of Fig. 10(a) and (b), which captured a souvenir in front of the Sydney Opera House. The ideal fusion image of Fig. 10(a) and (b) should directly combine the clear souvenir and fingers of the near focused Fig. 10(a) and the clear background of the far focused Fig. 10(b). It can be seen from Figs. 10(c) and (e) that the fusion images of GFF and MFCNN suffer from a blurring effect around the koala's right ear (as pointed by the red arrows in the closeups). Fig. 10(f) shows that the fusion image of MECNN is still of low contrast, which implies the generalization ability of MECNN is weak for fusing multi-focus images. In the end, Figs. 10(d) and (g)–(j) show that LPSR and our proposed four models have finely injected the sharp features into their fusion images, and achieve comparable performance.

Besides the qualitative comparison, we have also calculated the five quantitative metrics of the fusion results as described in Section 3.1. The quantitative results are listed in Table 1. In this table and the tables in the following subsections, the value in bold font and the value in italic font respectively denote the best result and second-best result in the corresponding metric row. Each value before the bracket denotes the mean metric value on the full dataset, and the three integers in each bracket respectively represent the overall rank, the number of fusion images ranking first and the number of fusion images ranking second for the current algorithm under the evaluation of the current metric among all algorithms.


Fig. 9. The comparison example on the fifth pair of multi-focus images. (a) and (b) are the fifth pair of multi-focus images. (c)–(j) are the fusion images of (a) and (b)
respectively by GFF, LPSR, MFCNN, MECNN, IFCNN-SUM, IFCNN-MEAN, BASELINE-MAX and IFCNN-MAX. In each subfigure, the image patch bounded by green
box indicates the closeup of the image patch bounded by red box. (For interpretation of the references to color in this figure legend, the reader is referred to the web
version of this article.)

Table 1
Quantitative evaluation results on multi-focus image dataset.

Metrics GFF LPSR MFCNN MECNN IFCNN-SUM IFCNN-MEAN BASELINE-MAX IFCNN-MAX

VIFF 0.9806(5,0,6) 0.9930(1,9,2) 0.9806(4,0,1) 0.5606(8,0,0) 0.9800(7,1,2) 0.9821(6,4,2) 0.9809(3,3,5) 0.9823(2,3,2)


ISSIM 0.6187(6,1,3) 0.6177(7,0,1) 0.6205(5,4,3) 0.5463(8,2,0) 0.6312(3,3,3) 0.632(1,5,4) 0.6318(2,4,3) 0.6293(4,1,3)
NMI 1.044(2,0,16) 0.9817(3,0,1) 1.098(1,18,2) 0.8436(8,2,1) 0.8576(6,0,0) 0.8535(7,0,0) 0.8664(5,0,0) 0.9034(4,0,0)
SF 19.29(4,0,3) 19.41(2,8,7) 19.21(5,0,0) 7.888(8,0,0) 18.93(7,1,0) 19.01(6,0,1) 19.33(3,6,1) 19.42(1,5,8)
AG 2.864(3,1,3) 2.891(1,12,4) 2.854(4,0,0) 1.194(8,0,0) 2.834(7,2,1) 2.844(5,1,2) 2.837(6,0,0) 2.886(2,4,10)

achieves the highest VIFF metric value and our chief model IFCNN-MAX ranks second, which implies these two algorithms have obtained higher visual information fidelity than the other algorithms. As for the ISSIM metric, our proposed four models have achieved greater values than the other algorithms, which indicates our proposed models have preserved more structural information. In addition, the differences in ISSIM values among our four models are relatively small, which indicates that our four models have preserved comparable structural information from the input images in their fusion images. As is known, NMI relates to the joint distribution of gray values, and its value is greater if the gray-value distributions of the input images and the fusion image are more similar. Thus, algorithms that directly combine or lightly weight the input images (MFCNN and GFF) can obtain higher NMI values than the general transform-domain image fusion algorithms (LPSR, MECNN and our IFCNNs). As for SF and AG, our IFCNN-MAX and LPSR obtain close metric values and respectively rank first and second, which implies IFCNN-MAX and LPSR have produced fusion images with more textural details than the other algorithms. Especially, due to the degradation of contrast information, MECNN has obtained the lowest values on all metrics. According to the comprehensive evaluation on the multi-focus image dataset, our IFCNN-MAX could retain relatively higher visual information fidelity, preserve more structural information, and produce informative fusion images compared to the other algorithms. Therefore, our IFCNN-MAX has demonstrated comparable or even better performance on the multi-focus image dataset than the other state-of-the-art algorithms.
As for the ablation study, we can see that our chief model IFCNN-MAX has obtained greater metric values than the baseline model BASELINE-MAX on most of the metrics except ISSIM, and has also achieved better quantitative results than IFCNN-SUM and IFCNN-MEAN in most cases. Therefore, in total, the quantitative results on the multi-focus image dataset indicate that the elementwise-maximum fusion rule performs better than elementwise-sum and elementwise-mean for fusing multi-focus images, and that the model trained with the perceptual loss outperforms the model trained only with the MSE loss.
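
To make the three compared fusion rules concrete, a minimal sketch is given below (written with NumPy purely for illustration; it is not the exact IFCNN implementation), showing how the elementwise-max, elementwise-mean and elementwise-sum rules act on convolutional feature maps of identical shape extracted from the input images.

import numpy as np

def fuse_features(feature_maps, rule="max"):
    # feature_maps: list of arrays of identical shape (C, H, W), one per input image
    stack = np.stack(feature_maps, axis=0)        # shape (N, C, H, W)
    if rule == "max":                             # IFCNN-MAX: keep the strongest response
        return stack.max(axis=0)
    if rule == "mean":                            # IFCNN-MEAN: average the responses
        return stack.mean(axis=0)
    if rule == "sum":                             # IFCNN-SUM: accumulate the responses
        return stack.sum(axis=0)
    raise ValueError("unknown fusion rule: " + rule)

For multi-focus inputs, the max rule keeps the strongest activation at each spatial position, which is consistent with the quantitative advantage of IFCNN-MAX reported above; it also extends naturally to an arbitrary number of input images, which becomes relevant for the multi-exposure experiments in Section 3.5.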


Fig. 10. The comparison example on the 14th pair of multi-focus images. (a) and (b) are the 14th pair of multi-focus images. (c)–(j) are the fusion results of (a) and (b)
respectively by GFF, LPSR, MFCNN, MECNN, IFCNN-SUM, IFCNN-MEAN, BASELINE-MAX and IFCNN-MAX. In each subfigure, the image patch bounded by green
box indicates the closeup of the image patch bounded by red box. (For interpretation of the references to color in this figure legend, the reader is referred to the web
version of this article.)

3.3. Infrared and visual image fusion

In this subsection, we have compared the image fusion algorithms on the infrared and visual image dataset (see Fig. 6), and two comparison examples are shown in Figs. 11 and 12.
Figs. 11(c)–(j) show the fusion results of Figs. 11(a) and (b), which captured an outdoor scene with a person standing in the mountains. A good fusion image of Figs. 11(a) and (b) should preserve as many of the salient bright features (the person and several spots in this case) as possible and also maintain the visible appearance features (house, fence and trees in this case) of the visual image [16]. Fig. 11 shows that all algorithms have somewhat integrated the salient features of the infrared and visual images into their fusion images. However, the fusion images of MECNN, IFCNN-SUM and IFCNN-MEAN (see Figs. 11(f)–(h)) have relatively low contrast compared to those of the other algorithms. Fig. 11(e) shows that MFCNN fails to preserve many visible appearance features of the fence and trees from the visual image and also fails to integrate one bright spot at the top of the infrared image into the fusion image. Figs. 11(c) and (d) show that GFF and LPSR have lost more appearance features of the fence and trees around the person (pointed by the yellow arrows in the closeups) compared to IFCNN-MAX and BASELINE-MAX. Furthermore, we have observed the closeups around the person in Figs. 11(i) and (j) to closely compare IFCNN-MAX and BASELINE-MAX, and it can be seen that IFCNN-MAX has integrated more complete bright features of the person into the fusion image than BASELINE-MAX. However, the overall appearances of Figs. 11(i) and (j) show that BASELINE-MAX has preserved more visible features (see the trees in the bottom left corner) from the visual image than IFCNN-MAX.
Figs. 12(a) and (b) are the 10th pair of infrared and visual images, which captured the night scene of a street with several persons and two cars. Ideally, fusion of Figs. 12(a) and (b) should directly inject the salient bright features of the persons, cars and traffic lights from the infrared image into the visual image, so that the fusion image preserves most of the visual appearance features of the visual image while integrating the salient bright features of the infrared image. Figs. 12(c) and (d) show that GFF and LPSR have integrated too many useless bright background features into their fusion images, which makes the local contrast of their fusion images lower than that of the other algorithms (except MECNN). The fusion image of MFCNN yields a region artifact around the person (pointed by the yellow arrow in the closeup of Fig. 12(e)) and also fails to integrate much useful bright content (such as the bright persons at the top-right corner and the two bright traffic lights beside the road), due to the inappropriate focus map generated by MFCNN. Fig. 12(f) shows that MECNN has integrated the salient features of the infrared and visual images into the fusion image, which, however, still has low contrast. As for our four models, the fusion images of IFCNN-SUM and IFCNN-MEAN (see Figs. 12(g) and (h)) have lower contrast than those of BASELINE-MAX and IFCNN-MAX (see Figs. 12(i) and (j)). Finally, Figs. 12(i) and (j) show that both BASELINE-MAX and IFCNN-MAX have mainly injected the useful bright features from the infrared image into the fusion image and preserved most of the visible appearance features from the visual image, thus the fusion images of BASELINE-MAX and IFCNN-MAX are more suitable and comprehensive for visual perception than those of the other algorithms. Compared with BASELINE-MAX, the fusion image of IFCNN-MAX has integrated more useful infrared features, as pointed by the yellow arrows


Fig. 11. The comparison example on the second pair of the infrared and visual images. (a) and (b) are respectively the visual image and infrared image, and (c)–(j)
are the fusion results of (a) and (b) respectively by GFF, LPSR, MFCNN, MECNN, IFCNN-SUM, IFCNN-MEAN, BASELINE-MAX and IFCNN-MAX. In each subfigure,
the image patch bounded by green box indicates the closeup of the image patch bounded by red box. (For interpretation of the references to color in this figure
legend, the reader is referred to the web version of this article.)

Table 2
Quantitative evaluation results on infrared and visual image dataset.

Metrics GFF LPSR MFCNN MECNN IFCNN-SUM IFCNN-MEAN BASELINE-MAX IFCNN-MAX

VIFF 0.5507(5,0,3) 0.7327(2,6,2) 0.7757(1,6,0) 0.3167(8,0,0) 0.4506(6,0,0) 0.4495(7,0,0) 0.6621(3,2,4) 0.6405(4,0,5)


ISSIM 0.4272(8,2,1) 0.4364(7,0,3) 0.6084(1,7,0) 0.4395(6,0,1) 0.5039(3,0,4) 0.5041(2,2,1) 0.5002(4,3,4) 0.4956(5,0,0)
NMI 0.3614(6,0,4) 0.365(5,1,1) 1.047(1,12,1) 0.3783(3,1,1) 0.3384(8,0,0) 0.3402(7,0,0) 0.3945(2,0,7) 0.3699(4,0,0)
SF 9.623(6,0,0) 10.48(3,0,4) 9.503(7,1,0) 6.242(8,0,0) 9.936(5,0,0) 9.955(4,0,0) 11.08(2,4,6) 11.31(1,9,4)
AG 1.505(6,0,0) 1.678(3,1,3) 1.358(7,0,0) 0.9449(8,0,0) 1.600(4,0,0) 1.598(5,0,0) 1.781(2,1,9) 1.865(1,12,2)

in the closeups of Figs. 12(i) and (j), and has preserved many visible appearance features. Thus, IFCNN-MAX slightly outperforms BASELINE-MAX for fusing infrared and visual images.
Besides the above two comparison examples, the fusion results on another four pairs of infrared and visual images are briefly illustrated in Fig. 13 to further validate our previous conclusions. In each column of Fig. 13, the top two rows show a pair of visual and infrared images, and the bottom five rows from top to bottom show the fusion images of the visual and infrared images respectively by GFF, LPSR, MFCNN, MECNN and IFCNN-MAX. As shown in the third and fourth rows of Fig. 13, GFF and LPSR have integrated only very few visible appearance features from the visual images into their fusion images in most cases, which impacts the visual perception of the observed scene. The fifth row of Fig. 13 shows that the fusion images of MFCNN suffer from severe region artifacts in all four cases, and the fusion image in the second column has even lost the visible head features of the visual image and the important bright gun features of the infrared image. As for MECNN, the sixth row of Fig. 13 shows that all fusion images of MECNN have much lower contrast than those of the other algorithms. Finally, the bottom row of Fig. 13 shows that the fusion images of IFCNN-MAX have appropriately combined the visible appearance features of the visual images and the salient bright features of the infrared images, and show the best visual effect in most cases.
Moreover, we have evaluated the quantitative performance of the image fusion algorithms on the infrared and visual image dataset, and the metric values are listed in Table 2. Different from multi-focus images, the correlation between the infrared image and the visual image is usually low, and thus their features are often directly complementary. Therefore, directly combining the salient regions of the infrared and visual images usually yields an additive effect (i.e., high values) on metrics that measure the mutual feature or information correlation between the input images and the fusion image, as the VIFF, ISSIM and NMI values of MFCNN in Table 2 show. This is the reason why, although most fusion images of MFCNN yield inappropriate region effects as shown in Fig. 13, MFCNN could still obtain such high values on VIFF, ISSIM and NMI. Apart from MFCNN, our BASELINE-MAX and IFCNN-MAX achieve relatively high values on the VIFF, ISSIM and NMI metrics, thus our two chief models could preserve much visual and structural information and maintain much of the original gray-value distribution. Moreover, the SF and AG metric values rank the algorithms consistently with our visual judgement. Especially, our chief models IFCNN-MAX and BASELINE-MAX respectively rank first and second, which means these two models could produce fusion images with more textural details than the other algorithms. Overall, the qualitative and quantitative evaluation results on the infrared and visual image dataset imply that the fusion images of IFCNN-MAX show the best visual effects and preserve much textural information from the input images, and that IFCNN-MAX has demonstrated better generalization ability for fusing various types of images than MFCNN and MECNN.
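
Since SF and AG are referred to repeatedly in these comparisons, a brief sketch of the two metrics is given below (one common formulation, written in NumPy for illustration; the exact implementation used for Tables 1–4 may differ in minor details such as boundary handling).

import numpy as np

def spatial_frequency(img):
    # SF: combines the row-wise and column-wise first-order differences of a grayscale image
    img = img.astype(np.float64)
    rf = np.sqrt(np.mean(np.diff(img, axis=1) ** 2))   # row frequency
    cf = np.sqrt(np.mean(np.diff(img, axis=0) ** 2))   # column frequency
    return np.sqrt(rf ** 2 + cf ** 2)

def average_gradient(img):
    # AG: mean local gradient magnitude, a simple proxy for the amount of textural detail
    img = img.astype(np.float64)
    gx = np.diff(img, axis=1)[:-1, :]                  # crop both gradients to a common shape
    gy = np.diff(img, axis=0)[:, :-1]
    return np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2.0))

Higher values of both metrics indicate that more textural detail from the source images has been carried into the fusion image.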


Fig. 12. The comparison example on the 10th pair of the infrared and visual images. (a) and (b) are respectively the visual image and infrared image, and (c)-(j) are
the fusion results of (a) and (b) respectively by GFF, LPSR, MFCNN, MECNN, IFCNN-SUM, IFCNN-MEAN, BASELINE-MAX and IFCNN-MAX. In each subfigure, the
image patch bounded by green box indicates the closeup of the image patch bounded by red box. (For interpretation of the references to color in this figure legend,
the reader is referred to the web version of this article.)
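
Before turning to the medical images, the behaviour of the NMI metric discussed above can also be made concrete. NMI is built from mutual information between each source image and the fusion image; a hedged sketch of the underlying mutual-information term is given below (histogram-based, written in NumPy for illustration only; the normalization used for the NMI values in Tables 1–3 may differ).

import numpy as np

def entropy(hist):
    # Shannon entropy of a (possibly multi-dimensional) histogram of gray values
    p = hist.astype(np.float64) / hist.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(src, fused, bins=256):
    # MI(src; fused) = H(src) + H(fused) - H(src, fused), estimated from gray-level histograms
    joint, _, _ = np.histogram2d(src.ravel(), fused.ravel(), bins=bins)
    h_src = entropy(np.histogram(src.ravel(), bins=bins)[0])
    h_fused = entropy(np.histogram(fused.ravel(), bins=bins)[0])
    return h_src + h_fused - entropy(joint)

Because such a measure grows when the gray-value distribution of the fusion image closely follows that of one source, methods that copy or lightly blend source regions (such as MFCNN and GFF) tend to score higher on it than transform-domain methods, as observed in Tables 1 and 2.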

3.4. Medical image fusion

In this subsection, we have evaluated the image fusion algorithms on the multi-modal medical image dataset (see Fig. 7), and one comparison example is shown in Fig. 14.
Figs. 14(c)–(j) show the fusion results of Figs. 14(a) and (b), which are a pair of CT and MR slices scanned along the axial plane of the human brain. Ideally, fusion of Figs. 14(a) and (b) should integrate the bright skull features of the CT image and the textural tissue features of the MR image into the fusion image. Figs. 14(c)–(e) show that GFF, LPSR and MFCNN have failed to inject some portions of the bright skull features (pointed by the yellow arrows in the closeups) from the CT image into the fusion image, and the fusion image of MFCNN especially lacks the largest portion of skull features. It can be seen from Fig. 14(f) that MECNN has successfully integrated the salient features of the CT and MR images, but its fusion image suffers from a blurring effect. Finally, Figs. 14(g)–(j) show that our proposed four models have fused most of the bright skull and textural tissue features of the CT and MR images into their fusion images, among which the fusion image of IFCNN-MAX has integrated the most bright skull features. Therefore, in this comparison example, IFCNN-MAX has generated the fusion image with better visual quality than the other algorithms.
As discussed in Section 3.3, if the input images are weakly correlated with each other, MFCNN could obtain higher values of VIFF, ISSIM and NMI. Since medical images of different modalities are also weakly correlated due to their different imaging principles, a similar phenomenon occurs on the multi-modal medical image dataset, i.e., it can be seen from Table 3 that MFCNN ranks first in all of the VIFF, ISSIM and NMI metrics. Apart from MFCNN, LPSR has obtained higher VIFF, ISSIM and NMI values than the other algorithms, and our BASELINE-MAX and IFCNN-MAX rank just after LPSR. Meanwhile, our IFCNN-MAX and BASELINE-MAX still rank first and second on the SF and AG metrics, which indicates IFCNN-MAX could integrate much more textural information from the input images into the fusion image than the other algorithms. In addition, IFCNN-MAX has beaten BASELINE-MAX in almost all metrics except VIFF, thus the perceptual loss is effective for boosting the performance of our image fusion model. Overall, according to the qualitative and quantitative evaluations, IFCNN-MAX could inject more useful features of the input images into the fusion image, and performs comparably or even better than the other algorithms for fusing multi-modal medical images.

3.5. Multi-exposure image fusion

Finally, the image fusion algorithms are evaluated on the multi-exposure image dataset (see Fig. 8), and one comparison example is illustrated in Fig. 15.
Figs. 15(a)–(e) show that this image set contains five source images ranging from low to high exposure, which gradually shift from capturing the outdoor scene to the indoor layout. The ideal fusion image of this image set should integrate the clear portions of each source image, which usually correspond to the image portions under an appropriate (i.e., middle) exposure degree. To be specific, the ideal fusion image should integrate the outdoor scene, including the blue sky, table and house, from Figs. 15(a) and (b), the bottom indoor layout from Figs. 15(c) and (d), and the top indoor layout from Figs. 15(d) and (e). It can be seen from Figs. 15(f)–(m) that, except MECNN, all the other algorithms have integrated the visible outdoor scene and indoor layout together into their fusion images. The reason that MECNN fails to fuse this set of multi-exposure images might be that MECNN is originally trained to fuse two images and cannot be directly applied to fuse more than two images well, which indicates the weak generalization of MECNN with respect to the number of inputs. Even though GFF, LPSR, IFCNN-SUM and IFCNN-MAX have integrated the visible outdoor and indoor features, their fusion images have somewhat inappropriate visual effects compared to those of BASELINE-MEAN and IFCNN-MEAN. For instance, it can be seen from the closeups in Figs. 15(f) and (g) that the fusion images of GFF and LPSR show lower contrast on the indoor layout than those of our four models. Fig. 15(h) shows that MFCNN mainly combines the outdoor scene of Fig. 15(b)


Fig. 13. In each column, the top two images are respectively the visual image and the infrared image, and the images in the bottom five rows are respectively the fusion images
of the top two source images by GFF, LPSR, MFCNN, MECNN and IFCNN-MAX.

and the indoor layout of Fig. 15(e) into the fusion image, which shows an inappropriate region effect around the stitching area. In addition, MFCNN has not fully utilized the clear features of all source images, thus the visual quality of its fusion image is far from perfect for visual perception. Fig. 15(j) shows that the fused outdoor scene of IFCNN-SUM seems over-exposed and thus loses much outdoor detail, which might be caused by adding too many salient features from more than two images. Fig. 15(k) indicates that IFCNN-MAX has emphasized textural details so much that the fusion image shows an edging effect, which might impact the perception of human eyes. Through observing the closeups of Figs. 15(l) and (m), we can see that the fusion image of IFCNN-MEAN is a little brighter than that of BASELINE-MEAN, which is helpful for


Fig. 14. The comparison example on the second pair of multi-modal medical images. (a) and (b) are respectively the CT image and MR image, and (c)-(j) are the
fusion results of (a) and (b) respectively by GFF, LPSR, MFCNN, MECNN, IFCNN-SUM, IFCNN-MEAN, BASELINE-MAX and IFCNN-MAX. In each subfigure, the image
patch bounded by green box indicates the closeup of the image patch bounded by red box. (For interpretation of the references to color in this figure legend, the
reader is referred to the web version of this article.)

Table 3
Quantitative evaluation results on medical image dataset.

Metrics GFF LPSR MFCNN MECNN IFCNN-SUM IFCNN-MEAN BASELINE-MAX IFCNN-MAX

VIFF 0.6714(5,0,1) 0.7708(2,3,3) 0.7919(1,4,1) 0.5223(8,0,1) 0.6147(7,0,0) 0.6200(6,0,0) 0.6911(3,1,2) 0.6799(4,0,0)


ISSIM 0.4150(7,0,1) 0.4641(6,1,1) 0.5365(1,4,0) 0.2975(8,1,0) 0.4792(3,0,2) 0.4877(2,0,3) 0.4702(5,1,1) 0.4739(4,1,0)
NMI 0.6063(7,0,1) 0.6883(2,1,2) 0.9144(1,5,0) 0.5826(8,1,0) 0.6207(3,0,0) 0.6202(4,0,3) 0.6123(6,1,2) 0.6144(5,0,0)
SF 21.73(7,0,0) 24.13(3,1,1) 22.40(6,0,0) 14.69(8,0,0) 23.05(4,0,0) 22.78(5,0,0) 24.86(2,3,4) 24.98(1,4,3)
AG 2.892(7,0,0) 3.312(3,2,0) 2.898(6,0,0) 2.123(8,0,0) 3.164(4,0,0) 3.130(5,0,0) 3.340(2,3,4) 3.402(1,3,4)

Table 4
Quantitative evaluation results on multi-exposure dataset.

Metrics GFF LPSR MFCNN MECNN IFCNN-SUM IFCNN-MAX BASELINE-MAX IFCNN-MEAN

MESSIM 0.9204(1,6,0) 0.8043(5,0,0) 0.8041(6,0,0) 0.5947(8,0,1) 0.6396(7,0,0) 0.8151(4,0,0) 0.8713(3,0,1) 0.8786(2,0,4)


SF 25.93(7,0,0) 25.98(6,0,2) 26.21(5,0,0) 17.16(8,0,0) 32.48(2,2,2) 38.76(1,4,2) 29.49(4,0,0) 30.6458(3,0,0)
AG 3.645(5,0,0) 3.495(6,0,0) 3.494(7,0,0) 1.896(8,0,0) 3.815(4,1,2) 6.224(1,5,1) 4.623(3,0,2) 4.701(2,0,1)

the human eyes to grasp scene details. Therefore, our chief model IFCNN-MEAN shows the best visual effect in this comparison example.
Afterwards, the quantitative evaluation is performed on the multi-exposure image dataset according to the experimental settings, and the evaluation results are listed in Table 4. Table 4 shows that GFF obtains the largest MESSIM value, which indicates that GFF has preserved the most structural information from the input images. Our chief model IFCNN-MEAN ranks second in the MESSIM metric, while LPSR, IFCNN-MAX, MFCNN, MECNN and IFCNN-SUM have obtained relatively lower values. As for the SF and AG metrics, the quantitative results show that our proposed four models rank in the top four places. Specifically, IFCNN-MAX ranks first in both the SF and AG metrics, and IFCNN-MEAN ranks third and second in the SF and AG metrics, respectively, which means IFCNN-MAX and IFCNN-MEAN have produced fusion images with more textural information than the other algorithms. Compared to GFF, our chief model (IFCNN-MEAN) could not only preserve a comparable amount of structural information, but also retain more textural information from the input images. Overall, the qualitative and quantitative evaluation results on the multi-exposure image dataset imply that IFCNN-MEAN could integrate the visible features of suitable


Fig. 15. The comparison example on the fifth set of the multi-exposure images. (a)–(e) are respectively the source images of different exposure degrees, and (f)–(m)
are the fusion results of (a)–(e) respectively by GFF, LPSR, MFCNN, MECNN, IFCNN-SUM, IFCNN-MEAN, BASELINE-MAX and IFCNN-MAX. In each subfigure, the
image patch bounded by green box indicates the closeup of the image patch bounded by red box. (For interpretation of the references to color in this figure legend,
the reader is referred to the web version of this article.)

exposure degree from the input images into the fusion image, and could perform comparably or even better than the other algorithms.

3.6. Time cost comparison

As introduced in Section 2.5, our algorithm takes about 0.02 s to produce one fusion image of size 520 × 520 from two input images on the platform with an Intel Core i7-3770k CPU and an NVIDIA TITAN X GPU. Since we have only tested the CPU versions of MFCNN and MECNN, the time costs of MFCNN and MECNN are taken from the reports in Liu et al. [31] and Prabhakar et al. [34]. MFCNN (the slight model) takes about 0.33 s to fuse two input images of size 520 × 520 on a platform with an Intel Core i7-4790k CPU and an NVIDIA TITAN Black GPU. MECNN takes about 0.07 s to fuse two input images of size 512 × 384 on a platform with an Intel Xeon @3.5 GHz CPU and an NVIDIA Tesla K20c GPU. Finally, GFF and LPSR respectively cost 0.33 s and 0.20 s to fuse two input images of size 520 × 520 on our platform with the Intel Core i7-3770k CPU.

Table 5
Time cost comparison (Size unit: pixel, Time unit: second).

Algorithms   GFF         LPSR        MFCNN       MECNN       IFCNN
Image Size   520 × 520   520 × 520   520 × 520   512 × 384   520 × 520
Time Cost    0.33        0.20        0.33        0.07        0.02

For a clear comparison, the time costs of the compared algorithms are listed in Table 5, in which the shortest and second-shortest time costs are respectively highlighted in red and blue. Even though the evaluation platforms of IFCNN, MFCNN and MECNN are different, the impact of this difference on the running times should not be that significant. Thus, this comparison could still reflect that our algorithm is faster than the current CNN models and even the classical transform-domain algorithms. Moreover, our IFCNN only occupies about 785 MB of GPU memory for fusing two input images of size 520 × 520. Therefore, it is very


convenient to deploy our image fusion model into real-time surveillance systems without consuming too many computational resources.

3.7. Conclusions on experimental results

According to both the qualitative and quantitative evaluation results on the four types of image datasets, we can arrive at the following five conclusions:

• Our proposed chief models could achieve comparable or even better performance than the state-of-the-art image fusion algorithms.
• IFCNN-MAX outperforms IFCNN-SUM and IFCNN-MEAN on three types of image datasets (the multi-focus, infrared-visual and multi-modal medical image datasets), thus IFCNN-MAX demonstrates better generalization ability than IFCNN-MEAN and IFCNN-SUM. Overall, comparing the evaluation results on all four image datasets, the experiments verify that our chief models (IFCNN-MAX and IFCNN-MEAN) own better generalization ability than the existing models.
• Due to the wide range of exposure degrees of the multi-exposure images, IFCNN-MEAN is more suitable than IFCNN-MAX for fusing the multi-exposure images.
• Almost all results indicate that our chief models (IFCNN-MAX and IFCNN-MEAN) outperform the baseline models (BASELINE-MAX and BASELINE-MEAN), which implies the perceptual loss can boost the image fusion models to produce more informative fusion images.
• The proposed models are light-weight and efficient, thus our models could be conveniently deployed in real-time surveillance systems.

4. Conclusions

In this paper, we have proposed a general image fusion framework based on the convolutional neural network, which mainly has four advantages over the existing image fusion models: (1) Our model is fully convolutional and thus can be trained in the end-to-end manner without any post-processing procedures. (2) To finely train our model, we have reasonably generated a large-scale multi-focus image dataset by rendering partially-focused images with random depth ranges from the RGB and depth images of the NYU-D2 dataset. Moreover, rather than having no ground truth or using focus maps as ground truth as in the existing datasets, the source RGB images of the NYU-D2 dataset naturally become the ground-truth fusion images of our multi-focus image dataset, which is of great importance for optimizing the essentially regression-based image fusion models. (3) As our model is constructed similarly to the structure of the transform-domain image fusion algorithms, our model generally owns better generalization ability for fusing various types of images without any finetuning procedures than the existing image fusion models. (4) Owing to the existence of ground-truth fusion images, it is the first time that the perceptual loss is introduced to optimize image fusion models, which can boost the models to produce fusion images with more textural details. Without finetuning the image fusion models on other image datasets, the extensive experimental results on four types of image datasets validate that the proposed model demonstrates better generalization ability for fusing various types of images than the existing models, and achieves comparable or even better fusion images than the state-of-the-art image fusion algorithms.
This work sets a pioneering foundation for the applications of the convolutional neural network in the field of image fusion. However, even though the extensive experimental results have validated the proposed model's advantages, there are still several points that should be further addressed in order to obtain image fusion models with better performance. Firstly, our multi-focus image dataset only contains indoor images, thus extending the dataset with outdoor images such as the KITTI dataset [54] could probably increase the model's performance. Secondly, the proposed model only consists of four convolutional layers, therefore using a deeper convolutional neural network has great potential to further improve the model's performance. Thirdly, the proposed model is designed to fuse registered images, thus adding an image alignment module might enable the image fusion model to deal with unregistered cases. Fourthly, in this paper, we have only utilized linear elementwise fusion rules to fuse the convolutional features of multiple input images, thus incorporating a more complex and powerful feature fusion module could also boost the model's performance. Finally, our proposed model is designed as a general image fusion framework, thus its performance might be limited for fusing a specific type of images. Therefore, one practical way to improve the performance of CNN-based image fusion models is to design the architecture according to the specific characteristics of the target image dataset.

Acknowledgments

The authors are grateful to the anonymous reviewers and editors for their valuable comments on improving this paper's quality. The authors also would like to thank Bo Wang for sharing the infrared-visual image dataset, and thank K. Ram Prabhakar for providing their source code. This work is partly supported by the National Natural Science Foundation of China under Grant 61871248, Grant 61701160, Grant 61503405 and Grant U1533132.

References

[1] H. Li, B. Manjunath, S.K. Mitra, Multisensor image fusion using the wavelet transform, Graphical Models Image Process. 57 (3) (1995) 235–245.
[2] A.A. Goshtasby, S. Nikolov, Image fusion: advances in the state of the art, Inf. Fusion 8 (2) (2007) 114–118.
[3] X. Bai, F. Zhou, B. Xue, Fusion of infrared and visual images through region extraction by using multi scale center-surround top-hat transform, Opt. Express 19 (9) (2011) 8444–8457.
[4] S. Li, X. Kang, J. Hu, Image fusion with guided filtering, IEEE Trans. Image Process. 22 (7) (2013) 2864–2875.
[5] A.A. Goshtasby, Fusion of multi-exposure images, Image Vision Comput. 23 (6) (2005) 611–618.
[6] R. Shen, I. Cheng, J. Shi, A. Basu, Generalized random walks for fusion of multi-exposure images, IEEE Trans. Image Process. 20 (12) (2011) 3634–3646.
[7] K. Ma, K. Zeng, Z. Wang, Perceptual quality assessment for multi-exposure image fusion, IEEE Trans. Image Process. 24 (11) (2015) 3345–3356.
[8] A. Saha, G. Bhatnagar, Q.J. Wu, Mutual spectral residual approach for multifocus image fusion, Digital Signal Process. 23 (4) (2013) 1121–1135.
[9] X. Bai, Y. Zhang, F. Zhou, B. Xue, Quadtree-based multi-focus image fusion using a weighted focus-measure, Inf. Fusion 22 (2015) 105–118.
[10] Q. Zhang, M.D. Levine, Robust multi-focus image fusion using multi-task sparse representation and spatial context, IEEE Trans. Image Process. 25 (5) (2016) 2045–2058.
[11] G. Bhatnagar, Q.M.J. Wu, Z. Liu, Directive contrast based multimodal medical image fusion in NSCT domain, IEEE Trans. Multimedia 15 (5) (2013) 1014–1024.
[12] Z. Xu, Medical image fusion using multi-level local extrema, Inf. Fusion 19 (2014) 38–48, doi:10.1016/j.inffus.2013.01.001.
[13] Z. Xue, R.S. Blum, Concealed weapon detection using color image fusion, in: Proceedings of the 6th International Conference on Information Fusion, Vol. 1, IEEE, 2003, pp. 622–627.
[14] T. Wan, N. Canagarajah, A. Achim, Segmentation-driven image fusion based on alpha-stable modeling of wavelet coefficients, IEEE Trans. Multimedia 11 (4) (2009) 624–633.
[15] Z. Zhou, B. Wang, S. Li, M. Dong, Perceptual fusion of infrared and visible images through a hybrid multi-scale decomposition with Gaussian and bilateral filters, Inf. Fusion 30 (2016) 15–26.
[16] Y. Zhang, L. Zhang, X. Bai, L. Zhang, Infrared and visual image fusion through infrared feature extraction and visual information preservation, Infrared Phys. Technol. 83 (2017) 227–237.
[17] Y. Zhang, X. Bai, T. Wang, Boundary finding based multi-focus image fusion through multi-scale morphological focus-measure, Inf. Fusion 35 (2017) 81–101.
[18] W. Huang, Z. Jing, Evaluation of focus measures in multi-focus image fusion, Pattern Recognit. Lett. 28 (4) (2007) 493–500.
[19] Z. Zhou, S. Li, B. Wang, Multi-scale weighted gradient-based fusion for multi-focus images, Inf. Fusion 20 (2014) 60–72.
[20] P.J. Burt, E.H. Adelson, The Laplacian pyramid as a compact image code, IEEE Trans. Commun. 31 (4) (1983) 532–540.
[21] A. Toet, Image fusion by a ratio of low-pass pyramid, Pattern Recognit. Lett. 9 (4) (1989) 245–253.


[22] J.J. Lewis, R.J. O'Callaghan, S.G. Nikolov, D.R. Bull, N. Canagarajah, Pixel- and region-based image fusion with complex wavelets, Inf. Fusion 8 (2 SPEC. ISS.) (2007) 119–130.
[23] F. Nencini, A. Garzelli, S. Baronti, L. Alparone, Remote sensing image fusion using the curvelet transform, Inf. Fusion 8 (2 SPEC. ISS.) (2007) 143–156.
[24] Y. Liu, S. Liu, Z. Wang, A general framework for image fusion based on multi-scale transform and sparse representation, Inf. Fusion 24 (2015) 147–164.
[25] X. Bai, Infrared and visual image fusion through feature extraction by morphological sequential toggle operator, Infrared Phys. Technol. 71 (2015) 77–86.
[26] B. Yang, S. Li, Multifocus image fusion and restoration with sparse representation, IEEE Trans. Instrum. Meas. 59 (4) (2010) 884–892.
[27] N. Yu, T. Qiu, F. Bi, A. Wang, Image features extraction and fusion based on joint sparse representation, IEEE J. Sel. Top. Signal Process. 5 (5) (2011) 1074–1082.
[28] S. Li, H. Yin, L. Fang, Group-sparse representation with dictionary learning for medical image denoising and fusion, IEEE Trans. Biomed. Eng. 59 (12) (2012) 3450–3459.
[29] B. Yang, S. Li, Pixel-level image fusion with simultaneous orthogonal matching pursuit, Inf. Fusion 13 (1) (2012) 10–19.
[30] Y. Liu, X. Chen, Z. Wang, Z.J. Wang, R.K. Ward, X. Wang, Deep learning for pixel-level image fusion: recent advances and future prospects, Inf. Fusion 42 (2018) 158–173.
[31] Y. Liu, X. Chen, H. Peng, Z. Wang, Multi-focus image fusion with a deep convolutional neural network, Inf. Fusion 36 (2017) 191–207.
[32] H. Tang, B. Xiao, W. Li, G. Wang, Pixel convolutional neural network for multi-focus image fusion, Inf. Sci. (2017).
[33] H. Song, Q. Liu, G. Wang, R. Hang, B. Huang, Spatiotemporal satellite image fusion using deep convolutional neural networks, IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 11 (3) (2018) 821–829.
[34] K.R. Prabhakar, V.S. Srikar, R.V. Babu, Deepfuse: a deep unsupervised approach for exposure fusion with extreme exposure image pairs, in: Proceedings of the IEEE International Conference on Computer Vision, IEEE, 2017, pp. 4724–4732.
[35] H. Yan, X. Yu, Y. Zhang, S. Zhang, X. Zhao, L. Zhang, Single image depth estimation with normal guided scale invariant deep convolutional fields, IEEE Trans. Circuits Syst. Video Technol. (2017).
[36] L. Li, S. Zhang, X. Yu, L. Zhang, PMSC: patchmatch-based superpixel cut for accurate stereo matching, IEEE Trans. Circuits Syst. Video Technol. (2016).
[37] N. Silberman, D. Hoiem, P. Kohli, R. Fergus, Indoor segmentation and support inference from RGBD images, in: European Conference on Computer Vision, Springer, 2012, pp. 746–760.
[38] H. Zhao, J. Shi, X. Qi, X. Wang, J. Jia, Pyramid scene parsing network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2881–2890.
[39] V. Nair, G.E. Hinton, Rectified linear units improve restricted Boltzmann machines, in: Proceedings of the 27th International Conference on Machine Learning, Omnipress, 2010, pp. 807–814.
[40] S. Ioffe, C. Szegedy, Batch normalization: accelerating deep network training by reducing internal covariate shift, in: Proceedings of the 32nd International Conference on Machine Learning, JMLR.org, 2015, pp. 448–456.
[41] Z. Wang, A.C. Bovik, H.R. Sheikh, E.P. Simoncelli, Image quality assessment: from error visibility to structural similarity, IEEE Trans. Image Process. 13 (4) (2004) 600–612.
[42] J. Johnson, A. Alahi, L. Fei-Fei, Perceptual losses for real-time style transfer and super-resolution, in: European Conference on Computer Vision, Springer, 2016, pp. 694–711.
[43] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al., Photo-realistic single image super-resolution using a generative adversarial network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2017, pp. 4681–4690.
[44] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv:1409.1556v1 (2014).
[45] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2016, pp. 770–778.
[46] M. Nejati, S. Samavi, S. Shirani, Multi-focus image fusion using dictionary-based sparse representation, Inf. Fusion 25 (2015) 72–84.
[47] Y. Liu, X. Chen, J. Cheng, H. Peng, A medical image fusion method based on convolutional neural networks, in: 2017 20th International Conference on Information Fusion, IEEE, 2017, pp. 1–7.
[48] Y. Liu, Z. Wang, Dense SIFT for ghost-free multi-exposure fusion, J. Visual Commun. Image Represent. 31 (2015) 208–224.
[49] Y. Han, Y. Cai, Y. Cao, X. Xu, A new image fusion performance metric based on visual information fidelity, Inf. Fusion 14 (2) (2013) 127–135.
[50] C. Yang, J.-Q. Zhang, X.-R. Wang, X. Liu, A novel similarity based quality metric for image fusion, Inf. Fusion 9 (2) (2008) 156–160.
[51] M. Hossny, S. Nahavandi, D. Creighton, Comments on 'information measure for performance of image fusion', Electron. Lett. 44 (18) (2008) 1066–1067.
[52] S. Li, B. Yang, Multifocus image fusion using region segmentation and spatial frequency, Image Vision Comput. 26 (7) (2008) 971–979.
[53] W. Zhao, D. Wang, H. Lu, Multi-focus image fusion with a natural enhancement via joint multi-level deeply supervised convolutional neural network, IEEE Trans. Circuits Syst. Video Technol. (2018).
[54] A. Geiger, P. Lenz, C. Stiller, R. Urtasun, Vision meets robotics: the KITTI dataset, Int. J. Rob. Res. 32 (11) (2013) 1231–1237.
