
Learning Binary Color Filter Arrays with Trainable Hard Thresholding

Cemre O. Ayna
Department of Electrical & Computer Engineering
Mississippi State University
Mississippi State, MS, 39759
ca1389@msstate.edu
Bahadir K. Gunturk
Department of Electrical-Electronics Engineering
Medipol University
Istanbul, Turkey, 34810
bkgunturk@medipol.edu.tr
Ali C. Gurbuz
Department of Electrical & Computer Engineering
Mississippi State University
Mississippi State, MS, 39759
gurbuz@ece.msstate.edu
This work was supported by the National Science Foundation under Grant No. 2047771 (Corresponding author: Ali C. Gurbuz.) Dr. Ali Cafer Gurbuz and Cemre Omer Ayna are with the Department of Electrical and Computer Engineering at Mississippi State University, MS-39762, US. (email: gurbuz@ece.msstate.edu). Dr. Bahadir Gunturk is with the Istanbul Medipol University (email: bkgunturk@medipol.edu.tr).
Abstract

Color Filter Arrays (CFAs) are optical filters in digital cameras that capture specific color channels. Current commercial CFAs are hand-crafted patterns with different physical and application-specific considerations. This study proposes a binary CFA learning module based on hard thresholding with a deep learning-based demosaicing network in a joint architecture. Unlike most existing learnable CFAs that learn a channel from the whole color spectrum or linearly combine available digital colors, this method learns a binary channel selection, resulting in CFAs that are practical and physically implementable in digital cameras. The binary selection is based on adapting the hard thresholding operation into neural networks via a straight-through estimator, and therefore it is named HardMax. This paper includes the background on the CFA design problem, the description of the HardMax method, and the performance evaluation results. The evaluation of the proposed method includes tests for different demosaicing models, color configurations, filter sizes, and a comparison with existing methods in various reconstruction metrics. The proposed approach is tested with Kodak and BSDS500 datasets and provides higher reconstruction performance than hand-crafted or alternative learned binary filters.

Keywords: color filter array · hard thresholding · measurement learning · straight-through estimator · deep learning · demosaicing

1 Introduction

A digital camera captures an image by exposing its sensor array, in which each sensor corresponds to one pixel in the final image, to the incoming light for a certain amount of time. A Color Filter Array (CFA) is an optical filter placed on a camera sensor array. Sensors alone cannot differentiate between individual colors; instead, CFAs facilitate capturing color information by passing, at each pixel, only the frequency band of the visible light spectrum that corresponds to the selected color. The raw output of the filtered camera sensor array corresponds to an image in which each pixel contains the intensity information of only one color channel and lacks the rest.
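
As a minimal illustration of this sampling process, the sketch below models a CFA as a binary mask over an RGB image, so that each pixel of the raw output keeps exactly one channel. The RGGB tile layout and the helper names `bayer_mask` and `mosaic` are assumptions made for illustration and are not taken from this paper.

```python
import numpy as np

def bayer_mask(height, width):
    """Binary (H, W, 3) mask for an assumed RGGB Bayer tile."""
    mask = np.zeros((height, width, 3), dtype=np.float32)
    mask[0::2, 0::2, 0] = 1.0  # red on even rows, even columns
    mask[0::2, 1::2, 1] = 1.0  # green on even rows, odd columns
    mask[1::2, 0::2, 1] = 1.0  # green on odd rows, even columns
    mask[1::2, 1::2, 2] = 1.0  # blue on odd rows, odd columns
    return mask

def mosaic(image, mask):
    """Raw sensor image: one intensity per pixel, taken from the selected channel."""
    return (image * mask).sum(axis=-1)

rng = np.random.default_rng(0)
image = rng.random((4, 4, 3)).astype(np.float32)  # stand-in full-color image
mask = bayer_mask(4, 4)
raw = mosaic(image, mask)                          # shape (4, 4)
```

Every pixel of `raw` holds the value of a single channel of `image`; the two remaining channel values at that pixel are what a demosaicing algorithm has to estimate.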

Virtually all available commercial CFAs are designed by hand with different considerations depending on the camera and environment characteristics. The most commonly used CFA pattern is the Bayer filter [1]. Several other hand-crafted CFAs are also present in specific camera models such as Lukac filter [2], Kodak’s CYYM filter, Fujifilm’s X-Trans [3], CMWY filter, and Compton’s RGBW filter along with Kodak’s RGBW filter variations [4]. The hand-crafted filters used in this study for evaluation are illustrated in Figure 1. The study [2] provides an extensive review and analysis of the importance of CFA design on the final image.

Figure 1: The fixed hand-crafted CFA examples used in this study for evaluation: (a) Bayer, (b) Lukac, (c) RGBW, and (d) CFZ filters. Each pattern is extended to $8\times 8$ size for visualization.

The process following the color filtering operation in digital image acquisition is known as demosaicing, which is estimating the unknown color values in raw camera returns [5]. Common demosaicing algorithms are mainly spatial or frequency domain interpolation-based techniques [6, 7]. These algorithms show variation for specific applications and different CFA types [8, 9, 10]. Detailed reviews of the classical demosaicing algorithms for various CFAs are available in [11, 12]. A unique CFA design requires a dedicated demosaicing algorithm depending on the filter pattern and available information from the captured scene.

With the emergence of machine learning (ML) and neural networks (NN) in computational imaging, various studies that suggest using NNs in demosaicing or joint demosaicing-denoising pipelines [13, 14, 15, 16, 17] have appeared. These approaches work with raw camera return and propose various NN architectures for mapping the inputs to the full-color image. These methods show that enhanced reconstruction quality and computational speed can be achieved with deep learning (DL).

Recent studies applied ML solutions for learning a CFA pattern to address the issue of exploiting the features of natural images for high-quality full-color image reconstruction [18, 19]. Although working in the RGB domain, these solutions learn CFAs that employ the full digital color spectrum; their process learns a linear combination of all three color channels. This approach results in learning a unique color per sensor. Although weighted combinations of colors provide enhanced reconstructions compared to fixed CFAs like Bayer, these learned filters are impractical for physical implementation in commercial cameras where each pixel reads a single color from the digital color configuration. Some other studies assume working in the multispectral domain and learning MultiSpectral Filter Arrays (MSFA) [20, 21]. Although there is active research for building commercial cameras with MSFAs, these cameras are still in the prototype phase due to their high production cost and the raw image formats requiring additional operations to get the usual RGB images as the final product. Section 2.1 includes a more detailed discussion on the applicability issues of MSFAs.

Modern commercial digital cameras use color configurations with a few colors, with most of them opting for RGB configuration. For this reason, it is necessary to develop learned binary CFAs that utilize only one color channel at each pixel and still provide enhanced image reconstruction compared to hand-crafted CFAs. To the best of our knowledge, only one method presents a way to learn a true binary CFA pattern in RGBW configuration [22]. This study adapts SoftMax operation with a scalar value that increases in time exponentially in order to acquire quasi-binary filter weights from output weights.

This paper proposes an alternative method for learning a binary CFA in a joint filter-demosaicer architecture. The proposed joint framework is an end-to-end architecture with two modules; the head module learns a constrained binary CFA during training, and the tail module reconstructs a color image from filtered raw camera images. This joint learning approach enforces the learned binary CFA to be optimal for color image demosaicing. The proposed method adapts hard thresholding as band selection operation in the CFA learning module that is compatible with stochastic gradient descent. The CFAs learned with this method can be used in camera sensor arrays without the impracticality concerns since the selection of only one color channel is enforced for each pixel. Our results indicate CFAs learned with this approach provide a higher reconstruction performance than the hand-designed filters and the alternative proposed in [22]. With reference to the hard thresholding as the basis of this process, we named our binary CFA learning module HardMax. The novelties of this study can be described in the following points.

  • This study presents an NN model for joint color filtering and demosaicing modules with a novel binary CFA learning mechanism and an original high-performing demosaicer architecture.

  • The proposed CFA learning method (HardMax) is adaptable to different joint DL architectures, allowing us to learn optimum CFAs for different objectives of the architectures following the CFA module such as reconstruction or classification.

  • Unlike other proposed methods, HardMax aims to find an optimum binary CFA so that the learned CFA can be easily applied to commercial digital cameras.

  • This study includes the evaluation and analysis of different parameters that affect the learned CFAs, such as filter size, color configuration, and training data size.

The rest of the paper is structured as follows. Section 2 gives a background on the available literature on machine learning-based CFA design. Section 3 describes the proposed CFA learning and demosaicing method. Section 4 describes the dataset, training process, and evaluation. Section 5 presents results, the obtained performance metrics, and the comparison with the existing approaches. Section 6 includes discussion on the proposed approach and results, the shortcomings of the study, and the potential road map for future work. Finally, Section 7 draws the conclusions.

2 Background

2.1 Machine Learning in CFA Design

Figure 2: Visualization of the CFA learning methods used in this study for comparison. (a) Linear [19], (b) Weighted SoftMax [22]. Figures show the RGBW configuration.

Compared to demosaicing, the volume of studies on the ML-based CFA design problem is small, with the recent literature focusing on MSFAs. Some of these methods are designed for hyperspectral image acquisition [23, 24, 25], while others target commercial digital cameras [20, 21, 26]. However, multispectral image acquisition has several issues that prevent it from being used in commercial digital cameras. The first problem is the high cost of multispectral cameras due to the complexity of their production. The second issue is that a higher number of frequency bands causes the curse of dimensionality for a full-color image. Thirdly, the larger interval of missing channel values corresponds to lower resolution in the final product and more complex demosaicing algorithms. For these reasons, this study focuses on solutions for CFAs that use the available number of colors in digital image (usually three chromatic and one luminance “white” channel).

The lack of available literature on ML-based CFA design is even more striking when the literature search is constrained to the RGB case. There is only a handful of studies for modern ML techniques in CFA design [19, 27]. As an example, Henz et al. [19] introduced an autoencoder architecture as a joint filtering-demosaicing pipeline where the encoder is composed of a 3D tensor with weights for each color channel (three nonnegative unconstrained weight sets for each color channel) plus an unconstrained bias term used in a matrix multiplication with the full-color image in the training process (see Figure 2(a)). This method is included in the evaluation and will be called Linear filter in the rest of the paper for convention. Li et al. [27] propose a CFA design algorithm based on representing CFAs as overcomplete dictionaries that sample an original image and finding the set of dictionary values that minimize the mutual coherence value, for mutual coherence is an important factor in signal reconstruction and smaller coherence corresponds to higher quality reconstruction. In practice, Li’s filter corresponds to a similar linear projection operation as in [19] without a bias term.
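
The linear projection described above can be sketched as follows. This is only an illustration of the idea: the per-pixel bias shape, the random initialization, and all variable names are assumptions and do not reproduce the implementation in [19].

```python
import numpy as np

rng = np.random.default_rng(0)
N, C = 8, 3                                   # patch size and number of color channels

x = rng.random((N, N, C))                     # stand-in full-color image patch
w = np.abs(rng.normal(size=(N, N, C)))        # nonnegative weights, one per pixel and channel
b = rng.normal(size=(N, N))                   # unconstrained bias (per-pixel shape is assumed)

# Each sensor reads a weighted mixture of all channels plus the bias.
y = (w * x).sum(axis=-1) + b

# Number of channels that contribute to each pixel of the measurement.
channels_used = np.count_nonzero(w, axis=-1)
```

Since all `C` weights at a pixel are generally nonzero, every measurement mixes several channels, which is why such a filter is not a binary channel selection.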

Even though the studies [19, 27] define their search domain on the RGB channels, their final product is not a purely binary RGB filter but a filter with different linear combinations of the available channels for each pixel. This is because these studies define their CFA learning process as a linear sum operation on the RGB spectrum. In turn, the unconstrained linear combination of non-negative weights leads to a quasi-infinite number of potential colors for selection. In order to learn a CFA from the available color channels, the CFA learning problem must be redefined as a constrained selection problem rather than a weighted linear sum. There is only one study known to us that adopts such a strategy: Chakrabarti’s method in [22] uses a soft thresholding adaptation of binary selection by applying SoftMax on a set of weights for each color channel per pixel (Figure 2(b)). The weights are scaled by a separate scalar value before the SoftMax operation. The scalar is initialized as a small value and increases as training progresses, stretching the SoftMax output toward the limit values of 0 and 1 and thereby binarizing the elements of the output vector. For convenience, this approach is named Weighted SoftMax in this paper.
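
The effect of the growing scalar can be seen in a small numerical sketch. The function name `scaled_softmax`, the example weights, and the chosen scalar values are hypothetical; the actual schedule used in [22] is not reproduced here.

```python
import numpy as np

def scaled_softmax(weights, alpha):
    """SoftMax over the channel axis after multiplying the weights by alpha."""
    z = alpha * weights
    z = z - z.max(axis=-1, keepdims=True)  # subtract the maximum for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical weights of one pixel for the four RGBW channels.
weights = np.array([0.2, 0.5, 0.1, 0.4])

# A small scalar gives a soft mixture; a large scalar approaches a one-hot selection.
for alpha in (1.0, 10.0, 100.0):
    print(alpha, np.round(scaled_softmax(weights, alpha), 3))
```

The output remains a soft weighting for any finite scalar, so the filter is only quasi-binary until the weights are rounded to the selected channel.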

A recent study [28] learns a separate CFA but essentially uses the same CFA learning scheme as in [22]. Several regularization enforced binary coded aperture learning mechanisms are yet to be adapted to color filter selection [29, 30]. This paper aims to introduce a new approach for learning a constrained binary CFA that leads to a higher reconstruction performance than the available solution.

2.2 DL-Based Demosaicing

There is a plethora of demosaicing algorithms designed to work with different CFA filters or address problems in image recovery. The classical approaches can be grouped into three categories according to their strategies [11]; complex interpolation algorithms accounting geometry or optics [31, 6], heuristic algorithms that build the digital image upon iterative recovery (for instance, recovering luminance, then chrominance) [32, 33], or statistical models assuming an interpixel or interchannel dependency (like sparsity-based interpolation methods) [9, 34]. Because of the lack of information about the interpixel and interchannel dependency and the high variability of these values, each approach has its drawbacks and issues.

The DL-based demosaicing methods emerged after neural networks proved to be efficient function approximators in many computer vision applications. It is important to note that the demosaicing problem is an underdetermined reconstruction problem [35], and this fact is exploited heavily by the sparse representation-based demosaicing algorithms that employ dictionary search or other methods developed for compressed sensing (CS) [9, 34]. For the same reason, the available DL-based demosaicing models share similarities with the DL-based image reconstruction and super-resolution models.

Here, we present the three DL-based demosaicing models selected for comparison during the evaluation of our demosaicing model in this study. The reason for selecting these studies is that, just like our proposed framework, all three models are used alongside a CFA learning algorithm in their respective studies. Common to all approaches, we assume an image patch of size $N\times N\times C$ as the final output, where $N$ is the size of both the width and the height, and $C$ is the number of channels. This patch is one of the non-overlapping adjacent patches extracted from the original image.

The first demosaicing architecture is found in [19] (Linear) and it was inspired by autoencoders and fully convolutional network architectures. The input is an $N\times N\times(2C+1)$-sized feature map, which includes $C$ individual color channels (called submosaics), $C$ submosaics interpolated with a k-neighboring kernel, and the monochromatic raw image. The separate inclusion of submosaics and their interpolated versions helps the network skip the procedure of interpolation and channel separation in order to let the model focus on the reconstruction. The interpolations of submosaics are created and concatenated to the raw image with an interpolation kernel between the filtering (encoding) and demosaicing (decoding) operations. The rest of the demosaicing model is a fully convolutional network with 12 layers in total. The kernel size is fixed to $3\times 3$. The first six layers have 64 kernels, while the last six have 128. At the end of the model, the input raw camera sensor matrix is concatenated with the output of the last convolutional layer. Then, this tensor is passed through a final convolutional layer to get the reconstructed full-image patch.
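
The construction of the $N\times N\times(2C+1)$ input can be sketched as below. The interpolation kernel of [19] is not specified here, so a simple box average over known neighbors stands in for it; the random CFA, the helper `box_interpolate`, and its parameter `k` are likewise assumptions for illustration.

```python
import numpy as np

def box_interpolate(values, known, k=1):
    """Average of the known samples in a (2k+1) x (2k+1) window around each pixel.

    `values` must be zero wherever `known` is zero.
    """
    h, w = values.shape
    padded_values = np.pad(values, k)   # zero padding at the borders
    padded_known = np.pad(known, k)
    total = np.zeros((h, w))
    count = np.zeros((h, w))
    for dy in range(2 * k + 1):
        for dx in range(2 * k + 1):
            total += padded_values[dy:dy + h, dx:dx + w]
            count += padded_known[dy:dy + h, dx:dx + w]
    return total / np.maximum(count, 1.0)  # windows without known samples yield 0

rng = np.random.default_rng(0)
N, C = 8, 3
image = rng.random((N, N, C))

# Hypothetical binary CFA: one randomly chosen channel per pixel.
choice = rng.integers(0, C, size=(N, N))
mask = np.eye(C)[choice]                       # (N, N, C) one-hot mask

submosaics = image * mask                      # C sparse color channels
raw = submosaics.sum(axis=-1, keepdims=True)   # monochromatic raw image
interpolated = np.stack(
    [box_interpolate(submosaics[..., c], mask[..., c]) for c in range(C)],
    axis=-1,
)
features = np.concatenate([submosaics, interpolated, raw], axis=-1)
```

The resulting `features` tensor has $2C+1$ channels: the submosaics, their interpolated versions, and the raw image.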

The second demosaicing architecture is adapted from the compared alternative study (Weighted SoftMax) [22]. In this architecture, the input is a $3N\times 3N\times 1$-size raw camera measurement of an $N\times N$ image patch with its neighboring area, and the output is an $N\times N\times C$-size reconstructed central full-color image block. The reason for using surrounding patches is to use the information around the central patch to reinforce the reconstruction quality and prevent artifacts. The demosaicing architecture includes two parallel streams creating color channel priors. The first stream consists of a fully connected (FC) layer with $P\times P\times 3K$ neurons followed by a reshaping operation and a $1\times 1$ convolutional layer. The purpose of this stream is to extract all the color information from the raw sensor readings. In the original study, the FC layer is preceded by a natural logarithm and succeeded by an exponential operation. In the evaluations, this approach caused the training accuracy to fluctuate and even diverge from a solution; thus, we had to discard it. The second stream is an encoder composed of a group of convolutional layers with $F$ kernels and ReLU activation function, followed by an FC layer with $N\times N\times 3K$ neurons and a reshape operation. This stream’s purpose is to capture spatial features independent of the color channel to augment the estimation of the absent color values. In our comparison, the $F$ and $K$ values are 128 and 32, respectively. The same values were used in this study.

The third demosaicing architecture is based on a DL-based demosaicing model proposed in [14]. In the suggested design, the raw camera return is processed in two different modules to reconstruct two different kinds of information: low-frequency color (chrominance) and high-frequency shapes (luminance). The luminance reconstruction network returns a single matrix intended to learn the grayscale information which carries most of the low-level information, such as edges and patterns. This network consists of only one hidden convolutional layer and one output convolutional layer. Since there are no further details on this network, we chose the filter sizes for all layers as $3\times 3$, while the hidden layer has 64 filters and the output layer has one filter. On the other hand, the chrominance network is an autoencoder and returns an $N\times N\times 3$-size output. As with the luminance network, the original paper does not mention the actual architecture. For this reason, this study devised an autoencoder with 3 convolutional layers and 3 deconvolutional layers in total. The first three layers have 64, 128, and 256 kernels of $3\times 3$ size, respectively, with $2\times 2$ stride. The three deconvolutional layers following this recover the same shape to create the $N\times N\times 3$ color information matrix. The outputs of the luminance and the chrominance networks are then summed up to create the final reconstruction.

3 Proposed Method

The proposed HardMax layer for learning CFA along with the demosaicer model is an extension of the study presented in [36]. In this section, we will explain the details of the binary CFA learning algorithm, the architecture of the proposed demosaicer model, and the use of both modules in a joint framework.

Figure 3: The full joint model in forward passing in the RGBW configuration. The output of the color filtering module represents the raw sensor input and is passed to the demosaicing module as input.

3.1 Joint Binary CFA Learning and Demosaicing Architecture

An important aspect of DL-based solutions is their capability to combine multiple optimization problems into a single framework. For our framework, we define two separate objectives. The primary objective is to learn a binary CFA, and the secondary objective is to reconstruct a color image from raw sensor inputs with the given binary CFA. The goal is to achieve both these objectives together in a single DL architecture that learns binary CFAs that result in high reconstruction performance. Mainly, joint frameworks have the advantage of working in a combined search space, therefore eliminating the risk of overshooting in the overall process while performing singular tasks as well as separate models. Different versions of joint CFA and demosaicing solutions have also been used in the studies [19, 22] as detailed in Section 2.

The full neural network model proposed in this study is a combination of the binary CFA learning module detailed in Section 3.2 and the demosaicing model based on image reconstruction networks described in Section 3.3. Figure 3 shows a visual representation of the full model in a single forward propagation. During training, the full model takes an $N\times N\times C$ image patch $x$ as input and returns a reconstruction of the image patch $\hat{x}$ as output. The first module of the joint model is a binary CFA learning module that acts as a simulated color filter and returns an output $y$ which corresponds to raw camera sensor observations. This output is then passed into the demosaicing model for reconstruction of the full-color image. The pseudoimage $\hat{x}^{(0)}$ returned by the first convolution operation of the demosaicing module is refined by consecutive refinement submodules. The complete architecture is learned jointly by minimizing a common loss function that penalizes the mean squared error between reconstructed and labeled images.

The main consideration in any end-to-end DL framework is guaranteeing differentiability in every step. The current method of solution searching in NNs involves stochastic gradient descent (SGD), an update algorithm that uses the gradient operation in calculating weight change increments. This is a critical problem specifically in the binary CFA learning module as it will be detailed in the next section.

3.2 HardMax: The Binary CFA Learning Module

Figure 4: Description of the HardMax filter in (a) forward propagation phase and (b) backpropagation phase during training. The binary selection is expanded as thresholding by the maximum weight value and normalization operations in the network.

The problem of adapting the binary selection operation in neural networks originates from the thresholding function being non-differentiable. A solution to the problem can be achieved with the Straight-Through Estimator (STE). The STE is a gradient estimation technique for functions that are not differentiable at some points of the parameter domain [37]. Assigning an STE to a non-differentiable function begins with observing the function’s behavior in the concerned domain, then looking for a set of candidate functions that produce an output similar to the actual function and are differentiable in that domain. The basic idea of our application is based on the study in [38]. In the case of binary selection, we redefine the process first as hard thresholding followed by normalization. In hard thresholding, the gradient is carried only when the weight value is higher than the threshold value, in which case the gradient is equal to one. An alternative function that behaves exactly the same in the relevant domain is the identity function. Therefore, the identity function’s derivative (i.e., just the value 1) can be selected as the STE of the hard thresholding operation.
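
A minimal sketch of this idea, written with explicit forward and backward functions, is given below. The function names are hypothetical, and passing the incoming gradient unchanged to every weight is an assumption of this sketch; the text does not state whether the gradient is also passed to weights below the threshold.

```python
import numpy as np

def hard_threshold_forward(w, threshold):
    """Keep entries at or above the threshold and set the rest to zero."""
    return np.where(w >= threshold, w, 0.0)

def hard_threshold_backward_ste(grad_output):
    """Straight-through estimator: treat the operation as the identity,
    so the local derivative is 1 and the incoming gradient passes through."""
    return grad_output * 1.0

w = np.array([0.3, 0.9, 0.5])                 # weights of one pixel
out = hard_threshold_forward(w, threshold=w.max())

grad_out = np.array([0.1, -0.2, 0.4])         # stand-in gradient from later layers
grad_w = hard_threshold_backward_ste(grad_out)
```

Here only the largest weight survives the forward pass, while the backward pass hands the gradient to the weights without modification.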

With a solution for direct implementation of hard thresholding provided, we assume the problem of sampling an image block with $C$ channels and $N\times N$ size. The objective of the learnable discrete CFA module is to select one channel in every pixel and discard the rest, creating $N\times N$ measurements as an output. The HardMax module manages this by initializing an $N\times N\times C$-sized tensor with uniformly random values in the interval $[0,1]$ as real-valued weights. The weights do not have to be constrained into a specific interval, and their sign is irrelevant since the actual information encoded with the weights is their relative magnitude with respect to each other. The highest weight value in each pixel represents the selected channel. This weight is defined as the threshold value. All pixel weights are passed into thresholding, as shown in Fig. 4. The resulting vector is then binarized to convert the actual weight value into a binary mask.

The overall masking process is simple to implement, and with a preassigned gradient value, the backpropagation operation is computationally efficient with $N^{2}$ operations in total. For this reason, HardMax takes small computational resources and time both in training and evaluation.
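
The forward pass of the masking process can be sketched in a few lines. This is an illustrative reading of the description above and not the released implementation: the function name `hardmax_mask` is hypothetical, and exact ties between weights, which would select more than one channel, are ignored because they are improbable for continuous random initial values.

```python
import numpy as np

def hardmax_mask(weights):
    """Binary mask over the channel axis marking the largest weight of each pixel."""
    threshold = weights.max(axis=-1, keepdims=True)   # per-pixel threshold value
    return (weights >= threshold).astype(weights.dtype)

rng = np.random.default_rng(0)
N, C = 8, 4                                        # 8 x 8 filter in an RGBW-like configuration
weights = rng.uniform(0.0, 1.0, size=(N, N, C))    # real-valued trainable weights
mask = hardmax_mask(weights)

x = rng.random((N, N, C))                          # stand-in full-color patch
sparse = x * mask                                  # one retained channel value per pixel
```

Only the relative order of the weights at a pixel matters for the mask, so shifting all weights of a pixel by the same constant leaves the selection unchanged.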

3.3 Demosaicing Module

Figure 5: Description of the proposed demosaicer model. The final loss includes intermediate reconstructions.

The HardMax module is designed to be the head of a larger joint network. This makes the HardMax adaptable to any DL-based demosaicing model as a preliminary learnable filter. For this study, we propose a new demosaicing model; a multi-stage convolutional neural network that learns to create a pseudoimage and to refine it through multiple refinement blocks with a cumulative loss function of refinement blocks.

Our demosaicing architecture is illustrated in Figure 5. The main structure is designed with ideas from image reconstruction models [39] and [40]. The demosaicing model takes the $3N\times 3N\times C$-size output of the HardMax module (representing the sparse raw camera returns) as input and returns a $3N\times 3N\times 3$-size reconstructed RGB image, with the desired $N\times N\times C$ reconstruction in the middle. The reason for the larger input-output patches is to counter the artifact effects that occur at the edges of reconstructed patches; the same strategy was used in [22] and yields good results.

The model first converts the raw camera returns into a pseudoimage, similar to [19]. But instead of a hand-picked interpolation kernel as in [19], our model creates a pseudoimage by applying a convolutional layer with 3 kernels of $9\times 9$ size with padding for equal-sized output and no bias. The coarse reconstruction is enforced by including the pseudoimage and the consecutive refinement outputs into the loss function as extra terms enforcing them to be similar to the labeled images as well.

The pseudoimage is then pushed into three consecutive refinement submodules, each of which is composed of three convolutional layers followed by ReLU activation functions. The first layer has 128 kernels of $7\times 7$ size, the second layer has 64 kernels of $5\times 5$ size, and the third layer has 3 kernels of $3\times 3$ size. All layers apply padding to preserve the image size and include an $\ell_{2}$ regularizer. Between each refinement block, there are skip connections added to carry over the gradient. The model’s final layer is another convolutional layer with 3 kernels of $3\times 3$ size and a ReLU activation function.
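
To give a sense of the model size implied by these layer settings, the following sketch counts trainable convolution parameters. It rests on several assumptions that the description does not confirm: an RGB input ($C=3$), additive skip connections (so every refinement block receives 3 channels), and bias terms in all layers except the pseudoimage layer.

```python
def conv2d_params(kernel, in_channels, out_channels, bias=True):
    """Trainable parameters of a 2D convolution with square kernels."""
    weights = kernel * kernel * in_channels * out_channels
    return weights + (out_channels if bias else 0)

C = 3  # assumed number of input channels (RGB configuration)

# Pseudoimage layer: 3 kernels of 9 x 9 size, no bias.
pseudoimage = conv2d_params(9, C, 3, bias=False)

# One refinement block: 128 kernels of 7 x 7, 64 kernels of 5 x 5, 3 kernels of 3 x 3.
refinement = (
    conv2d_params(7, 3, 128)
    + conv2d_params(5, 128, 64)
    + conv2d_params(3, 64, 3)
)

# Final layer: 3 kernels of 3 x 3 size.
final = conv2d_params(3, 3, 3)

total = pseudoimage + 3 * refinement + final
print(pseudoimage, refinement, final, total)
```

Under these assumptions the $5\times 5$ layer of each refinement block dominates the count; a different skip-connection design (for example concatenation) would change the input channel numbers and therefore the totals.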

The model’s final loss function is defined as the sum of the mean squared error (MSE) losses computed between the corresponding label image and the pseudoimage, each refinement module output, and the final output, over the total $T$ training image patches in a training batch, as given in (1):

$$\mathcal{L}_{total}=\sum_{i=1}^{T}\sum_{n=0}^{4}\left(x_{i}-\hat{x}_{i}^{(n)}\right)^{2}, \tag{1}$$

where $x_{i}$ represents the $i$-th training sample and $\hat{x}_{i}^{(n)}$ is the pseudoimage or the $n$-th refinement module output as shown in Fig. 5. $\hat{x}_{i}^{(4)}$ denotes the final reconstructed image output of the model.
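
The accumulation in (1) can be sketched as follows. The function name `total_loss` and the array layout are assumptions; the sketch follows (1) literally and sums squared errors, since (1) does not show whether each term is additionally averaged over pixels.

```python
import numpy as np

def total_loss(targets, stage_outputs):
    """Sum of squared errors over all samples and all reconstruction stages.

    targets:       array of shape (T, H, W, 3) holding the label patches x_i
    stage_outputs: list of arrays with the same shape, one per stage n
                   (pseudoimage first, final reconstruction last)
    """
    return sum(float(np.sum((targets - out) ** 2)) for out in stage_outputs)

# Toy example: 2 patches of size 2 x 2 x 3 and 5 stages (n = 0, ..., 4).
targets = np.zeros((2, 2, 2, 3))
stage_outputs = [np.ones_like(targets) for _ in range(5)]
loss = total_loss(targets, stage_outputs)
```

Because every stage contributes its own error term, the pseudoimage and the intermediate refinements are pushed toward the label image as well, and not only the final output.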

In our evaluations, we tested the joint binary CFA learning and demosaicing model with three existing DL-based demosaicing models (explained in Section 2.2) by only replacing the proposed demosaicing model. In this way, we show that the joint architecture allows the proposed binary CFA learning module to be used with different demosaicing models, and compare the effectiveness of demosaicer models on the learned filter and the overall performance. In addition, any future enhanced DL-based demosaicing model can be integrated with the binary CFA learning module both to learn enhanced CFA filters and achieve better reconstruction results.

4 Dataset and Training

4.1 Dataset and Setup

The training dataset used in this study is created from the training and validation images of the BSDS500 dataset [41], totaling 400 images. BSDS500 is a collection of $481\times 321$ px-sized RGB images originally published for segmentation benchmarks but can be found commonly as training data in other computer vision tasks, including image demosaicing. The training image patches are $3N\times 3N$ non-overlapping side-by-side blocks from this dataset, and the total number of the training patches is 881,600 in the case of $N=8$ patch size.
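
The reported patch count can be checked with simple arithmetic. One reading that reproduces it, which is an assumption of this sketch and is not stated explicitly in the text, is that the central $N\times N$ blocks tile the image without overlap and only blocks with a complete $3N\times 3N$ neighborhood inside the image are kept. The function name is hypothetical.

```python
def count_context_patches(height, width, n):
    """Count the N x N blocks whose full 3N x 3N neighborhood lies inside the image."""
    blocks_h = height // n   # complete blocks along the height
    blocks_w = width // n    # complete blocks along the width
    # Blocks on the outer ring lack a full neighborhood and are skipped.
    return max(blocks_h - 2, 0) * max(blocks_w - 2, 0)

per_image = count_context_patches(321, 481, 8)
total = 400 * per_image
print(per_image, total)  # prints: 2204 881600
```

The count is the same for portrait and landscape images, since only the two side lengths matter.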

We used two different test image datasets. The first dataset is composed of 20 images selected from BSDS500’s test images, chosen based on their complexity in contour and texture. The second test dataset used in this study is the Kodak dataset [42], which is composed of 24 different images of $768\times 512$ size. Performance metrics of individual datasets were recorded and compared separately.

4.2 Training

Training and evaluation of all models were conducted on an Nvidia A6000 GPU. The source code is implemented in Python 3.10 with the TensorFlow library and its Keras module. The loss function used in all demosaicing models is the mean squared error (MSE) loss as defined in (1). The Adaptive Moment Estimation (ADAM) optimizer [43] is used during backpropagation, with the decay rates $\beta_1$ and $\beta_2$ and the division constant $\hat{\epsilon}$ set to $0.9$, $0.999$, and $10^{-7}$, respectively. The learning rate follows an exponential decay from 0.0001 to 0.00001.
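As a rough illustration of the learning-rate schedule, the sketch below interpolates geometrically between the two stated endpoints. The function name `exponential_lr`, the per-step granularity, and the total number of steps are assumptions; the text only specifies the start and end values and the ADAM constants.

```python
def exponential_lr(step, total_steps, lr_start=1e-4, lr_end=1e-5):
    """Learning rate that decays geometrically from lr_start to lr_end."""
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return lr_start * (lr_end / lr_start) ** fraction

# ADAM constants as stated in the text.
ADAM_CONFIG = {"beta_1": 0.9, "beta_2": 0.999, "epsilon": 1e-7}

total_steps = 1000  # hypothetical training length, not taken from the paper
schedule = [exponential_lr(s, total_steps) for s in (0, 500, 1000)]
```

With this form the rate falls by the same factor over every equal interval of steps, which is what distinguishes exponential decay from a linear ramp.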

The source code used for training and testing the proposed HardMax model along with the compared models can be found in the GitHub repository.

4.3 Evaluation

The performance metrics used in the evaluation of the proposed approach are the Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index Metric (SSIM). These metrics are well established in the digital image processing literature for evaluating image reconstruction performance. PSNR measures the distortion of the reconstruction relative to the original signal in decibels, and a higher PSNR value indicates better reconstruction performance. PSNR is computed as in (2), where the mean squared error ($MSE$) is computed over the whole reconstructed image.

$$PSNR = 10\log_{10}\left(255^{2}/MSE\right) \qquad (2)$$
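The PSNR definition in (2) translates directly into a few lines of NumPy. The sketch below assumes 8-bit images with a peak value of 255; the function name `psnr` and the toy arrays are illustrative only.

```python
import numpy as np

def psnr(reference, reconstruction, peak=255.0):
    """PSNR in dB, with the MSE taken over the whole image."""
    reference = np.asarray(reference, dtype=np.float64)
    reconstruction = np.asarray(reconstruction, dtype=np.float64)
    mse = np.mean((reference - reconstruction) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

# Toy example: a constant error of 5 gray levels gives MSE = 25.
reference = np.full((4, 4, 3), 100.0)
reconstruction = reference + 5.0
value = psnr(reference, reconstruction)
```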

SSIM is used to compare a reconstruction with its ground truth signal in terms of localized similarity. For natural images, an SSIM score typically takes a value between 0 and 1, where a higher score indicates that the reconstructed image resembles the original image more closely. The calculation of SSIM involves the means and variances of the original ($\mu_{x}$, $\sigma_{x}^{2}$) and the reconstructed ($\mu_{\hat{x}}$, $\sigma_{\hat{x}}^{2}$) images, along with their covariance ($\sigma_{x\hat{x}}$). The division stabilizer constants $c_{1}$ and $c_{2}$ in the SSIM definition in (3) are chosen as $(255 \cdot 0.01)^{2}$ and $(255 \cdot 0.03)^{2}$, respectively.

$$SSIM=\frac{(2\mu_{x}\mu_{\hat{x}}+c_{1})(2\sigma_{x\hat{x}}+c_{2})}{(\mu_{x}^{2}+\mu_{\hat{x}}^{2}+c_{1})(\sigma_{x}^{2}+\sigma_{\hat{x}}^{2}+c_{2})} \qquad (3)$$
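To make the terms in (3) concrete, the following sketch evaluates the formula once over a whole image. This is a simplification: SSIM is normally computed over local windows and averaged, and the windowing used in the evaluation is not specified here. The function name `global_ssim` and the random test data are assumptions.

```python
import numpy as np

C1 = (255 * 0.01) ** 2  # stabilizer for the mean term
C2 = (255 * 0.03) ** 2  # stabilizer for the variance term

def global_ssim(x, x_hat):
    """Evaluate the SSIM formula with statistics taken over the full image."""
    x = np.asarray(x, dtype=np.float64)
    x_hat = np.asarray(x_hat, dtype=np.float64)
    mu_x, mu_y = x.mean(), x_hat.mean()
    var_x, var_y = x.var(), x_hat.var()
    cov = np.mean((x - mu_x) * (x_hat - mu_y))
    num = (2 * mu_x * mu_y + C1) * (2 * cov + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2)
    return num / den

# Hypothetical data: a random image and a noisy copy of it.
rng = np.random.default_rng(0)
x = rng.uniform(0, 255, size=(32, 32))
noisy = np.clip(x + rng.normal(0, 20, size=x.shape), 0, 255)
same = global_ssim(x, x)
degraded = global_ssim(x, noisy)
```

An identical pair scores 1, and any distortion lowers the score, which matches the interpretation given for the metric.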

5 Results

The proposed joint binary CFA and demosaicing model is evaluated under six different scenarios. They are briefly summarized as follows:

  • The first test in Sec. 5.1 compares the performance of the proposed demosaicer with alternative demosaicers outlined in Sec. 3.3.

  • The second test in Sec. 5.2 evaluates the effect of the CFA size on the final reconstruction quality. The learned CFAs for each size under both RGB and RGBW configurations are provided.

  • The third test in Sec. 5.3 compares the performance of the CFA learned by the proposed HardMax module with respect to the popular hand-crafted fixed CFAs such as Bayer, Lukac, and others.

  • The fourth test in Sec. 5.4 compares the performance of the proposed architecture with respect to the alternative CFA learning studies in [22] and [19].

  • The fifth test in Sec. 5.5 analyzes the convergence of the proposed method and the change of the learned CFAs during the training progress. The learned filters at different epochs and the corresponding image reconstructions and performance metrics are provided.

  • The sixth and final test in Sec. 5.6 evaluates the effect of the training data size on both the learned CFAs and the final reconstruction quality under cases from 250 thousand up to 1 million training image patches.

5.1 Comparison of Demosaicing Models

In the first analysis, we compare the performance of the proposed demosaicing model with the existing models [19, 22, 14] presented in Section 2.2. All demosaicing models are used in the joint framework by keeping the HardMax CFA learning module fixed and changing only the demosaicing models. A total of eight different joint models are trained and tested with the combination of four different demosaicing modules for both color configurations (RGB and RGBW). The average PSNR and SSIM metrics for Kodak and BSDS500 test datasets are provided in Table 1, with results for RGB and RGBW configurations presented separately.

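| Dataset | Metric | Model in [19] | Model in [22] | Model in [14] | Proposed |
|---|---|---|---|---|---|
| Kodak | PSNR | 38.321 | 37.738 | 32.252 | **40.451** |
| Kodak | SSIM | 0.9767 | 0.9714 | 0.9366 | **0.9844** |
| BSDS500 | PSNR | 38.247 | 36.721 | 30.591 | **40.054** |
| BSDS500 | SSIM | 0.9853 | 0.9774 | 0.9353 | **0.9897** |

(a) RGB

| Dataset | Metric | Model in [19] | Model in [22] | Model in [14] | Proposed |
|---|---|---|---|---|---|
| Kodak | PSNR | 40.632 | 37.959 | 33.359 | **41.881** |
| Kodak | SSIM | 0.9863 | 0.9739 | 0.9458 | **0.9871** |
| BSDS500 | PSNR | 40.561 | 36.919 | 31.168 | **41.181** |
| BSDS500 | SSIM | **0.9919** | 0.979 | 0.9392 | 0.9918 |

(b) RGBW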
Table 1: Comparison of different DL-based demosaicing models in [19, 22, 14] used alongside the proposed HardMax CFA module for (a) RGB, and (b) RGBW configurations. The best performance for each metric is shown in bold.

The proposed demosaicing architecture shows the highest PSNR values for both RGB and RGBW cases. According to Table 1, for the Kodak and BSDS500 test datasets respectively, it provides 2.13 dB and 1.81 dB higher PSNR in the RGB case, and 1.25 dB and 0.62 dB higher PSNR in the RGBW case, compared to the second-best demosaicer, the model in [19]. For the SSIM metric, the proposed model and the model in [19] perform better than the other two, with the proposed model ahead in most of the test cases.

Categorically, we observe that the models with feed-forward fully convolutional architectures (i.e., the proposed model and [19]) perform better than models with parallel architectures that separate the reconstruction of color and texture information, such as [22] and [14]. The proposed approach adapts deep learning-based image reconstruction to the demosaicing problem. Unlike the model in [19], which uses a hand-crafted kernel for an initial reconstruction, our model lets a dedicated convolutional layer learn to reconstruct a pseudo-image, and it uses skip connections and a combined final loss function to reinforce the reconstruction quality. Given the enhanced performance of the proposed demosaicing model, the remaining evaluation scenarios use it in the joint architecture along with the HardMax CFA learning module.

5.2 Analysis on the CFA Size

In the second analysis, we investigate the effect of the CFA size on reconstruction performance. Four different CFA sizes ($4\times 4$, $8\times 8$, $12\times 12$, and $16\times 16$) have been tested with the proposed joint model under both RGB and RGBW color configurations. The achieved average PSNR and SSIM metrics over the test datasets for varying CFA sizes are presented in Table 2, with results for RGB and RGBW configurations presented separately.

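| Dataset | Metric | 4x4 | 8x8 | 12x12 | 16x16 |
|---|---|---|---|---|---|
| Kodak | PSNR | 38.561 | 40.451 | 39.535 | 38.865 |
| Kodak | SSIM | 0.9777 | 0.9844 | 0.9824 | 0.9813 |
| BSDS500 | PSNR | 37.57 | 40.054 | 38.676 | 37.880 |
| BSDS500 | SSIM | 0.9826 | 0.9897 | 0.9873 | 0.9856 |

(a) RGB

| Dataset | Metric | 4x4 | 8x8 | 12x12 | 16x16 |
|---|---|---|---|---|---|
| Kodak | PSNR | 40.189 | 41.881 | 39.89 | 38.988 |
| Kodak | SSIM | 0.9842 | 0.9871 | 0.9825 | 0.9814 |
| BSDS500 | PSNR | 39.700 | 41.181 | 38.59 | 37.878 |
| BSDS500 | SSIM | 0.9891 | 0.9918 | 0.987 | 0.9848 |

(b) RGBW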
Table 2: Comparison of different CFA sizes for (a) RGB and (b) RGBW configurations.

Our first observation from Table 2 is that the highest PSNR and SSIM performance is achieved with the $8\times 8$ CFA size in both RGB and RGBW color configurations. Using CFA sizes smaller or larger than $8\times 8$ affects the final reconstruction negatively. Another important point is that the presence of a luminance channel generally boosts the final performance, most clearly for the $4\times 4$ and $8\times 8$ sizes, leading to higher PSNR and SSIM results. This suggests that colors can be interpolated from sparser color samples when luminance information is present.

Figure 6: CFAs with different sizes learned by the joint HardMax CFA optimization and the proposed demosaicing model. (a) $4\times 4$, (b) $8\times 8$, (c) $12\times 12$, (d) $16\times 16$ filters for RGB configuration; (e) $4\times 4$, (f) $8\times 8$, (g) $12\times 12$, (h) $16\times 16$ filters for RGBW configuration.

The learned CFAs for each size are visualized in Fig. 6 for both RGB and RGBW configurations. While the learned filters do not exactly match any existing fixed CFA pattern, they show some level of regularity in sampling a color channel. One observation is that the learned CFAs prioritize the blue channel over the others in most instances. The second most common channel is red for RGB and luminance for RGBW cases. This is an interesting contrast to the intuition behind hand-crafted CFAs such as Bayer, Lukac, and X-Trans, where the green channel is prioritized under the assumption that the human eye is more sensitive to green channel information. While the inherent features of digital images might have driven the CFA learning module to pick blue and white channels in order to minimize the defined mean squared loss term, we believe it is important to understand the reasons DL models make these selections.
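The channel prioritization discussed above can be quantified by counting how often each channel is selected in a binary CFA. The sketch below is only an illustration of that bookkeeping: the $4\times 4$ index pattern is a made-up example and is not one of the learned filters, and the helper name `channel_share` is hypothetical.

```python
import numpy as np

CHANNELS = ("R", "G", "B")

def channel_share(cfa_indices, n_channels):
    """Fraction of CFA pixels assigned to each channel index."""
    counts = np.bincount(cfa_indices.ravel(), minlength=n_channels)
    return counts / cfa_indices.size

# Hypothetical 4 x 4 pattern (0 = R, 1 = G, 2 = B), NOT a CFA from the paper.
cfa = np.array([[2, 0, 2, 1],
                [0, 2, 0, 2],
                [2, 1, 2, 0],
                [0, 2, 1, 2]])
share = dict(zip(CHANNELS, channel_share(cfa, len(CHANNELS))))
```

For an RGBW filter the same function applies with four channel indices instead of three.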

It is also important to note that this study deals specifically with the image reconstruction problem; hence the learned CFAs are objective-specific, trained and optimized for the reconstruction task. The HardMax CFA learning module combined with a DNN model for a different objective may therefore prioritize different channels.

5.3 Comparison of Learned CFAs with Hand-Crafted CFAs

In the third analysis, we compare the performance of the CFAs learned by the proposed HardMax approach with hand-crafted fixed CFAs using the same proposed demosaicer model. For this test, two RGB (Bayer and Lukac) and two RGBW (RGBW and CFZ) filters, shown in Figure 1, were selected. For each fixed CFA, the training image patches are filtered with the respective filter beforehand, and the proposed demosaicing model is trained on the filtered training dataset separately for each CFA type. All cases are evaluated with the $8\times 8$ CFA size. The achieved average PSNR and SSIM metrics over the two test datasets can be found in Table 3.
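Filtering patches with a fixed CFA amounts to multiplying each patch by a binary mask that keeps one channel per pixel. The sketch below shows this for a Bayer pattern. It assumes an RGGB arrangement and a single-channel mosaic output; the function names `bayer_mask` and `apply_cfa` and the random patches are illustrative and may differ from the preprocessing actually used.

```python
import numpy as np

def bayer_mask(size=8):
    """Binary size x size x 3 mask with one channel per pixel (RGGB assumed)."""
    mask = np.zeros((size, size, 3), dtype=np.float32)
    mask[0::2, 0::2, 0] = 1.0  # red on even rows, even columns
    mask[0::2, 1::2, 1] = 1.0  # green on even rows, odd columns
    mask[1::2, 0::2, 1] = 1.0  # green on odd rows, even columns
    mask[1::2, 1::2, 2] = 1.0  # blue on odd rows, odd columns
    return mask

def apply_cfa(patches, mask):
    """Filter N x H x W x 3 patches into N x H x W single-channel mosaics."""
    return np.sum(patches * mask[None], axis=-1)

# Hypothetical random patches standing in for training data.
rng = np.random.default_rng(0)
patches = rng.uniform(0, 255, size=(5, 8, 8, 3)).astype(np.float32)
mask = bayer_mask(8)
mosaics = apply_cfa(patches, mask)
```

Because the mask is one-hot along the channel axis, any other fixed binary CFA can be substituted by changing only the mask.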

| Dataset | Metric | Bayer | Lukac | Proposed |
|---|---|---|---|---|
| Kodak | PSNR | 39.036 | 38.683 | 40.451 |
| Kodak | SSIM | 0.9786 | 0.9808 | 0.9844 |
| BSDS500 | PSNR | 39.584 | 38.897 | 40.054 |
| BSDS500 | SSIM | 0.9877 | 0.9873 | 0.9897 |

(a) RGB

| Dataset | Metric | RGBW | CFZ | Proposed |
|---|---|---|---|---|
| Kodak | PSNR | 41.051 | 38.643 | 41.881 |
| Kodak | SSIM | 0.9866 | 0.9809 | 0.9871 |
| BSDS500 | PSNR | 40.820 | 37.981 | 41.181 |
| BSDS500 | SSIM | 0.9913 | 0.9859 | 0.9918 |

(b) RGBW

Table 3: Comparison between the performance of the HardMax CFAs and hand-crafted CFAs with the proposed demosaicing model for (a) RGB and (b) RGBW configurations.

The results show that the proposed demosaicer with the learned HardMax CFA achieves the highest reconstruction performance in both RGB and RGBW configurations for both test datasets. Since the demosaicing model is the same for each CFA, this analysis shows that the demosaicer trained together with a learned CFA surpasses the hand-crafted CFAs in reconstruction quality. This is important in showing that filter learning incorporated into the demosaicing in a single training pipeline exploits the features of the training data for better reconstruction than a general-purpose fixed CFA.

| Dataset | Metric | Linear (unconstrained) | Weighted SoftMax (constrained) | HardMax (constrained) |
|---|---|---|---|---|
| Kodak | PSNR | 49.435 | 39.034 | 40.451 |
| Kodak | SSIM | 0.9987 | 0.9782 | 0.9844 |
| BSDS500 | PSNR | 49.426 | 37.819 | 40.054 |
| BSDS500 | SSIM | 0.9991 | 0.9637 | 0.9897 |

(a)

| Dataset | Metric | Linear (unconstrained) | Weighted SoftMax (constrained) | HardMax (constrained) |
|---|---|---|---|---|
| Kodak | PSNR | 49.389 | 39.407 | 41.881 |
| Kodak | SSIM | 0.9982 | 0.9788 | 0.9871 |
| BSDS500 | PSNR | 49.907 | 38.074 | 41.181 |
| BSDS500 | SSIM | 0.9989 | 0.9839 | 0.9918 |

(b)

Table 4: The performance of the tested CFA learning models for (a) RGB and (b) RGBW configurations.

5.4 Comparison with Alternative CFA Learning Methods

In this analysis, we compare the performance of the proposed HardMax CFA learning module with the state-of-the-art DL-based CFA learning approaches [19, 22], which are summarized in Section 2.1. The approach in [19] learns unconstrained linear weights for each pixel rather than constraining the solution space to binary weights or single-channel selection at each pixel, as in the proposed HardMax method and the Weighted SoftMax in [22]. Learning unconstrained linear weights is expected to provide better performance since it has a much wider optimization space. We nevertheless compared all three approaches over the same training and test dataset combinations. All the alternative joint CFA learning-demosaicing methods were trained and tested as they were proposed in their respective studies. The achieved PSNR and SSIM results can be found in Table 4 for both RGB and RGBW configurations.

Figure 7: Example reconstructed images in the RGB case. From left to right: original images and images reconstructed with the Bayer, Weighted SoftMax, and proposed HardMax CFAs. PSNR values (Bayer / Weighted SoftMax / HardMax): first image (b)-(d) 39.0359 / 39.6394 / 42.5311; second image (f)-(h) 37.7100 / 38.7221 / 39.4920; third image (j)-(l) 38.9645 / 39.7484 / 40.3756; fourth image (n)-(p) 40.2638 / 39.3830 / 40.4574.
Figure 8: A sample image reconstructed with the RGBW CFA as the training progresses, with PSNR values: (a) Epoch 1, 26.9470; (b) Epoch 2, 32.0163; (c) Epoch 4, 36.7329; (d) Epoch 5, 38.8575; (e) Epoch 10, 40.8191; (f) Epoch 15, 41.8566; (g) Epoch 20, 42.0437; (h) Epoch 50, 42.4272.

As expected, the unconstrained linear CFA outperforms the constrained binary CFAs in all compared metrics under both RGB and RGBW configurations: the linear filter in [19] has a larger optimization space and is not bound to selecting a single color channel per pixel. A camera that could acquire, at each pixel, a linear weighted combination of the color channels with weights learned through CFA learning would therefore achieve the best performance. However, most practical cameras are constrained to observe only one color channel per pixel, and for this setting the relevant comparison is between the proposed approach and the Weighted SoftMax in [22]. This comparison can be found under the constrained columns of Table 4: on the Kodak dataset, the proposed HardMax CFA learning approach surpasses the Weighted SoftMax by 1.42 dB in the RGB case and 2.47 dB in the RGBW case, and it also achieves higher SSIM scores on both test datasets.

We illustrate four example test images chosen from the two test datasets in Fig. 7, along with their reconstructions from the fixed Bayer CFA and from the CFAs learned by the Weighted SoftMax and the proposed HardMax approaches. The PSNR value achieved for each individual image is shown in the same figure, and it can be seen that the proposed joint architecture results in higher PSNR values than the fixed Bayer or the Weighted SoftMax-based learned CFAs. In addition to the PSNR metric, visual comparisons can be made in Fig. 7 between the original images and the reconstructions from different CFAs.

5.5 Analysis on the Learned CFA in Training

Having established the high performance of the proposed joint binary CFA learning and demosaicing model, in this analysis we look into the behavior of the model during the training process. We observe the effect of the training by comparing the performance metrics, reconstruction qualities, and learned filters at different epochs as the training progresses. In this test, one instance of the proposed joint model was trained, and its weights were saved after epochs 1, 2, 4, 5, 10, 15, 20, and 50. These saved weights were used with the test datasets to evaluate the change in image reconstruction quality.

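| Dataset | Metric | Epoch 1 | Epoch 2 | Epoch 4 | Epoch 5 | Epoch 10 | Epoch 15 | Epoch 20 | Epoch 50 |
|---|---|---|---|---|---|---|---|---|---|
| Kodak | PSNR | 27.2751 | 31.6592 | 35.9546 | 37.4252 | 39.2274 | 40.0207 | 39.9918 | 40.4209 |
| Kodak | SSIM | 0.8406 | 0.8967 | 0.9591 | 0.9677 | 0.9795 | 0.983 | 0.9833 | 0.984 |
| BSDS500 | PSNR | 26.6396 | 30.2478 | 34.5898 | 36.3041 | 38.2320 | 39.1354 | 39.3217 | 39.8671 |
| BSDS500 | SSIM | 0.8557 | 0.907 | 0.9624 | 0.9737 | 0.9836 | 0.9869 | 0.9878 | 0.9887 |

(a)

| Dataset | Metric | Epoch 1 | Epoch 2 | Epoch 4 | Epoch 5 | Epoch 10 | Epoch 15 | Epoch 20 | Epoch 50 |
|---|---|---|---|---|---|---|---|---|---|
| Kodak | PSNR | 27.9818 | 26.8873 | 37.4772 | 38.5111 | 39.7669 | 39.7566 | 40.6076 | 40.6922 |
| Kodak | SSIM | 0.8312 | 0.874 | 0.9725 | 0.9747 | 0.9819 | 0.9843 | 0.9853 | 0.9858 |
| BSDS500 | PSNR | 27.9691 | 27.0902 | 37.3992 | 38.4566 | 40.0099 | 40.2789 | 41.0271 | 41.3370 |
| BSDS500 | SSIM | 0.8497 | 0.879 | 0.9775 | 0.9803 | 0.9866 | 0.9889 | 0.9899 | 0.9908 |

(b)

Table 5: Performance of the proposed model during training after different numbers of epochs for (a) RGB and (b) RGBW configurations.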

Table 5 shows the performance of the model at different epochs. The results show that both PSNR and SSIM metrics generally improve with increasing epochs and converge to their final values. For example, in the RGB case on the Kodak dataset, the PSNR after a single epoch is approximately 27 dB, it improves to approximately 40 dB after epoch 20, and the final average PSNR at epoch 50 is 40.4 dB. As the training progresses, the model reconstructs the images better while the CFA pattern is also forming. Figure 8 shows an example reconstructed image at different epochs. While the reconstruction artifacts are clearly visible at early epochs, the image reconstruction performance improves after 15 to 20 epochs, and a PSNR of $42.42$ dB is finally achieved after epoch 50.

5.6 Effect of Training Dataset Size on Learned CFAs

In the sixth and final analysis, our goal is to understand the effect of the training data size on the learned CFA and the performance of the demosaicing model. We created training datasets with 250K, 500K, and 1 million samples of $8\times 8$ image blocks. For each dataset size, the proposed model is trained 10 independent times with random weight initialization.

Table 6 shows the mean and variance of the PSNR and SSIM metrics over the Kodak and BSDS500 datasets. As the size of the training dataset increases, both the average PSNR and SSIM metrics improve. Another important observation is that the variance in the achieved PSNR decreases as models are trained over larger datasets, and the SSIM variance largely follows the same trend. This shows that even though each training run might end up with a different learned CFA and demosaicing model, the performances achieved with these models are more consistent, with lower variance, for larger datasets.

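| Dataset | Metric | 1 Mil. Avg | 1 Mil. Var | 500K Avg | 500K Var | 250K Avg | 250K Var |
|---|---|---|---|---|---|---|---|
| Kodak | PSNR | 40.357 | 0.0297 | 39.866 | 0.1423 | 39.100 | 0.2218 |
| Kodak | SSIM | 0.985 | 0.00196 | 0.983 | 0.00139 | 0.97 | 0.00357 |
| BSDS500 | PSNR | 39.843 | 0.1330 | 39.217 | 0.2332 | 38.245 | 0.2967 |
| BSDS500 | SSIM | 0.989 | 0.00138 | 0.988 | 0.00182 | 0.984 | 0.00333 |
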
Table 6: The mean and variance in the PSNR and SSIM metrics over Kodak and BSDS500 test images for various training dataset sizes over 10 independent trainings with random weight initialization.

Figure 9: Example learned CFAs over training dataset sizes of (a) 250K, (b) 500K, (c) 1 million.

One example of a learned CFA for each training dataset size is illustrated in Fig. 9. It can be seen that the CFA learned over the largest training dataset shows more checker-like patterns, sampling each color channel more uniformly. Since each of our independent training runs resulted in a different CFA and demosaicing model, it would not be accurate to state that there is a single optimal CFA. However, our observation is that the variance in both the learned CFAs and the achieved performance in terms of PSNR and SSIM metrics gets lower as the training datasets get larger. Due to limited computational resources, we were able to train our models over a maximum of 1 million image patches. Training the proposed architecture over much larger datasets would be informative on whether the observed trends of increasing performance and decreasing variance continue.

6 Discussion and Future Work

In this paper, we demonstrate a joint architecture that learns a binary color filter array together with a deep neural network for demosaicing. Our results show that the CFAs learned with the proposed architecture result in enhanced reconstruction performance compared to classical fixed CFAs such as Bayer. Since the proposed approach learns to select a single color channel at each pixel, the learned CFAs are practical and physically implementable in digital cameras.

In this section, we would like to discuss the proposed approach and its results in terms of its implications and potential future work. First, the proposed architecture is composed of two submodules: one for learning a binary CFA and the other for reconstructing a full-color image from the sampled CFA output. Both of these modules are learned when we train the architecture jointly. Hence the learned CFA is dependent on the task of the second module, which is demosaicing. Suppose another task such as classification is utilized with a neural network architecture in the second module. In that case, the proposed joint architecture can learn a CFA that would be optimal for that task. Hence the proposed joint architecture allows task-dependent CFA learning which could have other future applications.

The results in Section 5.2 show that a CFA size of $8\times 8$ is optimal for the proposed architecture. Any larger filter size resulted in a reduction in reconstruction quality. This could mean that larger and more complex CFA patterns might not be necessary for higher reconstruction quality. An important point is that any machine learning model depends on its training data and is optimized for that dataset. While this could result in more distinct CFAs for specific applications, it also means that a larger training dataset is required to learn a more generalized CFA. The results in Section 5.6 show that with increased training dataset sizes, even though different CFAs are learned, the average reconstruction performance increases and the variance in the performance gets smaller. We believe that training the proposed architecture over much larger datasets has the potential to lead to more generalized learned CFAs.

An interesting observation is the bias toward the blue channel in the learned CFAs. This finding contrasts with the idea of prioritizing the green channel in hand-designed filters as the more informative channel, especially considering that green is the least selected color in almost all the learned filters. It should be noted that a few filters end up having more red pixels, but green almost always appears as the least selected channel. More extensive comparisons with alternative CFA learning methods and further analyses might lead to more definitive answers.

Future work on this study includes analysis on the effect of noise and cross-talk, implementation and analysis of the HardMax module with neural network models for various computational imaging tasks for learning task-specific CFAs, and potential hardware implementation of the proposed filters for a more realistic analysis.

7 Conclusion

This study presents a binary CFA learning module based on hard thresholding with a deep learning-based demosaicing network in a joint architecture. A measurement learning approach based on gradient adaptation is developed for binary CFA learning, and a demosaicer architecture based on recent DL-based image reconstruction models is learned jointly with it. The proposed model is trained and tested over the Kodak and BSDS500 datasets. Since the proposed approach learns to select a single color channel at each pixel, the learned CFA is easily adaptable to modern commercial cameras. Both RGB and RGBW CFAs can be learned with the proposed approach, and increased reconstruction performance in PSNR and SSIM metrics is achieved compared to both well-known fixed filters such as Bayer and alternative learned binary filters.

References

  • [1] B. Bayer, “Color imaging array,” United States Patent, no. 3971065, 1976.
  • [2] R. Lukac and K. N. Plataniotis, “Color filter arrays: Design and performance analysis,” IEEE Transactions on Consumer Electronics, vol. 51, no. 4, pp. 1260–1267, 2005.
  • [3] T. Seiji, “Color imaging apparatus,” United States Patent, no. 8531563, 2013.
  • [4] K.-L. Chung, T.-H. Chan, and S.-N. Chen, “Effective three-stage demosaicking method for rgbw cfa images using the iterative error-compensation based approach,” Sensors, vol. 20, no. 14, p. 3908, 2020.
  • [5] B. K. Gunturk, J. Glotzbach, Y. Altunbasak, R. W. Schafer, and R. M. Mersereau, “Demosaicking: color filter array interpolation,” IEEE Signal Processing Magazine, vol. 22, no. 1, pp. 44–54, 2005.
  • [6] H. S. Malvar, L.-w. He, and R. Cutler, “High-quality linear interpolation for demosaicing of bayer-patterned color images,” in 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 3, pp. iii–485, IEEE, 2004.
  • [7] R. Lukac, K. N. Plataniotis, D. Hatzinakos, and M. Aleksic, “A novel cost effective demosaicing approach,” IEEE Transactions on Consumer Electronics, vol. 50, no. 1, pp. 256–261, 2004.
  • [8] R. Kimmel, “Demosaicing: Image reconstruction from color ccd samples,” IEEE Transactions on Image Processing, vol. 8, no. 9, pp. 1221–1228, 1999.
  • [9] X. Li, “Demosaicing by successive approximation,” IEEE Transactions on Image Processing, vol. 14, no. 3, pp. 370–379, 2005.
  • [10] R. Lukac, K. N. Plataniotis, D. Hatzinakos, and M. Aleksic, “A new cfa interpolation framework,” Signal Processing, vol. 86, no. 7, pp. 1559–1579, 2006.
  • [11] X. Li, B. Gunturk, and L. Zhang, “Image demosaicing: A systematic survey,” in Visual Communications and Image Processing 2008, vol. 6822, pp. 489–503, SPIE, 2008.
  • [12] M. Safna Asiq and W. Sam Emmanuel, “Colour filter array demosaicking: a brief survey,” The Imaging Science Journal, vol. 66, no. 8, pp. 502–512, 2018.
  • [13] M. Gharbi, G. Chaurasia, S. Paris, and F. Durand, “Deep joint demosaicking and denoising,” ACM Transactions on Graphics (ToG), vol. 35, no. 6, pp. 1–12, 2016.
  • [14] F. de Gioia and L. Fanucci, “Data-driven convolutional model for digital color image demosaicing,” Applied Sciences, vol. 11, no. 21, p. 9975, 2021.
  • [15] B. Park and J. Jeong, “Color filter array demosaicking using densely connected residual network,” IEEE Access, vol. 7, pp. 128076–128085, 2019.
  • [16] D. S. Tan, W.-Y. Chen, and K.-L. Hua, “Deepdemosaicking: Adaptive image demosaicking via multiple deep fully convolutional networks,” IEEE Transactions on Image Processing, vol. 27, no. 5, pp. 2408–2419, 2018.
  • [17] F. Kokkinos and S. Lefkimmiatis, “Iterative joint image demosaicking and denoising using a residual denoising network,” IEEE Transactions on Image Processing, vol. 28, no. 8, pp. 4177–4188, 2019.
  • [18] J. Tang, J. Li, and P. Tan, “Demosaicing by differentiable deep restoration,” Applied Sciences, vol. 11, no. 4, p. 1649, 2021.
  • [19] B. Henz, E. S. Gastal, and M. M. Oliveira, “Deep joint design of color filter arrays and demosaicing,” in Computer Graphics Forum, vol. 37, pp. 389–399, Wiley Online Library, 2018.
  • [20] L. Bian, Y. Wang, and J. Zhang, “Generalized msfa engineering with structural and adaptive nonlocal demosaicing,” IEEE Transactions on Image Processing, vol. 30, pp. 7867–7877, 2021.
  • [21] F. Zhang and C. Bai, “Jointly learning spectral sensitivity functions and demosaicking via deep networks,” in 2021 3rd International Conference on Advances in Computer Technology, Information Science and Communication (CTISC), pp. 404–411, IEEE, 2021.
  • [22] A. Chakrabarti, “Learning sensor multiplexing design through back-propagation,” Advances in Neural Information Processing Systems, vol. 29, 2016.
  • [23] R. Jacome, J. Bacca, and H. Arguello, “Deep-fusion: An end-to-end approach for compressive spectral image fusion,” in 2021 IEEE International Conference on Image Processing (ICIP), pp. 2903–2907, IEEE, 2021.
  • [24] R. Wu, Y. Li, X. Xie, and Z. Lin, “Optimized multi-spectral filter arrays for spectral reconstruction,” Sensors, vol. 19, no. 13, p. 2905, 2019.
  • [25] T. W. Sawyer, M. Taylor-Williams, R. Tao, R. Xia, C. Williams, and S. E. Bohndiek, “Opti-msfa: a toolbox for generalized design and optimization of multispectral filter arrays,” Optics Express, vol. 30, no. 5, pp. 7591–7611, 2022.
  • [26] K. Li, D. Dai, and L. Van Gool, “Jointly learning band selection and filter array design for hyperspectral imaging,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 6384–6394, 2023.
  • [27] J. Li, C. Bai, Z. Lin, and J. Yu, “Optimized color filter arrays for sparse representation-based demosaicking,” IEEE Transactions on Image Processing, vol. 26, no. 5, pp. 2381–2393, 2017.
  • [28] C. Bai, F. Liu, and J. Li, “Joint learning of rgbw color filter arrays and demosaicking,” Available at SSRN 4753575.
  • [29] H. Arguello, J. Bacca, H. Kariyawasam, E. Vargas, M. Marquez, R. Hettiarachchi, H. Garcia, K. Herath, U. Haputhanthri, B. S. Ahluwalia, et al., “Deep optical coding design in computational imaging,” arXiv preprint arXiv:2207.00164, 2022.
  • [30] J. Bacca, T. Gelvez-Barrera, and H. Arguello, “Deep coded aperture design: An end-to-end approach for computational imaging tasks,” IEEE Transactions on Computational Imaging, vol. 7, pp. 1148–1160, 2021.
  • [31] K. Hirakawa and P. J. Wolfe, “Spatio-spectral color filter array design for enhanced image fidelity,” in 2007 IEEE International Conference on Image Processing, vol. 2, pp. II–81, IEEE, 2007.
  • [32] W. Lu and Y.-P. Tan, “Color filter array demosaicking: new method and performance measures,” IEEE Transactions on Image Processing, vol. 12, no. 10, pp. 1194–1210, 2003.
  • [33] R. Kakarala and Z. Baharav, “Adaptive demosaicing with the principal vector method,” IEEE Transactions on Consumer Electronics, vol. 48, no. 4, pp. 932–937, 2002.
  • [34] J. Portilla, D. Otaduy, and C. Dorronsoro, “Low-complexity linear demosaicing using joint spatial-chromatic image statistics,” in IEEE International Conference on Image Processing 2005, vol. 1, pp. I–61, IEEE, 2005.
  • [35] K. H. Jin, M. T. McCann, E. Froustey, and M. Unser, “Deep convolutional neural network for inverse problems in imaging,” IEEE Transactions on Image Processing, vol. 26, no. 9, pp. 4509–4522, 2017.
  • [36] C. O. Ayna and A. C. Gurbuz, “Learning optimum binary color filter arrays for demosaicing with neural networks,” in Real-Time Image Processing and Deep Learning 2024, vol. 13034, pp. 174–182, SPIE, 2024.
  • [37] Y. Bengio, N. Léonard, and A. Courville, “Estimating or propagating gradients through stochastic neurons for conditional computation,” arXiv preprint arXiv:1308.3432, 2013.
  • [38] J. Feng, J. Chen, Q. Sun, R. Shang, X. Cao, X. Zhang, and L. Jiao, “Convolutional neural network based on bandwise-independent convolution and hard thresholding for hyperspectral band selection,” IEEE Transactions on Cybernetics, vol. 51, no. 9, pp. 4414–4428, 2020.
  • [39] W. Shi, F. Jiang, S. Liu, and D. Zhao, “Image compressed sensing using convolutional neural network,” IEEE Transactions on Image Processing, vol. 29, pp. 375–388, 2019.
  • [40] R. Mdrafi and A. C. Gurbuz, “Joint learning of measurement matrix and signal reconstruction via deep learning,” IEEE Transactions on Computational Imaging, vol. 6, pp. 818–829, 2020.
  • [41] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik, “Contour detection and hierarchical image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 5, pp. 898–916, 2010.
  • [42] R. W. Franzen, “True color Kodak images.”
  • [43] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.