Learning Binary Color Filter Arrays with Trainable Hard Thresholding
Abstract
Color Filter Arrays (CFA) are optical filters in digital cameras that capture specific color channels. Current commercial CFAs are hand-crafted patterns designed around different physical and application-specific considerations. This study proposes a binary CFA learning module based on hard thresholding, trained jointly with a deep learning-based demosaicing network in a single architecture. Unlike most existing learnable CFAs, which learn a channel from the whole color spectrum or linearly combine the available digital colors, this method learns a binary channel selection, resulting in CFAs that are practical and physically implementable in digital cameras. The binary selection adapts the hard thresholding operation to neural networks via a straight-through estimator, and the method is therefore named HardMax. This paper includes the background on the CFA design problem, the description of the HardMax method, and the performance evaluation results. The evaluation covers different demosaicing models, color configurations, and filter sizes, along with a comparison with existing methods on various reconstruction metrics. The proposed approach is tested on the Kodak and BSDS500 datasets and provides higher reconstruction performance than hand-crafted or alternative learned binary filters.
Keywords: color filter array, hard thresholding, measurement learning, straight-through estimator, deep learning, demosaicing
1 Introduction
A digital camera captures an image by exposing its sensor array, in which each sensor corresponds to one pixel in the final image, to the incoming light for a certain amount of time. A Color Filter Array (CFA) is an optical filter placed on a camera sensor array. Sensors alone cannot differentiate between individual colors; CFAs make color capture possible by passing, at each pixel, only the frequency band of the visible light spectrum that corresponds to the selected color. The raw output of the filtered camera sensor array is therefore an image in which each pixel contains the intensity information of only one color channel and lacks the rest.
Virtually all available commercial CFAs are designed by hand with different considerations depending on the camera and environment characteristics. The most commonly used CFA pattern is the Bayer filter [1]. Several other hand-crafted CFAs are also present in specific camera models, such as the Lukac filter [2], Kodak’s CYYM filter, Fujifilm’s X-Trans [3], the CMWY filter, and Compton’s RGBW filter along with Kodak’s RGBW filter variations [4]. The hand-crafted filters used in this study for evaluation are illustrated in Figure 1. The study [2] provides an extensive review and analysis of the importance of CFA design on the final image.
The process following the color filtering operation in digital image acquisition is known as demosaicing, which is estimating the unknown color values in raw camera returns [5]. Common demosaicing algorithms are mainly spatial or frequency domain interpolation-based techniques [6, 7]. These algorithms show variation for specific applications and different CFA types [8, 9, 10]. Detailed reviews of the classical demosaicing algorithms for various CFAs are available in [11, 12]. A unique CFA design requires a dedicated demosaicing algorithm depending on the filter pattern and available information from the captured scene.
With the emergence of machine learning (ML) and neural networks (NN) in computational imaging, various studies that suggest using NNs in demosaicing or joint demosaicing-denoising pipelines [13, 14, 15, 16, 17] have appeared. These approaches work with raw camera return and propose various NN architectures for mapping the inputs to the full-color image. These methods show that enhanced reconstruction quality and computational speed can be achieved with deep learning (DL).
Recent studies applied ML solutions for learning a CFA pattern to address the issue of exploiting the features of natural images for high-quality full-color image reconstruction [18, 19]. Although working in the RGB domain, these solutions learn CFAs that employ the full digital color spectrum; their process learns a linear combination of all three color channels. This approach results in learning a unique color per sensor. Although weighted combinations of colors provide enhanced reconstructions compared to fixed CFAs like Bayer, these learned filters are impractical for physical implementation in commercial cameras where each pixel reads a single color from the digital color configuration. Some other studies assume working in the multispectral domain and learning MultiSpectral Filter Arrays (MSFA) [20, 21]. Although there is active research for building commercial cameras with MSFAs, these cameras are still in the prototype phase due to their high production cost and the raw image formats requiring additional operations to get the usual RGB images as the final product. Section 2.1 includes a more detailed discussion on the applicability issues of MSFAs.
Modern commercial digital cameras use color configurations with a few colors, with most of them opting for the RGB configuration. For this reason, it is necessary to develop learned binary CFAs that utilize only one color channel at each pixel and still provide enhanced image reconstruction compared to hand-crafted CFAs. To the best of our knowledge, only one method presents a way to learn a true binary CFA pattern, in the RGBW configuration [22]. That study applies the SoftMax operation together with a scalar value that increases exponentially over the course of training in order to obtain quasi-binary filter weights from the output weights.
This paper proposes an alternative method for learning a binary CFA in a joint filter-demosaicer architecture. The proposed joint framework is an end-to-end architecture with two modules; the head module learns a constrained binary CFA during training, and the tail module reconstructs a color image from filtered raw camera images. This joint learning approach enforces the learned binary CFA to be optimal for color image demosaicing. The proposed method adapts hard thresholding as band selection operation in the CFA learning module that is compatible with stochastic gradient descent. The CFAs learned with this method can be used in camera sensor arrays without the impracticality concerns since the selection of only one color channel is enforced for each pixel. Our results indicate CFAs learned with this approach provide a higher reconstruction performance than the hand-designed filters and the alternative proposed in [22]. With reference to the hard thresholding as the basis of this process, we named our binary CFA learning module HardMax. The novelties of this study can be described in the following points.
• This study presents an NN model for joint color filtering and demosaicing with a novel binary CFA learning mechanism and an original high-performing demosaicer architecture.
• The proposed CFA learning method (HardMax) is adaptable to different joint DL architectures, allowing us to learn optimum CFAs for different objectives of the architectures following the CFA module, such as reconstruction or classification.
• Unlike other proposed methods, HardMax aims to find an optimum binary CFA so that the learned CFA can be easily applied to commercial digital cameras.
• This study includes the evaluation and analysis of different parameters that affect the learned CFAs, such as filter size, color configuration, and training data size.
The rest of the paper is structured as follows. Section 2 gives a background on the available literature on machine learning-based CFA design. Section 3 describes the proposed CFA learning and demosaicing method along with the training and evaluation procedures. Section 4 describes the dataset, training process, and evaluation. Section 5 presents the results, the obtained performance metrics, and the comparison with the existing approaches. Section 6 includes a discussion of the proposed approach and results, the shortcomings of the study, and the potential road map for future work. Finally, Section 7 draws the conclusions.
2 Background
2.1 Machine Learning in CFA Design
Compared to demosaicing, the volume of studies on the ML-based CFA design problem is small, with the recent literature focusing on MSFAs. Some of these methods are designed for hyperspectral image acquisition [23, 24, 25], while others target commercial digital cameras [20, 21, 26]. However, multispectral image acquisition has several issues that prevent it from being used in commercial digital cameras. The first problem is the high cost of multispectral cameras due to the complexity of their production. The second issue is that a higher number of frequency bands causes the curse of dimensionality for a full-color image. Thirdly, the larger interval of missing channel values corresponds to lower resolution in the final product and more complex demosaicing algorithms. For these reasons, this study focuses on solutions for CFAs that use the available number of colors in digital image (usually three chromatic and one luminance “white” channel).
The lack of available literature on ML-based CFA design is even more striking when the literature search is constrained to the RGB case. There is only a handful of studies for modern ML techniques in CFA design [19, 27]. As an example, Henz et al. [19] introduced an autoencoder architecture as a joint filtering-demosaicing pipeline where the encoder is composed of a 3D tensor with weights for each color channel (three nonnegative unconstrained weight sets for each color channel) plus an unconstrained bias term used in a matrix multiplication with the full-color image in the training process (see Figure 2(a)). This method is included in the evaluation and will be called Linear filter in the rest of the paper for convention. Li et al. [27] propose a CFA design algorithm based on representing CFAs as overcomplete dictionaries that sample an original image and finding the set of dictionary values that minimize the mutual coherence value, for mutual coherence is an important factor in signal reconstruction and smaller coherence corresponds to higher quality reconstruction. In practice, Li’s filter corresponds to a similar linear projection operation as in [19] without a bias term.
Even though the studies [19, 27] define their search domain on the RGB channels, their final product is not a purely binary RGB filter but a filter with a different linear combination of the available channels at each pixel. That is because these studies define their CFA learning process as a linear sum operation on the RGB spectrum. In turn, the unconstrained linear combination of non-negative weights leads to a quasi-infinite number of potential colors for selection. In order to learn a CFA from the available color channels, the CFA learning problem must be redefined as a constrained selection problem rather than a weighted linear sum. There is only one study known to us that adopts such a strategy: Chakrabarti’s method in [22] uses a soft thresholding adaptation of binary selection by applying SoftMax on a set of weights for each color channel per pixel (Figure 2(b)). The weights are scaled by a separate scalar value before the SoftMax operation. The scalar is initialized as a small value and grows as training progresses, pushing the SoftMax output toward the limit values of 0 and 1 and thereby binarizing the elements of the output vector. For convenience, this approach is named Weighted SoftMax in this paper.
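The effect of the growing scalar can be illustrated with a minimal Python sketch that applies a scaled SoftMax to the channel weights of a single pixel. The function name, the example weights, and the scalar values are illustrative assumptions and are not taken from [22].

```python
import math

def scaled_softmax(weights, alpha):
    """SoftMax of alpha * weights, computed in a numerically stable way."""
    scaled = [alpha * w for w in weights]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical per-pixel weights for four channels (R, G, B, W).
pixel_weights = [0.2, 0.5, 0.1, 0.4]

# A small scalar gives a soft mixture; a large scalar approaches one-hot.
for alpha in (1.0, 10.0, 100.0):
    probs = scaled_softmax(pixel_weights, alpha)
    print(alpha, [round(p, 3) for p in probs])
```

With a small scalar the output remains a soft mixture of all channels, whereas a large scalar concentrates almost all of the mass on the channel with the largest weight, which is why the resulting filter is only quasi-binary during training.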
A recent study [28] learns a separate CFA but essentially uses the same CFA learning scheme as in [22]. Several regularization enforced binary coded aperture learning mechanisms are yet to be adapted to color filter selection [29, 30]. This paper aims to introduce a new approach for learning a constrained binary CFA that leads to a higher reconstruction performance than the available solution.
2.2 DL-Based Demosaicing
There is a plethora of demosaicing algorithms designed to work with different CFA filters or address problems in image recovery. The classical approaches can be grouped into three categories according to their strategies [11]; complex interpolation algorithms accounting geometry or optics [31, 6], heuristic algorithms that build the digital image upon iterative recovery (for instance, recovering luminance, then chrominance) [32, 33], or statistical models assuming an interpixel or interchannel dependency (like sparsity-based interpolation methods) [9, 34]. Because of the lack of information about the interpixel and interchannel dependency and the high variability of these values, each approach has its drawbacks and issues.
The DL-based demosaicing methods emerged after neural networks proved to be efficient function approximators in many computer vision applications. It is important to note that the demosaicing problem is an underdetermined reconstruction problem [35], and this fact is exploited heavily by the sparse representation-based demosaicing algorithms that employ dictionary search or other methods developed for compressed sensing (CS) [9, 34]. For the same reason, the available DL-based demosaicing models share similarities with the DL-based image reconstruction and super-resolution models.
Here, we present the three DL-based demosaicing models selected for comparison during the evaluation of our demosaicing model in this study. These models were selected because, just like our proposed framework, all three are used alongside a CFA learning algorithm in their respective studies. Common to all approaches, we assume an image patch of size $p \times p \times c$ as the final output, where $p$ is the size of both the width and the height, and $c$ is the number of channels. This patch is one of the non-overlapping adjacent patches extracted from the original image.
The first demosaicing architecture is found in [19] (Linear) and it was inspired by autoencoders and fully convolutional network architectures. The input is an -sized feature map, which includes number of individual color channels (called submosaics), the number of interpolated submosaics with a k-neighboring kernel, and the monochromatic raw image. The separate inclusion of submosaics and their interpolated version is to help the network to skip the procedure of interpolation and channel separation in order to let the model focus on the reconstruction. The interpolations of submosaics are created and concatenated to the raw image with an interpolation kernel between the filtering (encoding) and demosaicing (decoding) operations. The rest of the demosaicing model is a fully convolutional network with 12 layers in total. The kernel size is fixed to . The first six layers have 64 kernels, while the last six have 128. At the end of the model, the input raw camera sensor matrix is concatenated with the output of the last convolutional layer. Then, this tensor is passed through a final convolutional layer to get the reconstructed full-image patch.
The second demosaicing architecture is adapted from the compared alternative study (Weighted SoftMax) [22]. In this architecture, the input is a -size raw camera measurement of a image patch with its neighboring area, and the output is a -size reconstructed central full-color image block. The reason for using surrounding patches is to use the information around the central patch to reinforce the reconstruction quality and prevent artifacts. The demosaicing architecture includes two parallel streams creating color channel priors. The first stream consists of a fully connected (FC) layer with number of neurons followed by a reshaping operation and a convolutional layer. The purpose of this stream is to extract all the color information from the raw sensor readings. In the original study, the FC layer is preceded by a natural logarithm and succeeded by exponential operation. In the evaluations, this approach caused the training accuracy to fluctuate and even diverge from a solution; thus, we had to discard it. The second stream is an encoder composed of a group of convolutional layers with an number of kernels and ReLU activation function, followed by an FC layer with neurons and a reshape operation. This stream’s purpose is to capture spatial features independent of the color channel to augment the estimation of the absent color values. In our comparison, and values are 128 and 32. The same values were used in this study.
The third demosaicing architecture is based on a DL-based demosaicing model proposed in [14]. In the suggested design, the raw camera return is processed in two different modules to reconstruct two different information: low-frequency color (chrominance) and high-frequency shapes (luminance). The luminance reconstruction network returns a single matrix intended to learn the grayscale information which carries most of the low-level information, such as edges and patterns. This network consists of only one hidden convolutional layer and one output convolutional layer. Since there are no further details on this network, we chose the filter sizes for all layers as , while the hidden layer has 64 filters and the output layer has one filter. On the other hand, the chrominance network is an autoencoder and returns an size output. Like the luminance network, the original paper does not mention the actual architecture. For this reason, this study devised an autoencoder with 3 convolutional layers and 3 deconvolutional in total. The first three layers have 64, 128, and 256 sized kernels with stride respectively. The three deconvolutional layers following this recover the same shape to create the color information matrix. The outputs of the luminance and the chrominance networks are then summed up to create the final reconstruction.
3 Proposed Method
The proposed HardMax layer for learning CFA along with the demosaicer model is an extension of the study presented in [36]. In this section, we will explain the details of the binary CFA learning algorithm, the architecture of the proposed demosaicer model, and the use of both modules in a joint framework.
3.1 Joint Binary CFA Learning and Demosaicing Architecture
An important aspect of DL-based solutions is their capability to combine multiple optimization problems into a single framework. For our framework, we define two separate objectives. The primary objective is to learn a binary CFA, and the secondary objective is to reconstruct a color image from raw sensor inputs with the given binary CFA. The goal is to achieve both objectives together in a single DL architecture that learns binary CFAs resulting in high reconstruction performance. Joint frameworks have the advantage of working in a combined search space, thereby eliminating the risk of overshooting in the overall process while performing the individual tasks as well as separate models do. Different versions of joint CFA and demosaicing solutions have also been used in the studies [19, 22], as detailed in Section 2.
The full neural network model proposed in this study is a combination of the binary CFA learning module detailed in Section 3.2 and the demosaicing model based on image reconstruction networks described in Section 3.3. Figure 3 shows a visual representation of the full model in a single forward propagation. During training, the full model takes an image patch as input and returns a reconstruction of the image patch as output. The first module of the joint model is a binary CFA learning module that acts as a simulated color filter and returns an output that corresponds to raw camera sensor observations. This output is then passed into the demosaicing model for reconstruction of the full-color image. The pseudoimage returned by the first convolution operation of the demosaicing module is refined by consecutive refinement submodules. The complete architecture is learned jointly by minimizing a common loss function based on the mean squared error between the reconstructed and label images.
The main consideration in any end-to-end DL framework is guaranteeing differentiability at every step. Solution search in NNs currently relies on stochastic gradient descent (SGD), an update algorithm that uses gradients to calculate the weight change increments. The differentiability requirement is a critical problem specifically in the binary CFA learning module, as detailed in the next section.
3.2 HardMax: The Binary CFA Learning Module
The problem of adapting the binary selection operation to neural networks originates from the thresholding function being non-differentiable. A solution to the problem can be achieved with the Straight-Through Estimator (STE). An STE is a gradient estimate for functions that are not differentiable at some points of the parameter domain [37]. Assigning an STE to a non-differentiable function begins with observing the function’s behavior in the domain of concern, then looking for a set of candidate functions that produce an output similar to the actual function and are differentiable in that domain. The basic idea of our application is based on the study in [38]. In the case of binary selection, we redefine the process as hard thresholding followed by normalization. In hard thresholding, the gradient is carried only when the weight value passes the threshold value, in which case the gradient is equal to one. An alternative function that behaves exactly the same in the relevant domain is the identity function. Therefore, the identity function’s derivative (i.e., the constant 1) can be selected as the STE of the hard thresholding operation.
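The idea of substituting the identity function's derivative in the backward pass can be sketched in plain Python as follows. The function names and the numeric values are hypothetical, and passing the upstream gradient unchanged to every weight is an assumed variant of the estimator rather than a confirmed detail of the HardMax implementation.

```python
def hard_threshold_forward(weights, threshold):
    """Forward pass: 1.0 where the weight reaches the threshold, else 0.0."""
    return [1.0 if w >= threshold else 0.0 for w in weights]

def hard_threshold_backward_ste(upstream_grads):
    """Backward pass with a straight-through estimator.

    The true derivative of the step is zero almost everywhere, so the
    identity function's derivative (1) is substituted and the upstream
    gradient is passed through unchanged (assumed variant).
    """
    return [g * 1.0 for g in upstream_grads]

weights = [0.3, -0.1, 0.7]  # hypothetical channel weights of one pixel
mask = hard_threshold_forward(weights, threshold=max(weights))
grads = hard_threshold_backward_ste([0.05, -0.02, 0.10])

print(mask)   # [0.0, 0.0, 1.0]
print(grads)  # gradients that would update the real-valued weights
```

Because the backward pass is a multiplication by a constant, the real-valued weights keep receiving updates even though the forward output is strictly binary.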
With a solution for direct implementation of hard thresholding provided, we consider the problem of sampling an image block with $c$ channels and $p \times p$ size. The objective of the learnable discrete CFA module is to select one channel at every pixel and discard the rest, creating $p \times p$ measurements as an output. The HardMax module manages this by initializing a $p \times p \times c$ sized tensor of real-valued weights with uniformly random values drawn from a fixed interval. The weights do not have to be constrained to a specific interval, and their sign is irrelevant, since the information actually encoded by the weights is their ordering relative to each other. The highest weight value at each pixel represents the selected channel and is defined as the threshold value. All pixel weights are passed through thresholding, as shown in Fig. 4. The resulting vector is then binarized to convert the surviving weight value into a binary mask.
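The channel selection described above can be sketched for a small block in plain Python. The helper names, the 4x4 RGB block, the initialization interval of 0 to 1, and the tie-breaking rule are assumptions made only for illustration.

```python
import random

def hardmax_mask(weights):
    """Per-pixel one-hot mask: 1 for the channel with the largest weight.

    `weights` is a nested list indexed as [row][col][channel].
    Ties go to the lowest channel index (an assumption).
    """
    mask = []
    for row in weights:
        mask_row = []
        for pixel in row:
            selected = pixel.index(max(pixel))
            mask_row.append([1.0 if ch == selected else 0.0
                             for ch in range(len(pixel))])
        mask.append(mask_row)
    return mask

def apply_cfa(image, mask):
    """Simulated raw sensor reading: one channel value per pixel."""
    return [[sum(v * m for v, m in zip(px, mk))
             for px, mk in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]

random.seed(0)
size, channels = 4, 3  # hypothetical 4x4 RGB filter
weights = [[[random.uniform(0.0, 1.0) for _ in range(channels)]
            for _ in range(size)] for _ in range(size)]
image = [[[random.random() for _ in range(channels)]
          for _ in range(size)] for _ in range(size)]

mask = hardmax_mask(weights)
raw = apply_cfa(image, mask)  # size x size measurements
```

Each pixel of `raw` holds the value of exactly one channel of `image`, which mirrors what a physical binary CFA would record.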
The overall masking process is simple to implement, and with a preassigned gradient value, the backpropagation operation is computationally efficient with operations in total. For this reason, HardMax takes small computational resources and time both in training and evaluation.
3.3 Demosaicing Module
The HardMax module is designed to be the head of a larger joint network. This makes the HardMax adaptable to any DL-based demosaicing model as a preliminary learnable filter. For this study, we propose a new demosaicing model; a multi-stage convolutional neural network that learns to create a pseudoimage and to refine it through multiple refinement blocks with a cumulative loss function of refinement blocks.
Our demosaicing architecture is illustrated in Figure 5. The main structure is designed with ideas from image reconstruction models [39] and [40]. The demosaicing model takes the -size output of HardMax module (representing the sparse raw camera returns) as input and returns a -size reconstructed RGB image, with the desired reconstruction in the middle. The reason for the larger input-output patches is to counter the artifact effects that occur at the edges of reconstructed patches, as was also used in [22] and yields good results.
The model first converts the raw camera returns into a pseudo image, similar to [19]. But instead of a hand-picked interpolation kernel as in [19], our model creates a pseudo image by applying a convolutional layer with 3 kernels of size with padding for equal-sized output and no bias. The coarse reconstruction is enforced by including the pseudoimage and the consecutive refinement outputs into the loss function as extra terms enforcing them to be similar to the labeled images as well.
The pseudoimage is then pushed into three consecutive refinement submodules, each of which is composed of three convolutional layers followed by ReLU activation functions. The first layer has 128 kernels of size, the second layer has 64 kernels of size and the third layer has 3 kernels of size. All layers apply padding to preserve the image size and include an regularizer. Between each refinement block, there are skip connections added to carry over the gradient. The model’s final layer is another convolutional layer with 3 kernels of size and a ReLU activation function.
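The model’s final loss function is defined as the sum of the mean squared error (MSE) losses of the pseudoimage, each refinement module output, and the final output with respect to the corresponding label image, computed over all training image patches in a training batch, as given in (1)
$$\mathcal{L} = \sum_{i=1}^{N} \left[ \mathrm{MSE}\left(x_i, \hat{x}_i\right) + \sum_{n=0}^{3} \mathrm{MSE}\left(x_i, y_i^{(n)}\right) \right] \qquad (1)$$
where $x_i$ represents the i-th training sample among the $N$ patches in the batch and $y_i^{(n)}$ is the pseudoimage ($n = 0$) or the n-th refinement module output ($n = 1, 2, 3$) as shown in Fig. 5. $\hat{x}_i$ denotes the final reconstructed image output of the model.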
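As a rough illustration of how such a cumulative loss could be accumulated for one training sample, consider the following Python sketch. The function names, the flattened toy patches, and the choice of summing the terms without extra weighting are assumptions, not the authors' implementation.

```python
def mse(a, b):
    """Mean squared error between two equally sized flat lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def cumulative_loss(label, pseudoimage, refinement_outputs, final_output):
    """Sum of MSE terms for the pseudoimage, every refinement output,
    and the final reconstruction, all measured against the same label."""
    loss = mse(label, pseudoimage)
    loss += sum(mse(label, out) for out in refinement_outputs)
    loss += mse(label, final_output)
    return loss

# Hypothetical flattened 2x2 single-channel patch and intermediate outputs.
label = [0.0, 0.5, 1.0, 0.5]
pseudo = [0.2, 0.5, 0.8, 0.5]
refined = [[0.1, 0.5, 0.9, 0.5],
           [0.0, 0.5, 1.0, 0.5],
           [0.0, 0.5, 1.0, 0.5]]
final = [0.0, 0.5, 1.0, 0.5]

loss = cumulative_loss(label, pseudo, refined, final)
print(loss)
```

Including the intermediate outputs in the loss means that the coarse pseudoimage and each refinement stage are individually pulled toward the label, not only the final output.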
In our evaluations, we tested the joint binary CFA learning and demosaicing model with three existing DL-based demosaicing models (explained in Section 2.2) by only replacing the proposed demosaicing model. In this way, we show that the joint architecture allows the proposed binary CFA learning module to be used with different demosaicing models, and compare the effectiveness of demosaicer models on the learned filter and the overall performance. In addition, any future enhanced DL-based demosaicing model can be integrated with the binary CFA learning module both to learn enhanced CFA filters and achieve better reconstruction results.
4 Dataset and Training
4.1 Dataset and Setup
The training dataset used in this study is created from the training and validation images of the BSDS500 dataset [41] totaling 400 images. BSDS500 is a collection of px-sized RGB images originally published for segmentation benchmarks but can be found commonly as training data in other computer vision tasks, including image demosaicing. The training image patches are non-overlapping side-by-side blocks from this dataset, and the total number of the training patches is in the case of patch size.
We used two different test image datasets. The first dataset is composed of 20 images selected from BSDS500’s test images chosen based on their complexity in contour and texture. The second test dataset used in this study is the Kodak dataset [42] which is composed of different images with sizes. Performance metrics of individual datasets were recorded and compared separately.
4.2 Training
Training and evaluation of all models were conducted with an Nvidia A6000 GPU. The source code is implemented in Python 3.10 with the TensorFlow library and its Keras module. The loss function used in all demosaicing models is the mean squared error (MSE) loss as defined in (1). The Adaptive Moment Estimation (Adam) optimizer [43] is used during backpropagation. The decay rates ($\beta_1$ and $\beta_2$) and the division constant ($\epsilon$) of the Adam optimizer were selected as , and respectively. The learning rate follows an exponential decay from 0.0001 to 0.00001.
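An exponential learning-rate decay between the two stated values can be sketched in plain Python as below. The function name and the total number of optimizer steps are hypothetical, since the schedule length is not specified here.

```python
def exponential_decay(step, total_steps, lr_start=1e-4, lr_end=1e-5):
    """Learning rate that decays geometrically from lr_start to lr_end."""
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return lr_start * (lr_end / lr_start) ** fraction

total_steps = 1000  # hypothetical number of optimizer steps
for step in (0, 500, 1000):
    print(step, exponential_decay(step, total_steps))
```

In this form the rate is multiplied by the same factor at every step, so it shrinks by one order of magnitude over the full schedule.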
The source code used for training and testing the proposed HardMax model along with the compared models can be found in the GitHub repository.
4.3 Evaluation
The performance metrics used in the performance evaluation of the proposed approach are the Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index Metric (SSIM). These metrics are well established in the digital image processing literature for evaluating image reconstruction performance. PSNR measures the distortion of the reconstruction compared to the original signal in decibels, and a higher PSNR value indicates a better reconstruction performance. PSNR can be computed as in (2), where the mean squared error ($\mathrm{MSE}$) is computed over the whole reconstructed image and $\mathrm{MAX}_I$ is the maximum possible pixel value.
$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{\mathrm{MAX}_I^2}{\mathrm{MSE}}\right) \qquad (2)$$
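A minimal Python sketch of the PSNR computation is given below. It assumes pixel values normalized to the range 0 to 1 and images flattened into lists; the function name and the toy values are illustrative.

```python
import math

def psnr(original, reconstructed, max_value=1.0):
    """PSNR in dB between two equally sized flat lists of pixel values."""
    mse = sum((o - r) ** 2
              for o, r in zip(original, reconstructed)) / len(original)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

original = [0.0, 0.5, 1.0, 0.5]
reconstructed = [0.1, 0.5, 0.9, 0.5]
print(round(psnr(original, reconstructed), 2))
```

For 8-bit images the same function applies with `max_value=255`.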
SSIM is used to compare a reconstruction with its ground truth signal in terms of localized similarity. An SSIM score takes a value between 0 and 1, where a higher score indicates that the reconstructed image resembles the original image more. Calculation of SSIM involves the means and variances of the original ($\mu_x$, $\sigma_x^2$) and the reconstructed ($\mu_y$, $\sigma_y^2$) images, along with the covariance of both ($\sigma_{xy}$). The division stabilizer constants $c_1$ and $c_2$ in the SSIM definition in (3) are chosen as and respectively.
$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)} \qquad (3)$$
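The SSIM formula can be illustrated with a simplified Python sketch that evaluates it over a single window covering the whole input. Practical implementations average the score over local windows, and the stabilizer constants used here, $(0.01L)^2$ and $(0.03L)^2$ for a dynamic range $L$, are commonly used defaults assumed for illustration rather than values confirmed by this study.

```python
def global_ssim(x, y, dynamic_range=1.0, k1=0.01, k2=0.03):
    """SSIM over a single window covering both inputs (flat lists)."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (k1 * dynamic_range) ** 2  # assumed default stabilizer
    c2 = (k2 * dynamic_range) ** 2  # assumed default stabilizer
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

original = [0.0, 0.5, 1.0, 0.5]
reconstructed = [0.1, 0.5, 0.9, 0.5]
print(global_ssim(original, reconstructed))
```

An image compared with itself yields a score of 1, and the score drops as the means, variances, or covariance of the two signals diverge.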
5 Results
The proposed joint binary CFA and demosaicing model is evaluated under six different scenarios. They are briefly summarized as follows:
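• The first test in Sec. 5.1 compares the proposed demosaicing model with the existing DL-based demosaicing models [19, 22, 14] in the joint framework under both RGB and RGBW configurations.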
• The second test in Sec. 5.2 evaluates the effect of the CFA size in the final reconstruction quality. The learned CFAs for each size under both RGB and RGBW configurations are provided.
• The third test in Sec. 5.3 compares the performance of the CFA learned by the proposed HardMax module with respect to the popular hand-crafted fixed CFAs such as Bayer, Lukac, and others.
- •
• The fifth test in Sec. 5.5 analyzes the convergence of the proposed method and the change of the learned CFAs during the training progress. The learned filters at different epochs and the corresponding image reconstructions and performance metrics are provided.
• The sixth and final test in Sec. 5.6 evaluates the effect of the training data size on both the learned CFAs and the final reconstruction quality under cases from 250 thousand up to 1 million training image patches.
5.1 Comparison of Demosaicing Models
In the first analysis, we compare the performance of the proposed demosaicing model with the existing models [19, 22, 14] presented in Section 2.2. All demosaicing models are used in the joint framework by keeping the HardMax CFA learning module fixed and changing only the demosaicing models. A total of eight different joint models are trained and tested with the combination of four different demosaicing modules for both color configurations (RGB and RGBW). The average PSNR and SSIM metrics for Kodak and BSDS500 test datasets are provided in Table 1, with results for RGB and RGBW configurations presented separately.
The proposed demosaicing architecture shows the highest PSNR and SSIM values for both RGB and RGBW cases. Respectively for Kodak and BSDS500 test datasets, it provides 2.23 dB and 1.81 dB higher PSNR in RGB case, and 1.25 dB and 0.62 dB higher PSNR in RGBW case compared to the second highest demosaicer, the model in [19]. For the SSIM metric, the proposed model and the model in [19] perform better than the other two, with the proposed model surpassing in most of the test cases.
Categorically, we observe that the models with feed-forward fully convolutional architectures (i.e., the proposed model and [19]) perform better than models with parallel architectures that separate the reconstruction of color and texture information, such as [22] and [14]. The proposed approach adapts deep learning-based image reconstruction to the demosaicing problem. Unlike the model in [19], which uses a hand-crafted kernel for an initial reconstruction, our model lets a dedicated convolutional layer learn to reconstruct a pseudoimage, and it uses skip connections and a combined final loss function to reinforce the reconstruction quality. Given the enhanced performance of the proposed demosaicing model, the rest of the evaluation scenarios use the proposed demosaicing model in the joint architecture along with the HardMax CFA learning module.
5.2 Analysis on the CFA Size
In the second analysis, we investigate the effect of the CFA size on reconstruction performance. Four different CFA sizes (4x4, 8x8, 12x12, and 16x16) have been tested with the proposed joint model under both RGB and RGBW color configurations. The average PSNR and SSIM metrics achieved over the test datasets for varying CFA sizes are presented in Table 2, with results for RGB and RGBW configurations presented separately.
Table 2: Average PSNR (dB) and SSIM for different CFA sizes. Top: RGB configuration. Bottom: RGBW configuration.

Dataset | Metric | 4x4 | 8x8 | 12x12 | 16x16
---|---|---|---|---|---
Kodak | PSNR | 38.561 | 40.451 | 39.535 | 38.865
Kodak | SSIM | 0.9777 | 0.9844 | 0.9824 | 0.9813
BSDS500 | PSNR | 37.57 | 40.054 | 38.676 | 37.880
BSDS500 | SSIM | 0.9826 | 0.9897 | 0.9873 | 0.9856

Dataset | Metric | 4x4 | 8x8 | 12x12 | 16x16
---|---|---|---|---|---
Kodak | PSNR | 40.189 | 41.881 | 39.89 | 38.988
Kodak | SSIM | 0.9842 | 0.9871 | 0.9825 | 0.9814
BSDS500 | PSNR | 39.700 | 41.181 | 38.59 | 37.878
BSDS500 | SSIM | 0.9891 | 0.9918 | 0.987 | 0.9848
Our first observation from Table 2 is that the highest PSNR and SSIM performance is achieved for the 8x8 CFA size in both RGB and RGBW color configurations. Using CFA sizes smaller or larger than 8x8 affects the final reconstruction negatively. Another important point is that the existence of a luminance channel generally boosts the final performance, leading to higher PSNR and SSIM results, most clearly for the 4x4 and 8x8 sizes. This means that color interpolation with sparse representation is possible if the luminance information is present.
The learned CFAs for each size are visualized in Fig. 6 for both RGB and RGBW configurations. While the learned filters do not exactly match any existing fixed CFA pattern, they show some level of regularity in sampling a color channel. One observation is that the learned CFAs prioritize the blue channel over the others in most instances. The second most common channel is red for the RGB case and luminance for the RGBW case. This is an interesting contrast to the intuition behind hand-crafted CFAs such as Bayer, Lukac, and X-Trans, where the green channel is prioritized under the assumption that the human eye is more sensitive to green channel information. While the inherent features of digital images might have driven the CFA learning module to pick the blue and white channels to minimize the defined mean squared loss term, we believe it is important to understand the reasons DL models make these selections.
It is also important to note that this study deals specifically with the image reconstruction problem; hence the learned CFAs are objective-specific and optimized for the reconstruction task. The HardMax CFA learning module combined with a DNN model for a different objective may therefore prioritize different channels.
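The channel prioritization discussed above can be quantified by counting how often each channel appears in a CFA pattern. The sketch below is a hedged illustration: `channel_shares` is a hypothetical helper, the Bayer tile orientation is one common layout, and `hypothetical_cfa` is a made-up blue-heavy pattern, not one of the learned filters shown in Fig. 6.

```python
from collections import Counter

def channel_shares(cfa):
    """Return the fraction of pixels assigned to each channel in a CFA pattern.

    `cfa` is a list of rows; each entry is a channel label such as "R", "G", "B", or "W".
    """
    counts = Counter(label for row in cfa for label in row)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# 2x2 Bayer tile: green is sampled twice as often as red or blue.
bayer_tile = [["G", "R"],
              ["B", "G"]]

# Hypothetical 4x4 pattern, NOT a filter learned in the paper; for illustration only.
hypothetical_cfa = [["B", "R", "B", "G"],
                    ["R", "B", "G", "B"],
                    ["B", "G", "B", "R"],
                    ["R", "B", "R", "B"]]

bayer_shares = channel_shares(bayer_tile)
blue_heavy_shares = channel_shares(hypothetical_cfa)
```

Comparing such shares across independently trained filters is one simple way to check whether a preference for a given channel is systematic or specific to a single training run.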
5.3 Comparison of Learned CFAs with Hand-Crafted CFAs
In the third analysis, we compare the performance of the CFAs learned by the proposed HardMax approach with hand-crafted fixed CFAs using the same proposed demosaicer model. For this test, two RGB (Bayer and Lukac) and two RGBW (RGBW and CFZ) filters, shown in Figure 1, were selected. For each fixed CFA, the training image patches are filtered with the respective filter beforehand, and the proposed demosaicing model is trained over the filtered training dataset separately for each CFA type. All cases are evaluated with the 8x8 CFA size. The average PSNR and SSIM metrics achieved over the two test datasets can be found in Table 3.
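Filtering an image with a fixed CFA amounts to repeating the filter tile over the image and keeping one color channel per pixel. The following is a minimal sketch of that step under stated assumptions: images are lists of rows of (R, G, B) tuples, the Bayer tile orientation is one common layout rather than necessarily the one in Figure 1, the white channel of RGBW filters is not handled, and `mosaic` is a hypothetical helper name.

```python
CHANNEL_INDEX = {"R": 0, "G": 1, "B": 2}

def mosaic(image, tile):
    """Sample one color channel per pixel by repeating `tile` over `image`.

    `image` is a list of rows of (R, G, B) tuples; `tile` is a list of rows of
    channel labels. The result is a single-channel image of the same height and width.
    """
    tile_h, tile_w = len(tile), len(tile[0])
    return [
        [pixel[CHANNEL_INDEX[tile[y % tile_h][x % tile_w]]]
         for x, pixel in enumerate(row)]
        for y, row in enumerate(image)
    ]

bayer_tile = [["G", "R"],
              ["B", "G"]]

# 2x4 toy image in which every pixel is (R, G, B) = (1, 2, 3),
# so the sampled channel can be read directly from the output value.
toy_image = [[(1, 2, 3)] * 4 for _ in range(2)]
raw = mosaic(toy_image, bayer_tile)
# raw == [[2, 1, 2, 1], [3, 2, 3, 2]]
```

The demosaicer then receives only `raw` (plus knowledge of the tile) and has to estimate the two missing channels at every pixel.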
Table 3: Average PSNR (dB) and SSIM for hand-crafted and learned CFAs with the proposed demosaicer. Top: RGB filters. Bottom: RGBW filters.

Dataset | Metric | Bayer | Lukac | Proposed
---|---|---|---|---
Kodak | PSNR | 39.036 | 38.683 | 40.451
Kodak | SSIM | 0.9786 | 0.9808 | 0.9844
BSDS500 | PSNR | 39.584 | 38.897 | 40.054
BSDS500 | SSIM | 0.9877 | 0.9873 | 0.9897

Dataset | Metric | RGBW | CFZ | Proposed
---|---|---|---|---
Kodak | PSNR | 41.051 | 38.643 | 41.881
Kodak | SSIM | 0.9866 | 0.9809 | 0.9871
BSDS500 | PSNR | 40.820 | 37.981 | 41.181
BSDS500 | SSIM | 0.9913 | 0.9859 | 0.9918
The results show that the proposed demosaicer with the learned HardMax CFA achieves the highest reconstruction performance in both RGB and RGBW configurations for both test datasets. Since the demosaicing model is the same for each CFA, this analysis shows that a demosaicer trained together with a learned CFA surpasses the same demosaicer trained on hand-crafted CFAs in reconstruction quality. This is important in showing that incorporating filter learning into the demosaicing in a single training pipeline exploits the features of the training data for better reconstruction than a general one-size-fits-all fixed CFA.
Table 4: Average PSNR (dB) and SSIM for alternative CFA learning modules. Top: RGB configuration. Bottom: RGBW configuration.

Dataset | Metric | Linear (unconstrained) | Weighted SoftMax (constrained) | HardMax (constrained)
---|---|---|---|---
Kodak | PSNR | 49.435 | 39.034 | 40.451
Kodak | SSIM | 0.9987 | 0.9782 | 0.9844
BSDS500 | PSNR | 49.426 | 37.819 | 40.054
BSDS500 | SSIM | 0.9991 | 0.9637 | 0.9897

Dataset | Metric | Linear (unconstrained) | Weighted SoftMax (constrained) | HardMax (constrained)
---|---|---|---|---
Kodak | PSNR | 49.389 | 39.407 | 41.881
Kodak | SSIM | 0.9982 | 0.9788 | 0.9871
BSDS500 | PSNR | 49.907 | 38.074 | 41.181
BSDS500 | SSIM | 0.9989 | 0.9839 | 0.9918
5.4 Comparison with Alternative CFA Learning Methods
In this analysis, we compare the performance of the proposed HardMax CFA learning module with the state-of-the-art DL-based CFA learning approaches [19, 22], which are summarized in Section 2.1. The approach in [19] learns unconstrained linear weights for each pixel rather than constraining the solution space to binary weights or a single channel selection at each pixel, as in the proposed HardMax method and the Weighted SoftMax in [22]. Learning unconstrained linear weights is expected to provide better performance since it has a much wider optimization space. Nevertheless, we compare all three approaches over the same training and test dataset combinations. All the alternative joint CFA learning-demosaicing methods were trained and tested as they were proposed in their respective studies. The PSNR and SSIM results can be found in Table 4 for both RGB and RGBW configurations.
The unconstrained linear CFA outperforms the constrained binary CFAs in all compared metrics under both RGB and RGBW configurations. The linear filter in [19] was expected to outperform both binary filters since it has a larger optimization space and is not bound to selecting a single color channel per pixel. If it were possible to construct a camera that acquires the measurement at each pixel as a linearly weighted combination of the color channels, with weights learned through CFA learning, this approach would be optimal. However, most practical cameras are constrained to observe only one color channel per pixel at a time, and for such cameras the relevant comparison is between the proposed approach and the Weighted SoftMax in [22]. This comparison can be found under the constrained columns of Table 4: the proposed HardMax CFA learning approach surpasses the Weighted SoftMax by 1.42 dB in the RGB case and 1.48 dB in the RGBW case, and also achieves higher SSIM scores for both test datasets.
We illustrate four example test images chosen from the two test datasets in Fig. 7, along with their reconstructions from the fixed Bayer CFA and from the CFAs learned by the Weighted SoftMax and the proposed HardMax approach. The PSNR values for each individual image are shown in the same figure, and it can be seen that the proposed joint architecture results in higher PSNR values than the fixed Bayer or Weighted SoftMax based learned CFAs. In addition to the PSNR metric, the original images and the reconstructions from the different CFAs can be compared visually in Fig. 7.
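The difference between the unconstrained and constrained measurement models can be illustrated for a single pixel. The sketch below is a simplified forward-pass illustration only and does not reproduce the exact formulations of [19], [22], or the HardMax module: the per-channel `scores`, the `scale` parameter, and all function names are assumptions, and the gradient surrogate needed to train the hard selection (a straight-through estimator in the proposed method) is not shown.

```python
import math

def linear_measurement(pixel, weights):
    """Measurement as a weighted sum of the color channels of one pixel."""
    return sum(w * c for w, c in zip(weights, pixel))

def softmax(scores, scale=1.0):
    """Numerically stable softmax; a larger `scale` gives a sharper distribution."""
    top = max(scores)
    exps = [math.exp(scale * (s - top)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hard_select(scores):
    """Binary case: a one-hot vector that keeps only the highest-scoring channel."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return [1.0 if i == best else 0.0 for i in range(len(scores))]

pixel = (0.2, 0.6, 0.9)    # (R, G, B) intensities, made up for illustration
scores = [0.1, -0.4, 1.3]  # hypothetical trainable per-channel scores for this pixel

unconstrained = linear_measurement(pixel, scores)           # any real-valued weights
soft = linear_measurement(pixel, softmax(scores))           # weights sum to one
binary = linear_measurement(pixel, hard_select(scores))     # exactly one channel kept
```

Only the last variant corresponds to a filter that passes a single color channel per pixel; the first two mix channels and therefore describe a larger set of possible measurements.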
5.5 Analysis on the Learned CFA in Training
Having established the high performance of the proposed joint binary CFA learning and demosaicing model, in this analysis we look into the behavior of the model during the training process. We observe the effect of training by comparing the performance metrics, reconstruction quality, and learned filters at different epochs as the training progresses. In this test, one instance of the proposed joint model was trained, and its weights were saved after epochs 1, 2, 4, 5, 10, 15, 20, and 50. These saved weights were used with the test datasets to evaluate the change in image reconstruction quality.
Table 5: PSNR (dB) and SSIM on the test datasets at different training epochs.

Dataset | Metric | Epoch 1 | Epoch 2 | Epoch 4 | Epoch 5 | Epoch 10 | Epoch 15 | Epoch 20 | Epoch 50
---|---|---|---|---|---|---|---|---|---
Kodak | PSNR | 27.2751 | 31.6592 | 35.9546 | 37.4252 | 39.2274 | 40.0207 | 39.9918 | 40.4209
Kodak | SSIM | 0.8406 | 0.8967 | 0.9591 | 0.9677 | 0.9795 | 0.983 | 0.9833 | 0.984
BSDS500 | PSNR | 26.6396 | 30.2478 | 34.5898 | 36.3041 | 38.2320 | 39.1354 | 39.3217 | 39.8671
BSDS500 | SSIM | 0.8557 | 0.907 | 0.9624 | 0.9737 | 0.9836 | 0.9869 | 0.9878 | 0.9887

Dataset | Metric | Epoch 1 | Epoch 2 | Epoch 4 | Epoch 5 | Epoch 10 | Epoch 15 | Epoch 20 | Epoch 50
---|---|---|---|---|---|---|---|---|---
Kodak | PSNR | 27.9818 | 26.8873 | 37.4772 | 38.5111 | 39.7669 | 39.7566 | 40.6076 | 40.6922
Kodak | SSIM | 0.8312 | 0.874 | 0.9725 | 0.9747 | 0.9819 | 0.9843 | 0.9853 | 0.9858
BSDS500 | PSNR | 27.9691 | 27.0902 | 37.3992 | 38.4566 | 40.0099 | 40.2789 | 41.0271 | 41.3370
BSDS500 | SSIM | 0.8497 | 0.879 | 0.9775 | 0.9803 | 0.9866 | 0.9889 | 0.9899 | 0.9908
Table 5 shows the performance of the model at different epochs. The results show that both PSNR and SSIM metrics improve overall with increasing epochs and converge to their final values. The PSNR after a single epoch is about 27 dB; by epoch 20 it has improved to approximately 40 dB, and the final average PSNR at epoch 50 is 40.4 dB over the test dataset. As the training progresses, the model does a better job of reconstructing the images while the CFA pattern is also forming. Figure 8 shows an example reconstructed image at different epochs. While the reconstruction artifacts are clearly visible at early epochs, the image reconstruction quality improves after 15 to 20 epochs, and the final PSNR is reached after epoch 50.
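The convergence behavior can be made explicit by computing the PSNR change between consecutive saved checkpoints. The short sketch below uses the Kodak PSNR values copied from the first block of Table 5; the helper name `checkpoint_gains` is hypothetical.

```python
# Kodak PSNR values (dB) copied from the first block of Table 5.
kodak_psnr = {1: 27.2751, 2: 31.6592, 4: 35.9546, 5: 37.4252,
              10: 39.2274, 15: 40.0207, 20: 39.9918, 50: 40.4209}

def checkpoint_gains(psnr_by_epoch):
    """Return (previous_epoch, epoch, gain_dB) for consecutive saved checkpoints."""
    epochs = sorted(psnr_by_epoch)
    return [(a, b, round(psnr_by_epoch[b] - psnr_by_epoch[a], 4))
            for a, b in zip(epochs, epochs[1:])]

gains = checkpoint_gains(kodak_psnr)
for prev_epoch, epoch, gain in gains:
    print(f"epoch {prev_epoch:>2} -> {epoch:>2}: {gain:+.4f} dB")
```

Most of the improvement occurs within the first ten epochs (more than 4 dB for each of the first two intervals), while later checkpoints differ by less than 1 dB, including a small decrease between epochs 15 and 20.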
5.6 Effect of Training Dataset Size on Learned CFAs
In the sixth and final analysis, our goal is to understand the effect of the training dataset size on the learned CFA and the performance of the demosaicing model. We created training datasets with 250K, 500K, and 1 million image-block samples. For each dataset size, the proposed model is trained 10 independent times with random weight initialization.
Table 6 shows the mean and variance in the PSNR and SSIM metrics over Kodak and BSDS500 datasets. As the size of the training dataset increases, both the average PSNR and SSIM metrics improve. Another important observation is that the variance in the achieved performances decreases as models are trained over larger datasets. This shows that even though each training might end up with a different learned CFA and a demosaicing model, the achieved performances with these models are mostly consistent with lower variance in larger datasets.
Table 6: Mean and variance of the PSNR (dB) and SSIM metrics over independent trainings for each training dataset size.

Dataset | Metric | 1 Mil. Avg | 1 Mil. Var | 500K Avg | 500K Var | 250K Avg | 250K Var
---|---|---|---|---|---|---|---
Kodak | PSNR | 40.357 | 0.0297 | 39.866 | 0.1423 | 39.100 | 0.2218
Kodak | SSIM | 0.985 | 0.00196 | 0.983 | 0.00139 | 0.97 | 0.00357
BSDS500 | PSNR | 39.843 | 0.1330 | 39.217 | 0.2332 | 38.245 | 0.2967
BSDS500 | SSIM | 0.989 | 0.00138 | 0.988 | 0.00182 | 0.984 | 0.00333
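The mean and variance statistics reported in Table 6 can be computed from per-run scores as sketched below. The run values are hypothetical and are not results from this study; the use of the sample variance (division by n - 1) is also an assumption, since the text does not state which estimator was used.

```python
import statistics

def summarize_runs(scores):
    """Return (mean, sample variance) of one metric over independent training runs."""
    return statistics.mean(scores), statistics.variance(scores)

# Hypothetical PSNR values (dB) from 10 independent runs; NOT results from the paper.
runs_small_dataset = [38.4, 39.6, 39.0, 38.7, 39.9, 38.9, 39.5, 38.5, 39.3, 39.2]
runs_large_dataset = [40.2, 40.5, 40.3, 40.4, 40.1, 40.6, 40.3, 40.4, 40.2, 40.5]

mean_small, var_small = summarize_runs(runs_small_dataset)
mean_large, var_large = summarize_runs(runs_large_dataset)
```

With only 10 runs per dataset size, the choice between `statistics.variance` and `statistics.pvariance` changes the reported value by a factor of 10/9, so it is worth stating explicitly when comparing against other work.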
One example of a learned CFA for each training dataset size is illustrated in Fig. 9. It can be seen that the CFA learned over the largest training dataset shows more checker-like patterns, sampling each color channel more uniformly. Since each independent training run resulted in a different CFA and demosaicing model, it is not appropriate to state that there is only one optimal CFA. However, our observation is that the variance in both the learned CFAs and the achieved PSNR and SSIM performance decreases as the training datasets get larger. Due to limited computational resources, we were able to train our models over a maximum of 1 million image patches. Training the proposed architecture over much larger datasets would show whether the observed trends of increasing performance and decreasing variance continue.
6 Discussion and Future Work
In this paper, we demonstrate a joint architecture that learns a binary color filter array together with a deep neural network for demosaicing. Our results show that the CFAs learned with the proposed architecture result in enhanced reconstruction performance compared to classical fixed CFAs such as Bayer. Since the proposed approach learns to select a single color channel at each pixel, the learned CFAs are practical and physically implementable in digital cameras.
In this section, we would like to discuss the proposed approach and its results in terms of its implications and potential future work. First, the proposed architecture is composed of two submodules: one for learning a binary CFA and the other for reconstructing a full-color image from the sampled CFA output. Both of these modules are learned when we train the architecture jointly. Hence the learned CFA is dependent on the task of the second module, which is demosaicing. Suppose another task such as classification is utilized with a neural network architecture in the second module. In that case, the proposed joint architecture can learn a CFA that would be optimal for that task. Hence the proposed joint architecture allows task-dependent CFA learning which could have other future applications.
The results from Section 5.2 show that an optimal CFA size of 8x8 is observed for the proposed architecture. Any larger filter size resulted in a reduction in reconstruction quality. This could mean that larger and more complex CFA patterns might not be necessary for higher reconstruction quality. An important point is that any machine learning model depends on its training data and is optimal for that dataset. While this could result in more distinct CFAs for specific applications, it also means that a larger training dataset is required to learn a more generalized CFA. The results in Section 5.6 show that with increased training dataset sizes, even though different CFAs are learned, the average reconstruction performance increases and the variance in the performance gets smaller. We believe that training the proposed architecture over much larger datasets has the potential to lead to more generalized learned CFAs.
An interesting observation is the bias toward the blue channel in the learned CFAs. This finding contrasts with the idea of prioritizing the green channel in hand-designed filters as the more informative channel, especially taking into account that green is the least selected color in almost all the learned filters. It is worth noting that a few filters end up having more red pixels, but green almost always appears as the least selected channel. More extensive comparisons with alternative CFA learning methods and further analyses might lead to more definitive answers.
Future work on this study includes analysis on the effect of noise and cross-talk, implementation and analysis of the HardMax module with neural network models for various computational imaging tasks for learning task-specific CFAs, and potential hardware implementation of the proposed filters for a more realistic analysis.
7 Conclusion
This study presents a binary CFA learning module based on hard thresholding with a deep learning-based demosaicing network in a joint architecture. While a measurement learning approach based on gradient adaptation is developed for binary CFA learning, a demosaicer architecture based on novel DL-based image reconstruction models is jointly learned. The proposed model is trained and tested over the Kodak and BSDS500 datasets. Since the proposed approach learns to select a single color channel at each pixel, the learned CFA is easily adaptable to modern commercial cameras. Both RGB and RGBW CFAs can be learned with the proposed approach, and increased reconstruction performance in PSNR and SSIM metrics is achieved compared to both well-known fixed filters such as Bayer and alternative learned binary filters.
References
- [1] B. Bayer, “Color imaging array,” United States Patent, no. 3971065, 1976.
- [2] R. Lukac and K. N. Plataniotis, “Color filter arrays: Design and performance analysis,” IEEE Transactions on Consumer Electronics, vol. 51, no. 4, pp. 1260–1267, 2005.
- [3] T. Seiji, “Color imaging apparatus,” United States Patent, no. 8531563, 2013.
- [4] K.-L. Chung, T.-H. Chan, and S.-N. Chen, “Effective three-stage demosaicking method for RGBW CFA images using the iterative error-compensation based approach,” Sensors, vol. 20, no. 14, p. 3908, 2020.
- [5] B. K. Gunturk, J. Glotzbach, Y. Altunbasak, R. W. Schafer, and R. M. Mersereau, “Demosaicking: color filter array interpolation,” IEEE Signal Processing Magazine, vol. 22, no. 1, pp. 44–54, 2005.
- [6] H. S. Malvar, L.-w. He, and R. Cutler, “High-quality linear interpolation for demosaicing of Bayer-patterned color images,” in 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 3, pp. iii–485, IEEE, 2004.
- [7] R. Lukac, K. N. Plataniotis, D. Hatzinakos, and M. Aleksic, “A novel cost effective demosaicing approach,” IEEE Transactions on Consumer Electronics, vol. 50, no. 1, pp. 256–261, 2004.
- [8] R. Kimmel, “Demosaicing: Image reconstruction from color CCD samples,” IEEE Transactions on Image Processing, vol. 8, no. 9, pp. 1221–1228, 1999.
- [9] X. Li, “Demosaicing by successive approximation,” IEEE Transactions on Image Processing, vol. 14, no. 3, pp. 370–379, 2005.
- [10] R. Lukac, K. N. Plataniotis, D. Hatzinakos, and M. Aleksic, “A new CFA interpolation framework,” Signal Processing, vol. 86, no. 7, pp. 1559–1579, 2006.
- [11] X. Li, B. Gunturk, and L. Zhang, “Image demosaicing: A systematic survey,” in Visual Communications and Image Processing 2008, vol. 6822, pp. 489–503, SPIE, 2008.
- [12] M. Safna Asiq and W. Sam Emmanuel, “Colour filter array demosaicking: a brief survey,” The Imaging Science Journal, vol. 66, no. 8, pp. 502–512, 2018.
- [13] M. Gharbi, G. Chaurasia, S. Paris, and F. Durand, “Deep joint demosaicking and denoising,” ACM Transactions on Graphics (ToG), vol. 35, no. 6, pp. 1–12, 2016.
- [14] F. de Gioia and L. Fanucci, “Data-driven convolutional model for digital color image demosaicing,” Applied Sciences, vol. 11, no. 21, p. 9975, 2021.
- [15] B. Park and J. Jeong, “Color filter array demosaicking using densely connected residual network,” IEEE Access, vol. 7, pp. 128076–128085, 2019.
- [16] D. S. Tan, W.-Y. Chen, and K.-L. Hua, “DeepDemosaicking: Adaptive image demosaicking via multiple deep fully convolutional networks,” IEEE Transactions on Image Processing, vol. 27, no. 5, pp. 2408–2419, 2018.
- [17] F. Kokkinos and S. Lefkimmiatis, “Iterative joint image demosaicking and denoising using a residual denoising network,” IEEE Transactions on Image Processing, vol. 28, no. 8, pp. 4177–4188, 2019.
- [18] J. Tang, J. Li, and P. Tan, “Demosaicing by differentiable deep restoration,” Applied Sciences, vol. 11, no. 4, p. 1649, 2021.
- [19] B. Henz, E. S. Gastal, and M. M. Oliveira, “Deep joint design of color filter arrays and demosaicing,” in Computer Graphics Forum, vol. 37, pp. 389–399, Wiley Online Library, 2018.
- [20] L. Bian, Y. Wang, and J. Zhang, “Generalized MSFA engineering with structural and adaptive nonlocal demosaicing,” IEEE Transactions on Image Processing, vol. 30, pp. 7867–7877, 2021.
- [21] F. Zhang and C. Bai, “Jointly learning spectral sensitivity functions and demosaicking via deep networks,” in 2021 3rd International Conference on Advances in Computer Technology, Information Science and Communication (CTISC), pp. 404–411, IEEE, 2021.
- [22] A. Chakrabarti, “Learning sensor multiplexing design through back-propagation,” Advances in Neural Information Processing Systems, vol. 29, 2016.
- [23] R. Jacome, J. Bacca, and H. Arguello, “Deep-fusion: An end-to-end approach for compressive spectral image fusion,” in 2021 IEEE International Conference on Image Processing (ICIP), pp. 2903–2907, IEEE, 2021.
- [24] R. Wu, Y. Li, X. Xie, and Z. Lin, “Optimized multi-spectral filter arrays for spectral reconstruction,” Sensors, vol. 19, no. 13, p. 2905, 2019.
- [25] T. W. Sawyer, M. Taylor-Williams, R. Tao, R. Xia, C. Williams, and S. E. Bohndiek, “Opti-MSFA: a toolbox for generalized design and optimization of multispectral filter arrays,” Optics Express, vol. 30, no. 5, pp. 7591–7611, 2022.
- [26] K. Li, D. Dai, and L. Van Gool, “Jointly learning band selection and filter array design for hyperspectral imaging,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 6384–6394, 2023.
- [27] J. Li, C. Bai, Z. Lin, and J. Yu, “Optimized color filter arrays for sparse representation-based demosaicking,” IEEE Transactions on Image Processing, vol. 26, no. 5, pp. 2381–2393, 2017.
- [28] C. Bai, F. Liu, and J. Li, “Joint learning of RGBW color filter arrays and demosaicking,” Available at SSRN 4753575.
- [29] H. Arguello, J. Bacca, H. Kariyawasam, E. Vargas, M. Marquez, R. Hettiarachchi, H. Garcia, K. Herath, U. Haputhanthri, B. S. Ahluwalia, et al., “Deep optical coding design in computational imaging,” arXiv preprint arXiv:2207.00164, 2022.
- [30] J. Bacca, T. Gelvez-Barrera, and H. Arguello, “Deep coded aperture design: An end-to-end approach for computational imaging tasks,” IEEE Transactions on Computational Imaging, vol. 7, pp. 1148–1160, 2021.
- [31] K. Hirakawa and P. J. Wolfe, “Spatio-spectral color filter array design for enhanced image fidelity,” in 2007 IEEE International Conference on Image Processing, vol. 2, pp. II–81, IEEE, 2007.
- [32] W. Lu and Y.-P. Tan, “Color filter array demosaicking: new method and performance measures,” IEEE Transactions on Image Processing, vol. 12, no. 10, pp. 1194–1210, 2003.
- [33] R. Kakarala and Z. Baharav, “Adaptive demosaicing with the principal vector method,” IEEE Transactions on Consumer Electronics, vol. 48, no. 4, pp. 932–937, 2002.
- [34] J. Portilla, D. Otaduy, and C. Dorronsoro, “Low-complexity linear demosaicing using joint spatial-chromatic image statistics,” in IEEE International Conference on Image Processing 2005, vol. 1, pp. I–61, IEEE, 2005.
- [35] K. H. Jin, M. T. McCann, E. Froustey, and M. Unser, “Deep convolutional neural network for inverse problems in imaging,” IEEE Transactions on Image Processing, vol. 26, no. 9, pp. 4509–4522, 2017.
- [36] C. O. Ayna and A. C. Gurbuz, “Learning optimum binary color filter arrays for demosaicing with neural networks,” in Real-Time Image Processing and Deep Learning 2024, vol. 13034, pp. 174–182, SPIE, 2024.
- [37] Y. Bengio, N. Léonard, and A. Courville, “Estimating or propagating gradients through stochastic neurons for conditional computation,” arXiv preprint arXiv:1308.3432, 2013.
- [38] J. Feng, J. Chen, Q. Sun, R. Shang, X. Cao, X. Zhang, and L. Jiao, “Convolutional neural network based on bandwise-independent convolution and hard thresholding for hyperspectral band selection,” IEEE Transactions on Cybernetics, vol. 51, no. 9, pp. 4414–4428, 2020.
- [39] W. Shi, F. Jiang, S. Liu, and D. Zhao, “Image compressed sensing using convolutional neural network,” IEEE Transactions on Image Processing, vol. 29, pp. 375–388, 2019.
- [40] R. Mdrafi and A. C. Gurbuz, “Joint learning of measurement matrix and signal reconstruction via deep learning,” IEEE Transactions on Computational Imaging, vol. 6, pp. 818–829, 2020.
- [41] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik, “Contour detection and hierarchical image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 5, pp. 898–916, 2010.
- [42] R. W. Franzen, “True color Kodak images.”
- [43] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.