1. Introduction
Texture is usually divided into two categories: tactile and visual. Tactile texture refers to the tangible feel obtained from touching a surface, while visual texture is the impression an observer obtains from perceiving a surface. Visual texture is related to local spatial variations of simple stimuli such as color, orientation, and intensity in an image [
1]. This work focuses on visual textures and investigates different methods to classify them.
Texture analysis and classification play a major role in several applications related to material appearance and surface recognition. Examples of successful use of texture can be found in multiple fields such as food science [
2], defect detection [
3], and medicine [
4,
5]. In computer graphics, texture is a fundamental feature to allow faithful rendering of material, and it is important to understand how humans perceive texture and how it can be processed, rendered, simulated, or even reproduced [
6,
7]. Generally, any application that includes texture will benefit from better and more accurate texture classification and analysis algorithms.
Most texture feature extraction approaches focus on the images produced by color cameras, but some research showed the benefits of additional information for texture analysis, such as adding more spectral bands [
8]. Studies investigated the effectiveness of using multi- or hyper-spectral imaging devices for texture analysis as they add more information [
8,
9], but these devices come with additional acquisition complexity and an increase in processing time due to the extra information.
Several imaging techniques allow the capture of spectral images [
10]. Spectral image acquisition results from sampling the scene along three axes: spatial, spectral, and temporal. Most techniques capture images sequentially, with the main drawbacks of requiring moving parts in the camera and of producing artifacts when the camera or the object moves.
To solve this problem, the Spectral Filter Array (SFA) [11] was introduced. This technology uses the same concept as the Color Filter Array (CFA) commonly used in color cameras. Like a CFA camera, an SFA camera is a single-sensor device with different spectrally selective filters in front of each pixel, each measuring a specific band of the electromagnetic spectrum. The number of bands depends on the pattern used for the SFA. The pattern design is very important since it impacts the full-resolution image reconstruction and defines how spectrally selective the camera is. Examples of patterns can be found in the literature, e.g., [
12,
13,
14], and which kind of pattern is the best one is still an open question.
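To make the mosaic concept concrete, the following Python sketch (with a hypothetical 3 × 3 pattern of band indices, not the layout of any specific commercial sensor) shows how a raw SFA image maps each pixel position to a spectral band:

```python
import numpy as np

# Hypothetical 3x3 SFA pattern (band indices 0-8); an actual sensor layout
# differs -- this only illustrates how a mosaic maps pixels to bands.
PATTERN = np.arange(9).reshape(3, 3)

def band_of_pixel(row, col, pattern=PATTERN):
    """Return the spectral band sampled at pixel (row, col) of the raw image."""
    h, w = pattern.shape
    return int(pattern[row % h, col % w])

def band_mask(shape, band, pattern=PATTERN):
    """Boolean mask of raw-image pixels that sample the given band."""
    rows = np.arange(shape[0])[:, None]
    cols = np.arange(shape[1])[None, :]
    return pattern[rows % pattern.shape[0], cols % pattern.shape[1]] == band
```

Note that with a K-band pattern, each band is present at only 1/K of the pixel positions, which is exactly the spatial resolution sacrificed by the SFA.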
The estimation of the missing bands at each pixel is called demosaicing. Indeed, the SFA sacrifices spatial resolution to gain spectral and temporal advantages, and the spatial accuracy thus depends strongly on the demosaicing process. Demosaicing can be considered a specific case of interpolation, and a huge body of literature investigates how to conduct this spatial resolution reconstruction. This work focuses on performing texture classification from raw SFA images without performing demosaicing. This allows for a faster and simpler method and has the advantage of utilizing the correlation between the different bands without extra computation. This is relevant since the demosaicing process does not add information to the captured data. Additionally, in this work, we focus on testing the effectiveness of deep learning/convolutional neural network (CNN)-based methods for SFA texture classification directly on the raw image, on how they generalize under variations of illumination conditions and exposure time, and on how they compare with classical approaches based on the LBP method.
In
Section 2, we will describe related work on texture classification and some of the research done in this area. In
Section 3, we will describe the datasets and methods used during our experimentation setup. Finally, in
Section 4, we will present the results we obtained and the conclusions we drew from the conducted experiments.
2. Related Work
During the initial era of digital image processing, most of the literature proposed to compute texture features on a single-channel image, such as a luminance channel. With the generalization of color images, researchers showed that the quality of texture features extracted from luminance was improved by considering the color channels and the distribution of pixel values in a color space, in addition to the spatial arrangement of colors in the image plane [15]. This increased the performance and the quality of the extracted features, but also increased the computational time and the number of characteristic features. As most methods today utilize color images, new methods were created, and former methods like the Local Binary Pattern (LBP) were extended to color images. Additionally, different LBP variants were introduced that are more robust to illumination change and noise [
16]. The success of learning-based methods, especially deep learning and CNN, on different computer vision tasks, such as image classification or enhancement, encouraged researchers to adapt them to more vision and perception tasks, including texture analysis and classification [
17]. One current limitation of learning-based methods is the availability of texture datasets, since model performance is directly connected to the quality of the dataset and the variety of textures it includes. More effort was invested in studying the effect of extended spectral information on texture analysis. Khan et al. [
9] created texture data with high resolution in both the spectral and spatial domains to study texture analysis, utilizing a hyper-spectral imaging device. They also analyzed the effect of using a higher number of bands and showed the importance of methods that use the correlation between the different bands rather than depending only on the spatial information of the data. Similarly, Conni et al. [
8] studied the effect of the number of bands on the performance of different texture analysis methods. With the performance enhancement from using more bands in texture analysis and classification, more methods were developed, and other methods were adjusted to be used with hyperspectral and multispectral texture data such as k-band LBP [
9], CNN [
18], and Co-occurrence matrix [
19].
Methods that depend on color images generally depend on the demosaiced image. The demosaicing process was designed to produce a standardized color image format for storage, communication, and visualization purposes, so they often use filtering to avoid visible color artifacts, and many of the efficient methods rely on the frequency domain. These methods tend to alter the local texture information of the image, which reduces the texture information we can obtain from the image. Losson et al. [
20] investigated this problem and proposed an LBP variant that works directly on the CFA image. They showed that this approach produces better results with less complexity, demonstrating that applying texture classification directly on the raw CFA image is a relevant approach. However, it comes with a lack of generalization since the algorithm needs to be adapted to each mosaic pattern. Mihoubi et al. [
21] extended this concept to the spectral domain and developed an LBP method that works directly on the SFA. Their experiments showed that the method outperforms most methods that require demosaicing, with a much lower computational time.
Very recently, Amziane et al. [
22] proposed a CNN-based network for texture classification and analysis. Their model consists of three different CNN blocks with max pooling in between and a fully connected network at the end for texture classification. Their model works directly with the MSFA array without demosaicing. Their experimentation is based on a simulated dataset using the HyTexiLa dataset [
9] by simulating two different SFA patterns without any variation in illumination and without capturing artifacts or environmental effects. This work was published after we finalized our experimental work, so we have not taken it into account while designing our research.
Our work also investigates the performance of a CNN architecture for texture classification on the same texture data. Unlike Amziane et al. [
22], we do not perform simulation, and we collected our own SFA dataset with an actual sensor, SILIOS CMS-C [
23], under five different illuminations and three different exposure times. In addition, our dataset includes capture artifacts that occur in real applications, such as lens distortion, out-of-focus regions, and vignetting. The addition of different illuminations and exposures allowed us to investigate the robustness of our CNN-based model to changes in illumination and brightness and showed the ability of our method to work on illuminations and exposures different from the training data. We did not include the results from Amziane et al. [
22] as the dataset used is different (simulated data) and the code for their method is not available to train it on our SFA dataset. However, we hope the release of our dataset will encourage the use of real SFA data for texture classification methods and will create a new benchmark that makes it possible to compare the different methods developed, giving better insight into how they behave in actual capturing environments. Additionally, we visualize the saliency map of our model to show which parts of the texture the model focuses on to make its classification decision, and we show that our method can differentiate between the patterns and attributes of different textures.
3. Methodology
3.1. Dataset
Our experimentation is based on two different texture image datasets of the same materials. The first dataset is HyTexiLa [
9]. This dataset is a high-resolution hyper-spectral dataset of textured materials from 5 different categories. The dataset was collected using the HySpex VNIR-1800 [24], a line-scanning hyper-spectral imaging device. The dataset consists of 112 textured materials, 65 of which are textiles. The spatial resolution of the dataset is 1024 by 1024, with a spectral resolution of 186 bands in the range [400–1000] nm, so the dataset has very high resolution in both spectral and spatial dimensions. The dataset was collected from close-up texture samples, so it is very detailed. The captured data were transformed to reflectance data, so they are illumination invariant. For each texture sample, only one capture was taken. An image of the acquisition setup is shown in
Figure 1. For this dataset, we only focused on the textile texture materials.
The second dataset is a new dataset of the same textile materials as in HyTexiLa, which we collected using the SILIOS CMS-C SFA camera [
23]. The sensor used has a spatial resolution of 1280 by 1024 and captures nine spectral bands from the range [430–700] nm. The SFA pattern and the spectral bands of the SILIOS CMS-C sensor are shown in
Figure 2. This dataset is much less detailed than the first and corresponds to what can actually be acquired in the field with today's technologies. The dataset was captured under five different lighting conditions within a viewing booth (illuminant A, cool white, daylight, horizon, and TL84) and with three different exposure times (20, 50, and 70 ms) for each lighting condition, which resulted in 15 different captures for each texture sample. For this dataset, we only have one capture for each texture under a specific lighting condition and exposure time; additionally, the reader will notice parts of the image that are not sharp or in focus, noise, and other artifacts that can occur during real-world capture. In addition, the texture images are not at a similar scale to the HyTexiLa dataset (the HyTexiLa capture is zoomed in, while the new dataset has a much wider field of view). Differences between the two dataset images are shown in
Figure 3. The data will be made available as
Supplementary Materials.
For both datasets, under specific capturing conditions we only have one capture for each texture class, so in order to use the datasets for training and evaluation, we needed to split each capture into small patches. This choice resulted in 25 image patches for each texture class under specific capturing conditions; in the case of the new dataset, including all the lighting conditions and exposure times yields 375 image patches per class. This dataset size is very small, especially for deep learning-based methods to generalize. We later show how we tried to overcome this issue and how the deep learning-based methods performed with this small amount of data.
3.2. Hyper-Spectral Setup
We first test the performance of texture classification in ideal conditions on the HyTexiLa data. This performance sets our desired goal when working with real SFA data: the ideal algorithm would give the same performance as in this ideal case under different environmental conditions. In the HyTexiLa paper, the authors tested different LBP algorithms extended to hyperspectral LBP, which can be applied to the K channels of spectral images. They also tested the effect of the number of bands on the performance of these algorithms. Their experimentation showed that the Opponent Band LBP (OBLBP) was the best-performing algorithm and that the performance stabilized after 10 bands. We followed their experimental setup for their best-performing model, and we recreated their results.
For the preparation of the dataset, we selected 10 bands from the 186 bands of HyTexiLa. The bands we selected are equally spaced and cover the full spectral range [400–1000] nm. For each texture material, we split its image of size 1024 × 1024 × 10 (number of bands we selected) into 25 non-overlapping patches of size 204 × 204 × 10. Twelve of these patches were randomly selected for training, and the other 13 were used as test images. The OBLBP [25] algorithm was extended to work with $K$ channels. In the original LBP operator, the operator is applied on each channel separately by comparing a central pixel only to its neighbors from the same channel; each channel yields a feature vector, and the $K$ vectors are concatenated to obtain final features of size $K \times 2^P$, where $P$ is the number of neighbors. For OBLBP, one considers the inter-channel correlation by applying the LBP operator to each channel pair: instead of comparing a central pixel to its neighboring pixels from the same channel, one compares it to its neighboring pixels from another channel, which results in a feature vector of size $K^2 \times 2^P$. The mathematical formulation of the extended method is shown in Equation (1):

$$\mathrm{LBP}_{k,l}(p) = \sum_{q=0}^{P-1} s\bigl(I_l(p_q) - I_k(p)\bigr)\, 2^q, \tag{1}$$

where $\mathrm{LBP}_{k,l}(p)$ represents the LBP feature between channels $k$ and $l$ at pixel $p$, $I_k(p)$ represents the central pixel value in channel $k$, $I_l(p_q)$ represents the values of the neighboring pixels $p_q$ in channel $l$, and $s$, given in Equation (2), represents the comparison operation:

$$s(x) = \begin{cases} 1 & \text{if } x \geq 0, \\ 0 & \text{otherwise,} \end{cases} \tag{2}$$
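As an illustration, the OBLBP extraction described by Equations (1) and (2) can be sketched as follows. This is our own minimal implementation (8 neighbors at radius 1, boundary pixels skipped, fixed-length histograms), not the reference code:

```python
import numpy as np

# Sketch of Opponent Band LBP (OBLBP) with 8 neighbors at radius 1, assuming
# `img` is an H x W x K spectral cube; boundary pixels are simply skipped.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def oblbp_features(img, bins=256):
    h, w, k = img.shape
    hists = []
    for kc in range(k):            # channel of the central pixel
        for lc in range(k):        # channel of the neighbors
            codes = np.zeros((h - 2, w - 2), dtype=np.int32)
            center = img[1:-1, 1:-1, kc]
            for q, (dy, dx) in enumerate(OFFSETS):
                neigh = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx, lc]
                codes += (neigh >= center).astype(np.int32) << q
            hists.append(np.bincount(codes.ravel(), minlength=bins))
    return np.concatenate(hists)   # length K^2 * 2^P
```

With $K = 10$ bands and $P = 8$ neighbors this yields the $K^2 \times 2^P$ feature length discussed above.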
For classification, we used a 1-Nearest Neighbor decision rule. It works by measuring the similarity between each test patch's feature histogram and the histograms of the patches in the training set using histogram intersection. Each test patch is assigned the class of the most similar training patch.
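A minimal sketch of this decision rule (the function names are ours):

```python
import numpy as np

# 1-Nearest-Neighbor classification with histogram intersection as the
# similarity measure between feature histograms.
def intersection(h1, h2):
    """Histogram intersection: sum of element-wise minima."""
    return np.minimum(h1, h2).sum()

def classify_1nn(test_feat, train_feats, train_labels):
    """Assign the label of the most similar training histogram."""
    sims = [intersection(test_feat, tf) for tf in train_feats]
    return train_labels[int(np.argmax(sims))]
```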
3.3. SFA Setup
In this setup, we classify the SFA dataset. We used a similar setup as the hyper-spectral data, with some modifications to account for the dataset differences. We work with raw images, so the depth of the image we will use is 1, and the spectral information is included in the pattern of the SFA. Because the field of view of the capture in the SFA data is larger, some parts of the texture images do not include texture, as we can notice in
Figure 3. We took the middle part of the image, of size 510 × 510, as the texture image to work with, and split it into 25 non-overlapping patches of size 102 × 102. Twelve of these patches were randomly selected for training, and the other 13 were used as test images. Even though the patches in the SFA data are smaller than those in the HyTexiLa data, each SFA patch covers a larger field of view, since the SFA texture image has a much wider field of view than the HyTexiLa texture image.
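The patch preparation described above can be sketched as follows (the random seed and function name are our own choices):

```python
import numpy as np

# Center-crop the raw SFA image to 510 x 510, split into 25 non-overlapping
# 102 x 102 patches, and draw a random 12/13 train/test split.
def make_patches(raw, crop=510, patch=102, n_train=12, seed=0):
    h, w = raw.shape
    top, left = (h - crop) // 2, (w - crop) // 2
    roi = raw[top:top + crop, left:left + crop]
    patches = [roi[i:i + patch, j:j + patch]
               for i in range(0, crop, patch)
               for j in range(0, crop, patch)]
    idx = np.random.default_rng(seed).permutation(len(patches))
    train = [patches[i] for i in idx[:n_train]]
    test = [patches[i] for i in idx[n_train:]]
    return train, test
```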
For texture feature extraction and classification, we tested two different algorithms. The first algorithm was proposed by Mihoubi et al. [21] and is a modified LBP algorithm that works directly on the raw SFA image. The second is the method we propose: a learning-based method that utilizes a CNN to extract the texture features. The two methods are described next in detail.
3.3.1. LBP SFA
Mihoubi et al. [
21] developed an LBP-based method that works directly with the raw SFA images. This method is analogous to the OBLBP method used with the HyTexiLa dataset. The algorithm creates a separate LBP histogram for each band of the SFA, so for an SFA with $K$ bands we obtain $K$ histograms. To create the histogram of the $k$-th band, we select the pixels associated with this band and compute the LBP value of each of them by comparing it to its neighboring pixels, which belong to different bands. By comparing these pixels to neighbors from different bands, we also take into account the inter-correlation between the different bands of the SFA capture. The concatenation of all $K$ histograms represents the final texture features, of size $K \times 2^P$. The calculation steps are shown in Figure 4. This algorithm was named MSFA-based LBP (MLBP) and follows Equation (3):

$$\mathrm{MLBP}(p) = \sum_{q=0}^{P-1} s\bigl(I(p_q) - I(p)\bigr)\, 2^q, \tag{3}$$

where $\mathrm{MLBP}(p)$ is the LBP value calculated for pixel $p$ of the raw image $I$, and the $p_q$ are the neighbors of pixel $p$, which belong to different bands. Histogram calculation is described in Equation (4):

$$H^k(j) = \sum_{p \in \mathcal{P}_k} \delta\bigl(\mathrm{MLBP}(p) = j\bigr), \quad j = 0, \ldots, 2^P - 1, \tag{4}$$

where $H^k$ is the histogram of band $k$, calculated from the subset $\mathcal{P}_k$ of pixels associated with band $k$, and $\delta(\cdot)$ equals 1 when its argument is true and 0 otherwise.
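A minimal sketch of this per-band histogram construction, under the assumption of a hypothetical 3 × 3 pattern of band indices and 8-neighborhoods (our own reading of the method, not the authors' code):

```python
import numpy as np

# MLBP-style feature extraction on the raw mosaic: every pixel gets an LBP
# code from its 8 immediate neighbors (which belong to other bands of the
# mosaic), and one histogram is accumulated per band of the SFA pattern.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def mlbp_features(raw, pattern, bins=256):
    h, w = raw.shape
    codes = np.zeros((h - 2, w - 2), dtype=np.int32)
    center = raw[1:-1, 1:-1]
    for q, (dy, dx) in enumerate(OFFSETS):
        neigh = raw[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes += (neigh >= center).astype(np.int32) << q
    # Band index of each interior pixel, from the repeating mosaic pattern.
    rows = np.arange(1, h - 1)[:, None] % pattern.shape[0]
    cols = np.arange(1, w - 1)[None, :] % pattern.shape[1]
    bands = pattern[rows, cols]
    k = pattern.max() + 1
    hists = [np.bincount(codes[bands == b], minlength=bins) for b in range(k)]
    return np.concatenate(hists)   # length K * 2^P
```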
For classification, we used a 1-Nearest Neighbor decision rule, similar to the one we used for the HyTexiLa dataset.
3.3.2. NN SFA
For our proposed method, we use CNNs to develop a learning-based method based on VGG-11 [
26] architecture. We adapted the architecture to our problem by modifying the first layer to work with 1-channel raw images and by reducing the fully connected part to a single layer to decrease the model size. We chose VGG-11 as it is simple and relatively small compared to other commonly used CNN architectures; the smaller size is beneficial in our case because of the small amount of training data we have. We considered the raw SFA image as a grayscale image. The task formulation is described in Equations (6) and (7):

$$\hat{y} = F(x; \theta^*), \tag{6}$$

$$\theta^* = \arg\min_{\theta} \sum_{i} L\bigl(F(x_i; \theta), y_i\bigr), \tag{7}$$

where $F$ represents the model used for the classification task, $\hat{y}$ represents the prediction of the model for input $x$, and $\theta^*$ represents the optimal parameters of the model, obtained by choosing the parameters that minimize the error between the model predictions $F(x_i; \theta)$ and the ground-truth labels $y_i$. $L$ represents the loss function that we use to calculate the model error and optimize its performance.
The texture dataset we have is very small since we only have 25 image patches for each texture. Deep learning models require a large amount of data to produce a robust model with good performance. To overcome this problem, we used the VGG-11 model trained on ImageNet [
27] dataset as our initial weights (pre-trained model). So instead of training the model from scratch, the new model utilizes the weights that were learned to extract features from the ImageNet dataset and adapts them to the texture dataset. This is very helpful when working with a small dataset such as ours. Even though the pre-trained weights from ImageNet are for a very different task on images from a very different domain, so the learned features for the two tasks will be very different, we found that starting from this pre-trained model stabilizes the training and decreases the number of epochs needed. We also used data augmentation by applying random rotation, flipping, and brightness changes to the input patches. We then trained the model on the texture dataset and compared it with the performance of the LBP SFA algorithm and with the performance of OBLBP on the HyTexiLa dataset.
3.4. Evaluation Metric
To evaluate the performance of the different algorithms on texture classification, we used the F1-score metric. The F1-score calculation depends on the computation of precision and recall. Precision (Equation (8)) represents how precise the model predictions are by calculating, out of all the positive predictions the model made, how many were correct. Recall (Equation (9)) represents the model's ability to make correct predictions by measuring, out of all the positive samples in the test set, how many the model predicted correctly. The F1-score (Equation (10)) is then calculated as the harmonic mean of precision and recall:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \tag{8}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}, \tag{9}$$

$$F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \tag{10}$$

where $TP$, $FP$, and $FN$ denote the numbers of true positives, false positives, and false negatives, respectively.
To calculate the F1-score for a classification task with multiple classes, we first calculate the F1-score for each class separately and then average the F1-scores of all classes to obtain the model F1-score. To calculate each class's F1-score separately, we consider all test samples of this class as positive samples and the rest of the test samples as negative samples.
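The macro-averaged F1 computation described above can be sketched as:

```python
# Macro-averaged F1-score: per-class precision, recall, and F1 in a
# one-vs-rest fashion, then a plain average over classes.
def macro_f1(y_true, y_pred):
    classes = sorted(set(y_true))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)
```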
We also produce the confusion matrix [
28] to visualize the model's decisions and see, for each texture, which other textures the model confuses it with. Because of the large size of the confusion matrix, it is only included when necessary, but the confusion matrices for all the experiments were analyzed, and the important conclusions from this analysis are mentioned for each experiment.
5. Conclusions
This article focuses on texture classification applied directly to raw Spectral Filter Array images, without a preliminary demosaicing step. We proposed a method for SFA texture classification based on Convolutional Neural Networks. This CNN is pre-trained on ImageNet and then fine-tuned on spectral data. The model's performance was compared with state-of-the-art methods for SFA texture classification on raw images. Unlike other works that simulate SFA data from hyper-spectral data, we used a dataset captured with an actual SFA sensor. This dataset allows us to evaluate the model's performance under real environmental conditions.
Additionally, we investigated the impact of exposure and illumination on the performance of the different methods. Our experiments showed the strong effect of lighting conditions on the features extracted from the raw SFA texture image. Our model performed better in the majority of cases and adapted better to changes in illumination and intensity as the variety in the training data increased. All the tested models struggled to recognize textures under illuminations different from the training setup, but our model showed a better ability to benefit from additional data and would likely adapt well given more training data with illumination variety. Even though the dataset was fairly small compared to what is usually needed, the CNN-based method still performed better, which shows the ability of the CNN architecture to recognize patterns and extract features.
Our work shows once again that illumination is a key factor in imaging. A future direction for this work would be to see how the performance would vary if we embed the concept of spectral constancy in the architecture.