
TPSegmentDiff: An Enhanced Diffusion Model for Tactile Paving Image Segmentation

Published: 26 December 2024

Abstract

Diffusion models are widely used in image generation. This study enhances the traditional diffusion model and introduces a diffusion-based method for tactile paving image segmentation called TPSegmentDiff. We also propose a voting mechanism that reduces the random error caused by the random distribution of the initial noise, improving the performance of TPSegmentDiff. In particular, applying the voting mechanism to the TPSegmentDiff-DDPM algorithm improves MIoU by about 5% and the F1-score by about 4%.
Figure 1: We modified the diffusion model and proposed a diffusion-based image segmentation method, TPSegmentDiff, for tactile paving. To address the issue of unstable segmentation results caused by the random distribution of initial noise, we designed a voting mechanism. This mechanism plays a crucial role in improving the stability and reliability of the segmentation process, thereby enhancing the overall performance of TPSegmentDiff.

1 Introduction

Effective recognition and segmentation of tactile paving in computer vision remain challenging due to the diversity of environments and the varying wear conditions of the tactile paving itself. While effective in simple scenarios, traditional image segmentation methods often struggle in complex or dynamic environments, limiting the effectiveness and reliability of intelligent assistive devices in real-world applications[9]. Currently, methods for tactile paving segmentation can be divided into traditional image processing methods and deep learning-based methods.
Traditional methods typically use edge detection (such as Canny edge detection) and morphological processing (such as erosion and dilation) to identify the lines and texture features of tactile paving. Another common approach is color and texture analysis, which uses the color and texture characteristics of tactile paving, employing color space transformation and texture analysis techniques for segmentation. These methods perform poorly under varying lighting conditions and in complex backgrounds[6].
Deep learning introduces new approaches to tactile paving segmentation. Convolutional Neural Networks (CNNs) for tactile paving segmentation can significantly improve detection accuracy and robustness. Typical CNNs like FCN, U-Net, and SegNet have been widely applied to image segmentation tasks, automatically extracting significant object features through end-to-end learning[8][11][1]. Deep semantic segmentation networks such as DeepLab and PSPNet, by introducing techniques like atrous convolution and spatial pyramid pooling, further enhance the precision and detail retention of tactile paving segmentation[12]. Transfer learning leverages models pre-trained on large datasets to achieve good segmentation results on small tactile paving datasets, effectively reducing the need for extensive labeled data[4]. Multimodal fusion of RGB and depth images can enhance the robustness and accuracy of tactile paving segmentation: depth information provides additional geometric features, improving segmentation performance in complex scenes[16].
This study aims to propose a high-precision tactile paving image segmentation method based on diffusion models, named TPSegmentDiff. This method has the potential to significantly improve navigation for visually impaired individuals in complex urban environments. It not only increases the accuracy of the segmentation algorithm but also enhances its adaptability to changes in complex environments, such as varying lighting conditions and the wear state of tactile paving.
The remainder of the paper is organized as follows: Section 2 reviews related work in image segmentation and diffusion models. Section 3 introduces the fundamental concepts of U-Net and diffusion models. Section 4 presents our method, TPSegmentDiff, including adapting diffusion models for tactile paving segmentation. Section 5 presents the experimental analysis, followed by a comparison with existing methods. Finally, Section 6 concludes the paper and suggests future research directions.

2 Related Work

2.1 Image Segmentation

Image segmentation involves dividing an image into multiple regions with distinct features. Early image segmentation techniques were primarily based on simple threshold processing, such as the Otsu method, which determines the optimal threshold by maximizing inter-class variance[20]. More complex edge detection algorithms, like the Sobel and Canny algorithms, handle edge information in images more effectively[13][2].
Advancements in deep learning have significantly propelled the progress of image segmentation techniques. Notably, the U-Net architecture proposed by Ronneberger et al. effectively enhances local and global image information processing through deep networks and skip connections, significantly improving segmentation outcomes in fields like medical imaging[12].
Recent research has made remarkable progress in the field of image segmentation. Oktay et al. introduced the Attention U-Net, which incorporates an attention mechanism to selectively focus on important regions in the image, thereby improving segmentation accuracy and further enhancing U-Net’s performance in medical image segmentation[10].
Chen et al. developed DeepLabV3+, which combines Atrous Spatial Pyramid Pooling (ASPP) with encoder-decoder architecture, significantly enhancing the accuracy and boundary detail handling of image segmentation, particularly excelling in natural scene image segmentation[5].
Wang et al. proposed HRNet (High-Resolution Network), which maintains high-resolution feature maps to enhance the fineness and accuracy of image segmentation. This method is especially suitable for tasks requiring fine segmentation, such as facial recognition and medical image segmentation[17].
Xie et al. developed SegFormer, a Transformer-based model that leverages the global feature extraction capabilities of Transformers, significantly improving segmentation performance, especially in handling complex backgrounds and diverse scenes[18].
Cao et al. introduced Swin-UNet, which combines the strengths of Swin Transformer and U-Net through multi-scale feature fusion and global information capture, achieving high-precision image segmentation. It demonstrates outstanding performance in both medical and remote sensing image segmentation[3].

2.2 Diffusion Model

Diffusion models are generative models based on stochastic processes, initially used to simulate diffusion processes in physical phenomena. In recent years, diffusion models have been introduced into the field of machine learning, demonstrating excellent performance, particularly in image generation tasks. These models achieve the transformation from simple noise data to complex data structures by gradually introducing noise and then progressively removing it during the generation process.
The introduction of diffusion models has opened new research directions in image processing, offering new possibilities for handling complex image data.
Sohl-Dickstein et al. first proposed the theory of diffusion models, which use the inverse process of a Markov chain to denoise and recover data[14].
Ho et al. proposed the DDPM (Denoising Diffusion Probabilistic Models) algorithm, which introduced the diffusion model into the field of image generation[7].
Song et al. subsequently proposed the DDIM (Denoising Diffusion Implicit Models) algorithm, which uses a non-Markovian method to enable skip-step sampling, substantially improving image generation speed[15].
These advancements provide the theoretical foundation and technical support for applying such models in image segmentation tasks.

3 Background

The traditional diffusion models DDPM and DDIM both consist of forward and backward processes.
The forward process is mainly used to corrupt the data by adding noise, which can be described as follows: for given original data $x_0$, Gaussian noise is added step by step as in Eq.1:
\begin{equation}x_{t} = \sqrt {1 - \beta _t} x_{t-1} + \sqrt {\beta _t} \epsilon _t\end{equation}
(1)
where $x_{t-1}$ represents the image at the previous time step, i.e., the image before noise is added at the current step, $x_t$ represents the image after noise is added at the current step, and $\beta_t$ represents the variance used at the current time step. By simplification, the noisy data at time step $t$ can be calculated directly from Eq.2 and Eq.3:
\begin{equation}x_t = \sqrt {\bar{\alpha }_t} x_0 + \sqrt {1 - \bar{\alpha }_t} \epsilon\end{equation}
(2)
\begin{equation}\bar{\alpha }_t = \prod _{i=1}^{t} (1 - \beta _i)\end{equation}
(3)
where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$. $x_t$ can be viewed as a linear combination of the original data $x_0$ and standard Gaussian noise $\epsilon$. When $t$ is large, $x_t$ approximates standard Gaussian noise. At any point in the noise-addition process, $x_t$ can be derived directly from the original data $x_0$.
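For concreteness, the closed-form noising of Eq.2 can be written in a few lines of PyTorch. The sketch below assumes a linear variance schedule for $\beta_t$ with $T = 1000$ steps; the paper does not specify the schedule or its range, so these are illustrative assumptions.

import torch

# Variance schedule (assumed linear; the paper does not specify the schedule).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas                       # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)  # \bar{alpha}_t = prod_{i<=t} alpha_i (Eq. 3)

def q_sample(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Sample x_t directly from x_0 in one step (Eq. 2)."""
    eps = torch.randn_like(x0)             # standard Gaussian noise
    a_bar = alpha_bars[t]
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps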
The backward process is mainly used to recover the data by predicting and removing noise through a parameterized model $\epsilon_\theta(x_t, t)$.
In the traditional DDPM algorithm, the backward process can be described by Eq.4:
\begin{equation}x_{t-1} = \frac{1}{\sqrt {1 - \beta _t}} \left(x_t - \frac{\beta _t}{\sqrt {1 - \bar{\alpha }_t}} \epsilon _\theta (x_t, t) \right) + \sigma _t z\end{equation}
(4)
where $\sigma_t$ represents the noise scale at time step $t$ and $z$ is standard Gaussian noise. Dropping the noise term, we have Eq.5:
\begin{equation}x_{t-1} \approx \frac{1}{\sqrt {\alpha _t}} \left(x_t - \frac{\beta _t}{\sqrt {1 - \bar{\alpha }_t}} \epsilon _\theta (x_t, t) \right)\end{equation}
(5)
In the traditional DDIM algorithm, the backward process can be described by Eq.6:
\begin{equation}x_{t-\tau } = \sqrt {\bar{\alpha }_{t-\tau }} \left(\frac{x_t - \sqrt {1 - \bar{\alpha }_t} \epsilon _\theta (x_t, t)}{\sqrt {\bar{\alpha }_t}} \right) + \sqrt {1 - \bar{\alpha }_{t-\tau } - \sigma _t^2} \epsilon _\theta (x_t, t) + \sigma _t z\end{equation}
(6)
setting $\sigma_t = 0$, we have Eq.7:
\begin{equation}x_{t-\tau } \approx \sqrt {\bar{\alpha }_{t-\tau }} \left(\frac{x_t - \sqrt {1 - \bar{\alpha }_t} \epsilon _\theta (x_t, t)}{\sqrt {\bar{\alpha }_t}} \right) + \sqrt {1 - \bar{\alpha }_{t-\tau }} \epsilon _\theta (x_t, t)\end{equation}
(7)
where $\tau$ represents the step length.
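As a sketch, the deterministic reverse updates of Eq.5 and Eq.7 translate directly into code. Here `eps_pred` stands for the output of the noise prediction model $\epsilon_\theta(x_t, t)$, and the schedule tensors are those defined in the previous sketch; the function names are illustrative.

import torch

@torch.no_grad()
def ddpm_step(x_t, t, eps_pred, betas, alphas, alpha_bars):
    """One DDPM reverse step without the noise term (Eq. 5)."""
    return (x_t - betas[t] / (1.0 - alpha_bars[t]).sqrt() * eps_pred) / alphas[t].sqrt()

@torch.no_grad()
def ddim_step(x_t, t, tau, eps_pred, alpha_bars):
    """One DDIM reverse step of stride tau with sigma_t = 0 (Eq. 7)."""
    # Predicted x_0 from the current noisy sample.
    x0_pred = (x_t - (1.0 - alpha_bars[t]).sqrt() * eps_pred) / alpha_bars[t].sqrt()
    a_prev = alpha_bars[t - tau] if t - tau >= 0 else torch.tensor(1.0)
    return a_prev.sqrt() * x0_pred + (1.0 - a_prev).sqrt() * eps_pred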

4 TPSegmentDiff Method

We propose the TPSegmentDiff method based on the traditional DDPM and DDIM models. Using tactile paving optical images as prior knowledge, the trained TPSegmentDiff is capable of generating high-precision segmentation masks from random noise to achieve tactile paving segmentation.
Input: Image $img$, ground truth $gt$, model parameters $\theta$
Output: Trained model parameters $\theta$
Procedure:
1: while $\theta$ has not converged do
2:   sample a random $t \in (0, 1000)$
3:   generate random noise $\varepsilon_t \sim N(0, I)$
4:   let $gt_t = \sqrt{\bar{\alpha}_t}\, gt + \sqrt{1 - \bar{\alpha}_t}\, \varepsilon_t$ and $x_{in,t} = img \oplus gt_t$
5:   $\varepsilon_{out,t} = model_\theta(x_{in,t}, t)$
6:   compute $loss(\varepsilon_t, \varepsilon_{out,t})$ and update parameters $\theta$
7: return $\theta$
Table 1: TPSegmentDiff Training Algorithm

4.1 TPSegmentDiff Training

Table 1 presents the TPSegmentDiff training algorithm. In traditional diffusion generative models, the noise-predicting U-Net is typically configured with three input channels and three output channels so that it predicts noise for all channels and generates three-channel images. In this study, to generate a single-channel segmentation mask, we adjusted the U-Net's channel parameters during training, setting the input channels to $c_{in} = 4$ and the output channel to $c_{out} = 1$. During training, we preprocess the input data by concatenating the optical image, denoted $img$ with shape [3, h, w], and the ground-truth segmentation mask, denoted $gt$ with shape [1, h, w], along the channel dimension to obtain $x_{in}$. In the concatenation, the segmentation mask $gt$ is placed in the last channel.
In the forward process, the first three channels, $img$, are kept unchanged as prior knowledge, and only the last channel, $gt$, is subjected to noise addition. The process is as follows: Gaussian noise $\epsilon$ is randomly generated and a random $t$ is selected. According to Eq.2, noise corresponding to time step $t$ is added to the last channel $gt$ to obtain $x_{in,t}$, which is then input into the noise prediction model along with the time step $t$. The output is the predicted single-channel noise $\epsilon_{out}$. The predicted noise $\epsilon_{out}$ is then compared with the Gaussian noise $\epsilon$ used during the noise-addition process to calculate the loss, which is used for backpropagation to update the model parameters. Figure 2 visualizes the training algorithm; a training-step sketch follows the figure.
Figure 2: TPSegmentDiff training workflow
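A minimal training-step sketch under stated assumptions: `model` is a U-Net with $c_{in} = 4$ input and $c_{out} = 1$ output channels that accepts the concatenated input and the time step, and the loss is MSE between added and predicted noise; these names, the call signature, and the batch shapes are illustrative, not the paper's exact implementation.

import torch
import torch.nn.functional as F

def train_step(model, optimizer, img, gt, alpha_bars, T=1000):
    """One TPSegmentDiff training step (a sketch).
    img: [B, 3, H, W] optical image; gt: [B, 1, H, W] ground-truth mask."""
    t = torch.randint(0, T, (1,)).item()                   # random time step
    eps = torch.randn_like(gt)                             # noise for the mask channel only
    a_bar = alpha_bars[t]
    gt_t = a_bar.sqrt() * gt + (1.0 - a_bar).sqrt() * eps  # Eq. 2 applied to gt only
    x_in = torch.cat([img, gt_t], dim=1)                   # [B, 4, H, W]; mask in last channel
    eps_out = model(x_in, torch.tensor([t]))               # predicted single-channel noise
    loss = F.mse_loss(eps_out, eps)                        # compare with the added noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()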

4.2 TPSegmentDiff Segmentation

Table 2 presents the TPSegmentDiff segmentation algorithm, which comprises the TPSegmentDiff-DDPM and TPSegmentDiff-DDIM variants, based on traditional DDPM and DDIM respectively.
Input: Model parameters $\theta$, steps $t$, random noise $x_t$, image $x_{img}$, DDIM flag, step length $\tau$
Output: Segmentation mask $x_0$
Procedure:
1: while $t > 0$ do
2:   $x_{in,t} = x_{img} \oplus x_t$
3:   $\varepsilon_t = model_\theta(x_{in,t}, t)$
4:   if DDIM: // TPSegmentDiff-DDIM Algorithm
5:     $x_{t-\tau} = \sqrt{\bar{\alpha}_{t-\tau}}\, (x_t - \sqrt{1 - \bar{\alpha}_t}\, \varepsilon_t) / \sqrt{\bar{\alpha}_t} + \sqrt{1 - \bar{\alpha}_{t-\tau}}\, \varepsilon_t$
6:     $t \leftarrow t - \tau$
7:   else: // TPSegmentDiff-DDPM Algorithm
8:     $x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \varepsilon_t \right) + \sqrt{1 - \alpha_t}\, z, \quad z \sim N(0, I)$
9:     $t \leftarrow t - 1$
10: return $x_0$
Table 2: TPSegmentDiff Segmentation Algorithm
TPSegmentDiff-DDPM: Randomly generate Gaussian noise $x_t$ with shape [1, h, w], set $x_{in,t} = img \oplus x_t$, and input it into the noise prediction model to obtain the predicted noise $\varepsilon_t$. Then, according to Eq.5, calculate $x_{t-1}$. Next, set $x_{in,t-1} = img \oplus x_{t-1}$ and input it into the noise prediction model. This process is repeated iteratively until, after a sufficient number of iterations, $x_0$ is obtained as the segmentation mask.
TPSegmentDiff-DDIM: First, specify a step length $\tau$. Randomly generate Gaussian noise $x_t$ with shape [1, h, w], set $x_{in,t} = img \oplus x_t$, and input it into the noise prediction model to obtain the predicted noise $\varepsilon_t$. Then, according to Eq.7, calculate $x_{t-\tau}$. Next, set $x_{in,t-\tau} = img \oplus x_{t-\tau}$ and input it into the noise prediction model. This process is repeated iteratively until, after a sufficient number of iterations, $x_0$ is obtained as the segmentation mask. Figure 3 visualizes the segmentation algorithm; a sampling-loop sketch follows the figure.
Figure 3: TPSegmentDiff segmentation workflow
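A sketch of the TPSegmentDiff-DDIM sampling loop described above, under the same assumptions as the training sketch (`model`, `alpha_bars`, and the call signature are illustrative). Using $\tau = 1$ with the DDPM update of Table 2 instead of the DDIM update would yield TPSegmentDiff-DDPM.

import torch

@torch.no_grad()
def segment_ddim(model, img, alpha_bars, T=1000, tau=25):
    """TPSegmentDiff-DDIM sampling sketch: the optical image stays fixed as
    prior knowledge while the mask channel is progressively denoised."""
    x_t = torch.randn(img.shape[0], 1, img.shape[2], img.shape[3])  # [B, 1, H, W]
    t = T - 1
    while t > 0:
        x_in = torch.cat([img, x_t], dim=1)        # re-attach the image every step
        eps = model(x_in, torch.tensor([t]))       # predicted noise for the mask channel
        x0_pred = (x_t - (1.0 - alpha_bars[t]).sqrt() * eps) / alpha_bars[t].sqrt()
        t_prev = max(t - tau, 0)
        a_prev = alpha_bars[t_prev] if t_prev > 0 else torch.tensor(1.0)
        x_t = a_prev.sqrt() * x0_pred + (1.0 - a_prev).sqrt() * eps  # Eq. 7
        t = t_prev
    return x_t                                     # final mask estimate x_0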

4.3 TPSegmentDiff Voting Mechanism

Since the generation results of diffusion models are correlated with the initial noise distribution, there is some random error in tactile paving image segmentation. To reduce this random error, we introduce a voting mechanism that generates the segmentation mask multiple times and combines the results into a final decision. In a generated segmentation mask, each pixel value ranges from 0 to 255: the closer to 255, the more likely the pixel belongs to tactile paving, and the closer to 0, the more likely it belongs to the background. For each pixel $i$, we compute the vote as in Eq.8:
\begin{equation}pred_{vote}[i] = \frac{\sum_{k=1}^{n} pred_k[i]}{n}\end{equation}
(8)
where $pred_{vote}[i]$ represents the $i$-th pixel of the voting result, $pred_k[i]$ represents the $i$-th pixel of the $k$-th generated segmentation mask, and $n$ represents the number of generated segmentation masks used in the voting mechanism.
Then we use Eq.9 to decide the final segmentation mask:
\begin{equation}pred_{\text{final}}[i] = \begin{cases} 1, & \text{if } pred_{\text{vote}}[i] \ge 128 \\ 0, & \text{otherwise} \end{cases}\end{equation}
(9)
Equation 9 combines the prediction results of the generated segmentation masks to determine the final segmentation mask, thus reducing random error and enhancing the robustness of the segmentation, while also converting the discrete 0-255 segmentation mask into a 0-1 binary mask.
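Equations 8 and 9 amount to averaging the $n$ generated masks per pixel and thresholding at the midpoint; a minimal sketch, assuming the masks are stacked as a NumPy array:

import numpy as np

def vote(preds: np.ndarray) -> np.ndarray:
    """Voting mechanism (Eq. 8 and Eq. 9).
    preds: [n, H, W] stack of generated masks with values in 0..255."""
    pred_vote = preds.mean(axis=0)               # Eq. 8: per-pixel average over n masks
    return (pred_vote >= 128).astype(np.uint8)   # Eq. 9: binarize at 128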

5 Experiments

5.1 Model Training

We used the TP-Dataset[19] to train the model from scratch. The dataset is a tactile paving image segmentation dataset, including 1,391 tactile paving images from various scenes such as campuses, streets, railway stations, bus stations, subways, communities, and hospitals. The dataset contains RGB three-channel tactile paving images in JPG format and corresponding grayscale single-channel label mask images in PNG format. This dataset has two image segmentation categories: background (marked with 0) and tactile paving (marked with 255). The images have various resolutions with different aspect ratios.
We conducted model training and experiments on a desktop computer equipped with an Intel i9-13900KF, 63.9 GB of RAM, and an Nvidia GeForce RTX 4080 16 GB GPU. During model training, we uniformly scaled the training images to 128 pixels on the longest side, centered them, and filled the surrounding area with black pixels to make them 128×128 pixels (a sketch of this preprocessing follows). The learning rate was fixed at 1e-4, and we did not use half precision (FP16). The batch size was set to 8, and training was conducted on the described experimental platform. A total of $3.4 \times 10^5$ training steps were performed, which took about 50 hours.
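A sketch of the described preprocessing (longest side scaled to 128 pixels, centered on a black 128×128 canvas) using PIL; the resampling filter is left at PIL's default, an assumption since the paper does not specify it.

from PIL import Image

def preprocess(path: str, size: int = 128) -> Image.Image:
    """Scale the longest side to `size`, center, and pad with black pixels."""
    img = Image.open(path).convert("RGB")
    scale = size / max(img.size)                        # longest side -> `size`
    img = img.resize((max(1, round(img.width * scale)),
                      max(1, round(img.height * scale))))
    canvas = Image.new("RGB", (size, size), (0, 0, 0))  # black background
    canvas.paste(img, ((size - img.width) // 2, (size - img.height) // 2))
    return canvas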

5.2 Experimental Analysis

In order to test the performance of TPSegmentDiff under different conditions, we conducted in-depth experiments.
All TPSegmentDiff-DDPM experiments used 1000 sampling steps to generate segmentation results. For the TPSegmentDiff-DDIM algorithm, unless otherwise specified, a step length of 25 was used with 40 sampling steps, which is equivalent to 1000 steps and aligns with TPSegmentDiff-DDPM's sampling process. The performance metrics used for evaluation are Mean Intersection over Union (MIoU) and F1-score; a sketch of how these can be computed for binary masks follows.
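The following sketch computes the two metrics for a pair of binary masks. The exact conventions (per-image vs. dataset-level averaging, and the class set for MIoU) are assumptions, as the paper does not spell them out; here MIoU averages the IoU of the background and paving classes, and F1 is computed on the paving class.

import numpy as np

def miou_f1(pred: np.ndarray, gt: np.ndarray):
    """MIoU (mean of the two class IoUs) and F1 on the paving class.
    pred, gt: {0, 1} arrays of the same shape."""
    ious = []
    for cls in (0, 1):
        inter = np.logical_and(pred == cls, gt == cls).sum()
        union = np.logical_or(pred == cls, gt == cls).sum()
        ious.append(inter / union if union else 1.0)   # empty class counts as perfect
    tp = np.logical_and(pred == 1, gt == 1).sum()
    precision = tp / pred.sum() if pred.sum() else 0.0
    recall = tp / gt.sum() if gt.sum() else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return float(np.mean(ious)), f1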
Algorithm          | Pic Time (s) | FPS   | Step Time (s)
TPSegmentDiff-DDPM | 14.691       | 0.068 | 0.0146
TPSegmentDiff-DDIM | 0.590        | 1.693 | 0.0147
Table 3: Speed comparison of TPSegmentDiff-DDPM and TPSegmentDiff-DDIM
In Table 3, pic time represents the time taken to segment one image, and step time represents the time taken by one denoising step. The results indicate that the step time of TPSegmentDiff-DDIM is close to that of TPSegmentDiff-DDPM, suggesting that adopting TPSegmentDiff-DDIM does not introduce significant additional computational overhead. By applying the TPSegmentDiff-DDIM algorithm with a larger step length, the generation speed of the segmentation mask can be improved substantially.
Figure 4: The performance of TPSegmentDiff-DDPM and TPSegmentDiff-DDIM
Figure 4 compares the performance of TPSegmentDiff-DDPM and TPSegmentDiff-DDIM on the task of tactile paving image segmentation. It shows results for both models with and without the voting mechanism.
Without voting, TPSegmentDiff-DDPM achieves an MIoU of 0.584 and an F1-score of 0.672. When the voting mechanism is applied, TPSegmentDiff-DDPM's performance improves significantly, with an MIoU of 0.632 and an F1-score of 0.711. In contrast, TPSegmentDiff-DDIM shows an MIoU of 0.665 and an F1-score of 0.731 without voting, and these metrics remain unchanged when the voting mechanism is applied. This indicates that while the voting mechanism enhances the performance of TPSegmentDiff-DDPM by reducing random errors and increasing segmentation accuracy, TPSegmentDiff-DDIM already exhibits robust performance without additional voting, highlighting its inherent stability and efficiency in this segmentation task.
Figure 5: The impact of the step length setting on TPSegmentDiff-DDIM segmentation performance (kept equivalent to 1000 total sampling steps)
Figure 5 illustrates the impact of varying the step length in TPSegmentDiff-DDIM on the model's performance, measured by MIoU and F1-score. The experimental setup varies the step length and the number of sampling steps jointly so that they remain equivalent to 1000 sampling steps, for comparison with TPSegmentDiff-DDPM at 1000 sampling steps.
As observed, when the step length is small, such as 10, the performance metrics, MIoU and F1-score, remain high, with values of approximately 0.666 and 0.732, respectively. This indicates the model effectively captures the data’s underlying distribution with frequent updates. As the step length increases to 20 and 25, the performance remains relatively stable, with minor fluctuations, suggesting a balanced trade-off between computational efficiency and model accuracy.
However, when the step length is increased beyond 50, to 100 and 200, a notable decline in both MIoU and F1-score is observed. The MIoU drops from 0.660 at a step length of 50 to 0.638 at 100, and drastically to 0.416 at 200. Similarly, the F1-score decreases from 0.721 to 0.438 as the step length increases. This decline highlights the limitations of very large step lengths, which result in fewer samples and less accurate approximations of the target distribution, ultimately impairing the model's ability to generate precise segmentation outputs.
The analysis suggests that while larger step lengths reduce computational overhead, they may compromise the model's ability to capture the fine-grained details necessary for high-quality segmentation. Consequently, selecting an optimal step length is crucial for balancing computational efficiency and segmentation accuracy.
In general, the TPSegmentDiff-DDIM algorithm achieves the best trade-off between segmentation speed and quality when the step length is set to 25. Overall, the series of experiments verifies that TPSegmentDiff performs well on the tactile paving image segmentation task.

6 Summary

This study adapts diffusion models to tactile paving image segmentation, significantly improving the model’s segmentation accuracy and robustness in complex environments. This advancement provides a new approach to tactile paving image segmentation. We have demonstrated the potential of diffusion models in handling high-variability and noise-interfered image segmentation tasks. We also introduced a voting mechanism to reduce the random error caused by the random distribution of the initial noise. The study offers theoretical support and practical evidence for the future application of diffusion models in various scenarios, such as pedestrian and vehicle image segmentation tasks, to aid autonomous vehicles and mobile robots.
Future work should optimize the use of prior knowledge, ensuring that the method can extract valuable information from it more effectively during training while avoiding overfitting caused by its repetitive use during segmentation. Additionally, techniques such as distillation and pruning could reduce model complexity and further increase segmentation speed.

References

[1]
Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. 2017. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39, 12 (Dec 2017), 2481–2495.
[2]
John Canny. 1986. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 8, 6 (1986), 679–698.
[3]
Hu Cao, Yueyue Wang, Joy Chen, Dongsheng Jiang, Xiaopeng Zhang, Qi Tian, and Manning Wang. 2022. Swin-unet: Unet-like pure transformer for medical image segmentation. In European conference on computer vision. Springer, 205–218.
[4]
Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. 2017. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40, 4 (2017), 834–848.
[5]
Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. 2018. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV). 801–818.
[6]
Manzoor Ahmed Hashmani, Mehak Maqbool Memon, and Kamran Raza. 2020. Semantic Segmentation for Visually Adverse Images – A Critical Review. In 2020 International Conference on Computational Intelligence (ICCI). 28–33.
[7]
Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in neural information processing systems 33 (2020), 6840–6851.
[8]
J. Long, E. Shelhamer, and T. Darrell. 2015. Fully convolutional networks for semantic segmentation. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, Los Alamitos, CA, USA, 3431–3440.
[9]
Rashmi B N. 2024. Survey on Real-world Applications and Challenges of Deep Learning-Enhanced Techniques to assist visually impaired. International Journal of Intelligent Systems and Applications in Engineering 12, 3 (Mar. 2024), 3772–3790. https://ijisae.org/index.php/IJISAE/article/view/6054
[10]
Ozan Oktay, Jo Schlemper, Loic Le Folgoc, Matthew Lee, Mattias Heinrich, Kazunari Misawa, Kensaku Mori, Steven McDonagh, Nils Y Hammerla, Bernhard Kainz, et al. 2018. Attention U-Net: Learning where to look for the pancreas. arXiv preprint arXiv:1804.03999 (2018).
[11]
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, Nassir Navab, Joachim Hornegger, William M. Wells, and Alejandro F. Frangi (Eds.). Springer International Publishing, Cham, 234–241.
[12]
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18. Springer, 234–241.
[13]
Irwin Sobel and G. M. Feldman. 1990. An Isotropic 3×3 image gradient operator. https://api.semanticscholar.org/CorpusID:59909525
[14]
Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. In Proceedings of the 32nd International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 37), Francis Bach and David Blei (Eds.). PMLR, Lille, France, 2256–2265. https://proceedings.mlr.press/v37/sohl-dickstein15.html
[15]
Jiaming Song, Chenlin Meng, and Stefano Ermon. 2020. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020).
[16]
Changshuo Wang, Chen Wang, Weijun Li, and Haining Wang. 2021. A brief survey on RGB-D semantic segmentation using deep learning. Displays 70 (2021), 102080.
[17]
Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, et al. 2020. Deep high-resolution representation learning for visual recognition. IEEE transactions on pattern analysis and machine intelligence 43, 10 (2020), 3349–3364.
[18]
Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. 2021. SegFormer: Simple and efficient design for semantic segmentation with transformers. Advances in neural information processing systems 34 (2021), 12077–12090.
[19]
Xingli Zhang, Lei Liang, Shenglu Zhao, and Zhihui Wang. 2024. GRFB-UNet: A new multi-scale attention network with group receptive field block for tactile paving segmentation. Expert Systems with Applications 238 (2024), 122109.
[20]
Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. 2017. Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2881–2890.
