Pedestrian detection underpins many pedestrian-related applications and studies and has received extensive attention in recent years. The end-to-end DEtection TRansformer (DETR) avoids hand-designed components and achieves better results than convolutional neural networks (CNNs) in general object detection. Inspired by this, we present the Improved Deformable-DETR for crowd Pedestrian Detection (IDPD). First, we propose a dynamic neck that uses omni-dimensional dynamic convolution to change the number of channels in the neck feature maps, alleviating the loss of pedestrian information caused by channel reduction. Second, we design a hybrid decoding loss that combines a one-to-one Hungarian matching loss, a one-to-many Hungarian matching auxiliary loss, and a reconstruction loss, based on the contrastive denoising method, for reconstructing full-body boxes from noisy visible-part boxes. This loss tackles the slow convergence of Deformable-DETR in crowd pedestrian detection, which is caused by a more severe imbalance between positive and negative samples and by unstable bipartite matching. We evaluate IDPD on the CrowdHuman validation set. With a ResNet-50 backbone, it achieves 93.22% AP, 39.22% MR\(^{-2}\), and 85.02% JI, outperforming the Deformable-DETR baseline and surpassing CNN-based models. With a Swin-T backbone, the results improve further to 94.16% AP, 37.05% MR\(^{-2}\), and 86.07% JI.
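
To make the structure of the hybrid decoding loss concrete, the following is a minimal sketch under stated assumptions and not the implementation evaluated here. It assumes an L1 box cost, forms the one-to-many branch by repeating each ground-truth box `k` times before matching, and treats the reconstruction term as a direct L1 loss between each reconstructed box and its full-body ground truth. The names `l1_cost`, `match`, and `hybrid_loss`, and the parameters `k`, `w_aux`, and `w_rec`, are hypothetical. A brute-force search over permutations stands in for the Hungarian algorithm and is practical only for toy inputs.

```python
from itertools import permutations


def l1_cost(pred, gt):
    """L1 distance between two boxes given as (x1, y1, x2, y2)."""
    return sum(abs(p - g) for p, g in zip(pred, gt))


def match(preds, gts):
    """Minimum-cost one-to-one assignment of each ground truth to a prediction.

    Brute force over permutations; it finds the same minimum cost as the
    Hungarian algorithm but is only practical for tiny inputs.
    Requires len(preds) >= len(gts).
    Returns (pairs, cost), where pairs holds (pred_index, gt_index) tuples.
    """
    best_pairs, best_cost = [], float("inf")
    for pred_idxs in permutations(range(len(preds)), len(gts)):
        cost = sum(l1_cost(preds[p], gts[g]) for g, p in enumerate(pred_idxs))
        if cost < best_cost:
            best_pairs = [(p, g) for g, p in enumerate(pred_idxs)]
            best_cost = cost
    return best_pairs, best_cost


def hybrid_loss(preds, aux_preds, recon_preds, gts, k=2, w_aux=1.0, w_rec=1.0):
    """Weighted sum of the three loss terms (weights are hypothetical)."""
    # One-to-one branch: each ground truth is matched to exactly one prediction.
    _, loss_o2o = match(preds, gts)
    # One-to-many branch (assumed form): repeat each ground truth k times so
    # that k auxiliary predictions are supervised by the same box.
    repeated = [gt for gt in gts for _ in range(k)]
    _, loss_o2m = match(aux_preds, repeated)
    # Reconstruction branch (assumed form): recon_preds[i] is the full-body box
    # decoded from a noised visible-part box of gts[i], so no matching is needed.
    loss_rec = sum(l1_cost(p, g) for p, g in zip(recon_preds, gts))
    return loss_o2o + w_aux * loss_o2m + w_rec * loss_rec


# Toy data: two full-body ground-truth boxes and three sets of predictions.
gts = [(21, 20, 30, 30), (0, 0, 10, 12)]
preds = [(0, 0, 10, 10), (20, 20, 30, 30), (50, 50, 60, 60)]
aux_preds = [(0, 0, 10, 10), (0, 0, 10, 11), (20, 20, 30, 30), (22, 20, 30, 30)]
recon_preds = [(21, 20, 30, 31), (0, 0, 10, 12)]

total = hybrid_loss(preds, aux_preds, recon_preds, gts)
```

In this sketch the one-to-many branch supplies `k` positive samples per pedestrian instead of one, which is one way such an auxiliary loss can ease the imbalance between positive and negative samples. The reconstruction branch needs no matching because each noised query is tied to a known ground-truth box.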
