End-to-End Learnable Multi-Scale Feature Compression for VCM

Kim, Yeongwoong; Jeong, Hyewon; Yu, Janghyun; Kim, Younhee; Lee, Jooyoung; Jeong, Se Yoon; Kim, Hui Yong

doi:10.1109/TCSVT.2023.3302858


arXiv:2306.16670 (cs)

[Submitted on 29 Jun 2023 (v1), last revised 8 Aug 2023 (this version, v3)]



Abstract: The proliferation of deep learning-based machine vision applications has given rise to a new type of compression, so-called video coding for machines (VCM). VCM differs from traditional video coding in that it is optimized for machine vision performance rather than human visual quality. In the feature compression track of MPEG-VCM, the multi-scale features extracted from images are the target of compression. Recent feature compression works have demonstrated that an approach based on the versatile video coding (VVC) standard can achieve a BD-rate reduction of up to 96% against the MPEG-VCM feature anchor. However, this approach is still sub-optimal, because VVC was designed for natural images rather than extracted features. Moreover, the high encoding complexity of VVC makes it difficult to design a lightweight encoder without sacrificing performance. To address these challenges, we propose a novel multi-scale feature compression method that enables both end-to-end optimization on the extracted features and the design of lightweight encoders. The proposed model combines a learnable compressor with a multi-scale feature fusion network so that the redundancy in the multi-scale features is effectively removed. Instead of simply cascading the fusion network and the compression network, we integrate the fusion and encoding processes in an interleaved way. Our model first encodes a larger-scale feature to obtain a latent representation and then fuses the latent with a smaller-scale feature. This process is repeated until the smallest-scale feature has been fused, and the encoded latent at the final stage is then entropy-coded for transmission. The results show that our model outperforms previous approaches by at least 52% BD-rate reduction and has $5\times$ to $27\times$ shorter encoding time for object detection...
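
The interleaved encode-then-fuse order described in the abstract can be illustrated with a toy sketch. This is not the authors' network: the names `downsample2x`, `fuse`, and `interleaved_encode` are hypothetical, 2x2 average pooling stands in for a learned encoding layer, an element-wise mean stands in for a learned fusion block, and each pyramid level is assumed to be a single-channel map exactly half the size of the previous one. Entropy coding is omitted. Only the control flow (encode the larger scale, fuse with the next smaller scale, repeat) is taken from the abstract.

```python
from typing import List

Feature = List[List[float]]  # toy single-channel 2-D feature map


def downsample2x(x: Feature) -> Feature:
    """Assumed stand-in for a learned strided encoder: 2x2 average pooling."""
    h, w = len(x), len(x[0])
    return [
        [
            (x[i][j] + x[i][j + 1] + x[i + 1][j] + x[i + 1][j + 1]) / 4.0
            for j in range(0, w, 2)
        ]
        for i in range(0, h, 2)
    ]


def fuse(latent: Feature, feature: Feature) -> Feature:
    """Assumed stand-in for a learned fusion block: element-wise mean."""
    if len(latent) != len(feature) or len(latent[0]) != len(feature[0]):
        raise ValueError("latent and feature must have the same shape")
    return [
        [(a + b) / 2.0 for a, b in zip(row_l, row_f)]
        for row_l, row_f in zip(latent, feature)
    ]


def interleaved_encode(pyramid: List[Feature]) -> Feature:
    """Encode and fuse in alternation, from the largest scale to the smallest.

    `pyramid` is ordered largest scale first; each level is assumed to be
    half the height and width of the level before it.
    """
    latent = pyramid[0]
    for smaller in pyramid[1:]:
        latent = downsample2x(latent)   # encode the larger-scale content
        latent = fuse(latent, smaller)  # fuse with the next smaller scale
    # In the paper, the final-stage latent is entropy-coded; omitted here.
    return latent


if __name__ == "__main__":
    pyramid = [
        [[8.0] * 8 for _ in range(8)],  # largest scale
        [[0.0] * 4 for _ in range(4)],
        [[0.0] * 2 for _ in range(2)],  # smallest scale
    ]
    latent = interleaved_encode(pyramid)
    print(len(latent), len(latent[0]))  # shape of the final latent
```

The point of the sketch is that only one latent, at the smallest spatial size, leaves the loop, whereas a cascaded design would first fuse all scales and only then run a separate compressor.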

Comments:	13 pages, accepted by IEEE Transactions on Circuits and Systems for Video Technology
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Cite as:	arXiv:2306.16670 [cs.CV]
	(or arXiv:2306.16670v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2306.16670
Related DOI:	https://doi.org/10.1109/TCSVT.2023.3302858

Submission history

From: Yeongwoong Kim
[v1] Thu, 29 Jun 2023 04:05:13 UTC (5,262 KB)
[v2] Sun, 16 Jul 2023 19:50:49 UTC (5,305 KB)
[v3] Tue, 8 Aug 2023 05:00:58 UTC (5,285 KB)
