Towards Practical and Efficient Image-to-Speech Captioning with Vision-Language Pre-Training and Multi-Modal Tokens | IEEE Conference Publication | IEEE Xplore
Location via proxy:
[ UP ]
[Report a bug]
[Manage cookies]
No cookies
No scripts
No ads
No referrer
Show this form