Towards Comprehensive Multimodal Perception: Introducing the Touch-Language-Vision Dataset

Cheng, Ning; Li, You; Gao, Jing; Fang, Bin; Xu, Jinan; Han, Wenjuan

arXiv:2403.09813 (cs)

[Submitted on 14 Mar 2024 (v1), last revised 17 Jun 2024 (this version, v3)]

Abstract: Tactility provides crucial support and enhancement for the perception and interaction capabilities of both humans and robots. Nevertheless, multimodal research related to touch focuses primarily on the visual and tactile modalities, with limited exploration of language. Beyond individual words, sentence-level descriptions carry richer semantics. Based on this, we construct a touch-language-vision dataset named TLV (Touch-Language-Vision) through human-machine cascade collaboration, featuring sentence-level descriptions for multimodal alignment. The new dataset is used to fine-tune our proposed lightweight training framework, STLV-Align (Synergistic Touch-Language-Vision Alignment), achieving effective semantic alignment with minimal parameter adjustments (1%). Project Page: this https URL.

Comments:	Accepted by ICIC 2024
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Cite as:	arXiv:2403.09813 [cs.CV]
	(or arXiv:2403.09813v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2403.09813

Submission history

From: Ning Cheng
[v1] Thu, 14 Mar 2024 19:01:54 UTC (8,446 KB)
[v2] Fri, 10 May 2024 12:12:30 UTC (1,105 KB)
[v3] Mon, 17 Jun 2024 13:19:26 UTC (1,105 KB)
