Unsupervised Document Embedding via Contrastive Augmentation

Luo, Dongsheng; Cheng, Wei; Ni, Jingchao; Yu, Wenchao; Zhang, Xuchao; Zong, Bo; Liu, Yanchi; Chen, Zhengzhang; Song, Dongjin; Chen, Haifeng; Zhang, Xiang

arXiv:2103.14542v1 (cs)

[Submitted on 26 Mar 2021]

Abstract: We present a contrastive learning approach with data augmentation techniques to learn document representations in an unsupervised manner. Inspired by recent contrastive self-supervised learning algorithms used for image and NLP pretraining, we hypothesize that a high-quality document embedding should be invariant to diverse paraphrases that preserve the semantics of the original document. With different backbones and contrastive learning frameworks, our study reveals the enormous benefits of contrastive augmentation for document representation learning, with two additional insights: 1) including data augmentation in a contrastive way can substantially improve the embedding quality in unsupervised document representation learning, and 2) in general, stochastic augmentations generated by simple word-level manipulation work much better than sentence-level and document-level ones. We plug our method into a classifier and compare it with a broad range of baseline methods on six benchmark datasets. Our method can decrease the classification error rate by up to 6.4% over the SOTA approaches on the document classification task, matching or even surpassing fully-supervised methods.
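The idea in the abstract, that two stochastically augmented views of the same document should embed close together while views of different documents should not, can be illustrated with a minimal sketch. Everything named here is an assumption for illustration and is not taken from the paper: `word_dropout` stands in for "simple word-level manipulation", `bow_embed` is a toy bag-of-words encoder in place of a real backbone, and `info_nce` is one common contrastive objective (the abstract does not say which loss or augmentations the authors use).

```python
import math
import random


def word_dropout(tokens, p=0.1, rng=random):
    """Hypothetical word-level augmentation: drop each token with probability p."""
    kept = [t for t in tokens if rng.random() >= p]
    # Never return an empty document; fall back to the original tokens.
    return kept if kept else list(tokens)


def bow_embed(tokens, vocab):
    """Toy encoder (assumption): bag-of-words counts over a fixed vocabulary."""
    vec = [0.0] * len(vocab)
    for t in tokens:
        vec[vocab[t]] += 1.0
    return vec


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.5):
    """Mean InfoNCE loss: anchors[i] should match positives[i], not positives[j != i]."""
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [cosine(a, p) / temperature for p in positives]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


# Two toy documents, each augmented twice to form a positive pair.
docs = [
    "the movie was great fun".split(),
    "stock prices fell sharply today".split(),
]
vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in d}))}
rng = random.Random(0)
views_a = [bow_embed(word_dropout(d, 0.2, rng), vocab) for d in docs]
views_b = [bow_embed(word_dropout(d, 0.2, rng), vocab) for d in docs]
loss = info_nce(views_a, views_b)
```

In a trainable setting the encoder would have parameters, and minimizing this loss would push the embedding toward invariance under the chosen augmentation. The fixed bag-of-words encoder here only shows how the positive and negative pairs are formed.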

Comments:	13 pages; under review
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2103.14542 [cs.CL]
	(or arXiv:2103.14542v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2103.14542

Submission history

From: Dongsheng Luo
[v1] Fri, 26 Mar 2021 15:48:52 UTC (764 KB)
