Extracting Text Representations for Terms and Phrases in Technical Domains

Fusco, Francesco; Antognini, Diego

arXiv:2305.15867 (cs)

[Submitted on 25 May 2023]

Abstract: Extracting dense representations for terms and phrases is a task of great importance for knowledge discovery platforms targeting highly technical fields. Dense representations are used as features for downstream components and have multiple applications ranging from ranking results in search to summarization. Common approaches to create dense representations include training domain-specific embeddings with self-supervised setups or using sentence encoder models trained over similarity tasks. In contrast to static embeddings, sentence encoders do not suffer from the out-of-vocabulary (OOV) problem, but impose significant computational costs. In this paper, we propose a fully unsupervised approach to text encoding that consists of training small character-based models with the objective of reconstructing large pre-trained embedding matrices. Models trained with this approach not only match the quality of sentence encoders in technical domains, but are also 5 times smaller and up to 10 times faster, even on high-end GPUs.
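
The reconstruction objective described in the abstract can be illustrated with a toy sketch. This is not the authors' architecture or training setup: the hashed character trigram averaging model, the names `TOY_EMBEDDINGS`, `features`, `encode`, and `train`, the hyperparameters, and the four hand-written target vectors are all assumptions made purely for illustration. The sketch only shows the general idea of fitting a small character-based encoder so that its output approximates a given embedding for each known term, which then lets it produce a vector for a term that has no entry in the original matrix.

```python
import random

DIM = 4        # size of the toy "pre-trained" vectors (illustrative)
BUCKETS = 64   # number of hashed character n-gram features (illustrative)

# Hand-written stand-in for a large pre-trained embedding matrix.
# These numbers are made up; they do not come from any real model.
TOY_EMBEDDINGS = {
    "polymer":    [0.9, 0.1, 0.0, 0.2],
    "polymerase": [0.7, 0.6, 0.1, 0.0],
    "monomer":    [0.8, 0.0, 0.3, 0.1],
    "catalyst":   [0.1, 0.2, 0.9, 0.4],
}

def char_ngrams(term, n=3):
    """Split a term into character n-grams, with boundary markers."""
    padded = f"<{term}>"
    return [padded[i:i + n] for i in range(max(1, len(padded) - n + 1))]

def bucket(ngram):
    """Deterministic string hash (the built-in hash() is salted per process)."""
    h = 0
    for ch in ngram:
        h = (h * 31 + ord(ch)) % BUCKETS
    return h

def features(term):
    """Bucket index of every character n-gram in the term."""
    return [bucket(g) for g in char_ngrams(term)]

def encode(term, weights):
    """Average the weight rows selected by the term's hashed n-grams."""
    idxs = features(term)
    return [sum(weights[i][d] for i in idxs) / len(idxs) for d in range(DIM)]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def train(targets, epochs=1000, lr=0.5, seed=0):
    """Fit the weights with SGD on the squared reconstruction error."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(DIM)]
               for _ in range(BUCKETS)]
    for _ in range(epochs):
        for term, target in targets.items():
            idxs = features(term)
            pred = encode(term, weights)
            for d in range(DIM):
                # Gradient of (pred - target)^2 w.r.t. each averaged row.
                grad = 2.0 * (pred[d] - target[d]) / len(idxs)
                for i in idxs:
                    weights[i][d] -= lr * grad
    return weights

weights = train(TOY_EMBEDDINGS)

# Reconstruction error on the terms the toy matrix covers.
errors = {t: mse(encode(t, weights), v) for t, v in TOY_EMBEDDINGS.items()}

# A term absent from the toy matrix still gets a vector, because the
# encoder works on characters rather than on a fixed vocabulary.
oov_vector = encode("polymers", weights)
```

Because the encoder is defined over characters, it has no fixed vocabulary, which is the property the abstract contrasts with static embeddings. The quality of the vector for an unseen term depends entirely on the model and the training data, and this toy example makes no claim about it.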

Comments: Accepted at ACL 2023 (industry). 10 pages, 3 figures, 5 tables
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as: arXiv:2305.15867 [cs.CL] (or arXiv:2305.15867v1 [cs.CL] for this version)
DOI: https://doi.org/10.48550/arXiv.2305.15867

Submission history

From: Francesco Fusco
[v1] Thu, 25 May 2023 08:59:36 UTC (674 KB)
