Pre-trained Language Model for Web-scale Retrieval in Baidu Search

Liu, Yiding; Huang, Guan; Liu, Jiaxiang; Lu, Weixue; Cheng, Suqi; Li, Yukun; Shi, Daiting; Wang, Shuaiqiang; Cheng, Zhicong; Yin, Dawei

arXiv:2106.03373 (cs)

[Submitted on 7 Jun 2021 (v1), last revised 16 Oct 2021 (this version, v4)]

Abstract: Retrieval is a crucial stage in web search that identifies a small set of query-relevant candidates from a billion-scale corpus. Discovering more semantically related candidates in the retrieval stage is a promising way to expose more high-quality results to end users. However, building and deploying effective retrieval models for semantic matching in a real search engine remains a non-trivial challenge. In this paper, we describe the retrieval system that we developed and deployed in Baidu Search. The system exploits a recent state-of-the-art Chinese pretrained language model, Enhanced Representation through kNowledge IntEgration (ERNIE), which provides the system with expressive semantic matching. In particular, we developed an ERNIE-based retrieval model equipped with 1) expressive Transformer-based semantic encoders and 2) a comprehensive multi-stage training paradigm. More importantly, we present a practical system workflow for deploying the model in web-scale retrieval. The system has been fully deployed in production, and rigorous offline and online experiments were conducted. The results show that the system can perform high-quality candidate retrieval, especially for tail queries with uncommon demands. Overall, the new retrieval system, powered by a pretrained language model (i.e., ERNIE), can substantially improve the usability and applicability of our search engine.

Comments:	Accepted by KDD 2021
Subjects:	Information Retrieval (cs.IR)
Cite as:	arXiv:2106.03373 [cs.IR]
	(or arXiv:2106.03373v4 [cs.IR] for this version)
	https://doi.org/10.48550/arXiv.2106.03373

Submission history

From: Yiding Liu Dr.
[v1] Mon, 7 Jun 2021 06:55:45 UTC (517 KB)
[v2] Fri, 25 Jun 2021 13:32:13 UTC (518 KB)
[v3] Wed, 30 Jun 2021 05:38:58 UTC (518 KB)
[v4] Sat, 16 Oct 2021 15:12:57 UTC (518 KB)
