Showing 1–8 of 8 results for author: Overwijk, A

Searching in archive cs.
  1. arXiv:2307.00342  [pdf, other]

    cs.CL cs.IR

    Improving Multitask Retrieval by Promoting Task Specialization

    Authors: Wenzheng Zhang, Chenyan Xiong, Karl Stratos, Arnold Overwijk

    Abstract: In multitask retrieval, a single retriever is trained to retrieve relevant contexts for multiple tasks. Despite its practical appeal, naive multitask retrieval lags behind task-specific retrieval in which a separate retriever is trained for each task. We show that it is possible to train a multitask retriever that outperforms task-specific retrievers by promoting task specialization. The main ingr…

    Submitted 1 July, 2023; originally announced July 2023.

    Comments: TACL 2023

  2. arXiv:2302.03754  [pdf, other]

    cs.CL

    Augmenting Zero-Shot Dense Retrievers with Plug-in Mixture-of-Memories

    Authors: Suyu Ge, Chenyan Xiong, Corby Rosset, Arnold Overwijk, Jiawei Han, Paul Bennett

    Abstract: In this paper we improve the zero-shot generalization ability of language models via Mixture-Of-Memory Augmentation (MoMA), a mechanism that retrieves augmentation documents from multiple information corpora ("external memories"), with the option to "plug in" new memory at inference time. We develop a joint learning mechanism that trains the augmentation component with latent labels derived from t…

    Submitted 7 February, 2023; originally announced February 2023.

  3. arXiv:2211.15848  [pdf, other]

    cs.IR cs.AI cs.CL

    ClueWeb22: 10 Billion Web Documents with Visual and Semantic Information

    Authors: Arnold Overwijk, Chenyan Xiong, Xiao Liu, Cameron VandenBerg, Jamie Callan

    Abstract: ClueWeb22, the newest iteration of the ClueWeb line of datasets, provides 10 billion web pages affiliated with rich information. Its design was influenced by the need for a high quality, large scale web corpus to support a range of academic and industry research, for example, in information systems, retrieval-augmented AI systems, and model pretraining. Compared with earlier ClueWeb corpora, the C…

    Submitted 1 December, 2022; v1 submitted 28 November, 2022; originally announced November 2022.

  4. arXiv:2210.17167  [pdf, other]

    cs.CL

    Reduce Catastrophic Forgetting of Dense Retrieval Training with Teleportation Negatives

    Authors: Si Sun, Chenyan Xiong, Yue Yu, Arnold Overwijk, Zhiyuan Liu, Jie Bao

    Abstract: In this paper, we investigate the instability in the standard dense retrieval training, which iterates between model training and hard negative selection using the being-trained model. We show the catastrophic forgetting phenomena behind the training instability, where models learn and forget different negative groups during training iterations. We then propose ANCE-Tele, which accumulates momentu…

    Submitted 31 October, 2022; originally announced October 2022.

    Comments: Accepted to EMNLP 2022 main conference

  5. arXiv:2210.15212  [pdf, other]

    cs.CL cs.IR cs.LG

    COCO-DR: Combating Distribution Shifts in Zero-Shot Dense Retrieval with Contrastive and Distributionally Robust Learning

    Authors: Yue Yu, Chenyan Xiong, Si Sun, Chao Zhang, Arnold Overwijk

    Abstract: We present a new zero-shot dense retrieval (ZeroDR) method, COCO-DR, to improve the generalization ability of dense retrieval by combating the distribution shifts between source training tasks and target scenarios. To mitigate the impact of document differences, COCO-DR continues pretraining the language model on the target corpora to adapt the model to target distributions via COtinuous COtrastiv…

    Submitted 24 November, 2022; v1 submitted 27 October, 2022; originally announced October 2022.

    Comments: EMNLP 2022 (Main Conference). The code and Model can be found at https://github.com/OpenMatch/COCO-DR

    Journal ref: EMNLP 2022

  6. arXiv:2102.09206  [pdf, other]

    cs.LG

    Less is More: Pre-train a Strong Text Encoder for Dense Retrieval Using a Weak Decoder

    Authors: Shuqi Lu, Di He, Chenyan Xiong, Guolin Ke, Waleed Malik, Zhicheng Dou, Paul Bennett, Tieyan Liu, Arnold Overwijk

    Abstract: Dense retrieval requires high-quality text sequence embeddings to support effective search in the representation space. Autoencoder-based language models are appealing in dense retrieval as they train the encoder to output high-quality embedding that can reconstruct the input texts. However, in this paper, we provide theoretical analyses and show empirically that an autoencoder language model with…

    Submitted 16 September, 2021; v1 submitted 18 February, 2021; originally announced February 2021.

  7. arXiv:2007.00808  [pdf, other]

    cs.IR cs.CL cs.LG

    Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval

    Authors: Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, Arnold Overwijk

    Abstract: Conducting text retrieval in a dense learned representation space has many intriguing advantages over sparse retrieval. Yet the effectiveness of dense retrieval (DR) often requires combination with sparse retrieval. In this paper, we identify that the main bottleneck is in the training mechanisms, where the negative instances used in training are not representative of the irrelevant documents in t…

    Submitted 20 October, 2020; v1 submitted 1 July, 2020; originally announced July 2020.

  8. arXiv:1911.02671  [pdf, other]

    cs.CL cs.IR

    Open Domain Web Keyphrase Extraction Beyond Language Modeling

    Authors: Lee Xiong, Chuan Hu, Chenyan Xiong, Daniel Campos, Arnold Overwijk

    Abstract: This paper studies keyphrase extraction in real-world scenarios where documents are from diverse domains and have variant content quality. We curate and release OpenKP, a large scale open domain keyphrase extraction dataset with near one hundred thousand web documents and expert keyphrase annotations. To handle the variations of domain and content quality, we develop BLING-KPE, a neural keyphrase…

    Submitted 6 November, 2019; originally announced November 2019.

    Journal ref: EMNLP-IJCNLP 2019