
Showing 1–18 of 18 results for author: Majumder, R

Searching in archive cs.
  1. MS MARCO Web Search: a Large-scale Information-rich Web Dataset with Millions of Real Click Labels

    Authors: Qi Chen, Xiubo Geng, Corby Rosset, Carolyn Buractaon, Jingwen Lu, Tao Shen, Kun Zhou, Chenyan Xiong, Yeyun Gong, Paul Bennett, Nick Craswell, Xing Xie, Fan Yang, Bryan Tower, Nikhil Rao, Anlei Dong, Wenqi Jiang, Zheng Liu, Mingqin Li, Chuanjie Liu, Zengzhong Li, Rangan Majumder, Jennifer Neville, Andy Oakley, Knut Magne Risvik, et al. (6 additional authors not shown)

    Abstract: Recent breakthroughs in large models have highlighted the critical significance of data scale, labels and modals. In this paper, we introduce MS MARCO Web Search, the first large-scale information-rich web dataset, featuring millions of real clicked query-document labels. This dataset closely mimics real-world web document and query distribution, provides rich information for various kinds of down…

    Submitted 13 May, 2024; originally announced May 2024.

    Comments: 10 pages, 6 figures, for associated dataset, see http://github.com/microsoft/MS-MARCO-Web-Search

  2. arXiv:2402.05672  [pdf, other]

    cs.CL cs.IR

    Multilingual E5 Text Embeddings: A Technical Report

    Authors: Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, Furu Wei

    Abstract: This technical report presents the training methodology and evaluation results of the open-source multilingual E5 text embedding models, released in mid-2023. Three embedding models of different sizes (small / base / large) are provided, offering a balance between the inference efficiency and embedding quality. The training procedure adheres to the English E5 model recipe, involving contrastive pr…

    Submitted 8 February, 2024; originally announced February 2024.

    Comments: 6 pages

  3. arXiv:2401.00368  [pdf, other]

    cs.CL cs.IR

    Improving Text Embeddings with Large Language Models

    Authors: Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, Furu Wei

    Abstract: In this paper, we introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps. Unlike existing methods that often depend on multi-stage intermediate pre-training with billions of weakly-supervised text pairs, followed by fine-tuning with a few labeled datasets, our method does not require building complex training pipelin…

    Submitted 31 May, 2024; v1 submitted 30 December, 2023; originally announced January 2024.

    Comments: Accepted by ACL 2024

  4. arXiv:2312.10879  [pdf]

    cs.LG cs.AI

    Development and Evaluation of Ensemble Learning-based Environmental Methane Detection and Intensity Prediction Models

    Authors: Reek Majumder, Jacquan Pollard, M Sabbir Salek, David Werth, Gurcan Comert, Adrian Gale, Sakib Mahmud Khan, Samuel Darko, Mashrur Chowdhury

    Abstract: The environmental impacts of global warming driven by methane (CH4) emissions have catalyzed significant research initiatives in developing novel technologies that enable proactive and rapid detection of CH4. Several data-driven machine learning (ML) models were tested to determine how well they identified fugitive CH4 and its related intensity in the affected areas. Various meteorological charact…

    Submitted 17 December, 2023; originally announced December 2023.

  5. arXiv:2310.14587  [pdf, other]

    cs.IR cs.CL

    Large Search Model: Redefining Search Stack in the Era of LLMs

    Authors: Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, Furu Wei

    Abstract: Modern search engines are built on a stack of different components, including query understanding, retrieval, multi-stage ranking, and question answering, among others. These components are often optimized and deployed independently. In this paper, we introduce a novel conceptual framework called large search model, which redefines the conventional search stack by unifying search tasks with one la…

    Submitted 2 January, 2024; v1 submitted 23 October, 2023; originally announced October 2023.

    Comments: SIGIR Forum, Vol. 57 No. 2 - December 2023

  6. arXiv:2309.01598  [pdf, ps, other]

    cs.SI

    Exploring Diverse Coping Mechanisms in 2023: A Comprehensive Survey Across Backgrounds and Cultures

    Authors: Abhijit Paul, Rony Majumder

    Abstract: This study presents a pioneering investigation into the wide array of coping mechanisms employed by individuals in the year 2023, with a focus on data collected through the popular social media platform TikTok. Coping mechanisms are essential strategies that people adopt to navigate the challenges and stressors of everyday life, yet little research has been conducted on their comprehensive compila…

    Submitted 4 September, 2023; originally announced September 2023.

  7. arXiv:2304.04487  [pdf, other]

    cs.CL cs.AI

    Inference with Reference: Lossless Acceleration of Large Language Models

    Authors: Nan Yang, Tao Ge, Liang Wang, Binxing Jiao, Daxin Jiang, Linjun Yang, Rangan Majumder, Furu Wei

    Abstract: We propose LLMA, an LLM accelerator to losslessly speed up Large Language Model (LLM) inference with references. LLMA is motivated by the observation that there are abundant identical text spans between the decoding result by an LLM and the reference that is available in many real world scenarios (e.g., retrieved documents). LLMA first selects a text span from the reference and copies its tokens t…

    Submitted 10 April, 2023; originally announced April 2023.

    Comments: 9 pages

  8. arXiv:2212.05225  [pdf, other]

    cs.IR cs.CL

    LEAD: Liberal Feature-based Distillation for Dense Retrieval

    Authors: Hao Sun, Xiao Liu, Yeyun Gong, Anlei Dong, Jingwen Lu, Yan Zhang, Linjun Yang, Rangan Majumder, Nan Duan

    Abstract: Knowledge distillation is often used to transfer knowledge from a strong teacher model to a relatively weak student model. Traditional methods include response-based methods and feature-based methods. Response-based methods are widely used but suffer from lower upper limits of performance due to their ignorance of intermediate signals, while feature-based methods have constraints on vocabularies,…

    Submitted 11 December, 2023; v1 submitted 10 December, 2022; originally announced December 2022.

    Comments: Accepted by WSDM 2024

  9. arXiv:2212.03533  [pdf, other]

    cs.CL cs.IR

    Text Embeddings by Weakly-Supervised Contrastive Pre-training

    Authors: Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, Furu Wei

    Abstract: This paper presents E5, a family of state-of-the-art text embeddings that transfer well to a wide range of tasks. The model is trained in a contrastive manner with weak supervision signals from our curated large-scale text pair dataset (called CCPairs). E5 can be readily used as a general-purpose embedding model for any tasks requiring a single-vector representation of texts such as retrieval, clu…

    Submitted 22 February, 2024; v1 submitted 7 December, 2022; originally announced December 2022.

    Comments: 17 pages, v2 fixes the SummEval numbers

  10. arXiv:2210.11773  [pdf, other]

    cs.CL cs.IR

    SimANS: Simple Ambiguous Negatives Sampling for Dense Text Retrieval

    Authors: Kun Zhou, Yeyun Gong, Xiao Liu, Wayne Xin Zhao, Yelong Shen, Anlei Dong, Jingwen Lu, Rangan Majumder, Ji-Rong Wen, Nan Duan, Weizhu Chen

    Abstract: Sampling proper negatives from a large document pool is vital to effectively train a dense retrieval model. However, existing negative sampling strategies suffer from the uninformative or false negative problem. In this work, we empirically show that according to the measured relevance scores, the negatives ranked around the positives are generally more informative and less likely to be false nega…

    Submitted 24 October, 2022; v1 submitted 21 October, 2022; originally announced October 2022.

    Comments: 12 pages, accepted by EMNLP 2022

  11. arXiv:2209.13335  [pdf, other]

    cs.IR cs.CL

    PROD: Progressive Distillation for Dense Retrieval

    Authors: Zhenghao Lin, Yeyun Gong, Xiao Liu, Hang Zhang, Chen Lin, Anlei Dong, Jian Jiao, Jingwen Lu, Daxin Jiang, Rangan Majumder, Nan Duan

    Abstract: Knowledge distillation is an effective way to transfer knowledge from a strong teacher to an efficient student model. Ideally, we expect the better the teacher is, the better the student. However, this expectation does not always come true. It is common that a better teacher model results in a bad student via distillation due to the nonnegligible gap between teacher and student. To bridge the gap,…

    Submitted 24 June, 2023; v1 submitted 27 September, 2022; originally announced September 2022.

    Comments: Accepted by WWW2023

  12. arXiv:2207.02578  [pdf, other]

    cs.IR

    SimLM: Pre-training with Representation Bottleneck for Dense Passage Retrieval

    Authors: Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, Furu Wei

    Abstract: In this paper, we propose SimLM (Similarity matching with Language Model pre-training), a simple yet effective pre-training method for dense passage retrieval. It employs a simple bottleneck architecture that learns to compress the passage information into a dense vector through self-supervised pre-training. We use a replaced language modeling objective, which is inspired by ELECTRA, to improve th…

    Submitted 12 May, 2023; v1 submitted 6 July, 2022; originally announced July 2022.

    Comments: Accepted to ACL 2023

  13. arXiv:2203.08221  [pdf, other]

    eess.SY cs.AI

    Development of Decision Support System for Effective COVID-19 Management

    Authors: Shuvrangshu Jana, Rudrashis Majumder, Aashay Bhise, Nobin Paul, Stuti Garg, Debasish Ghose

    Abstract: This paper discusses a Decision Support System (DSS) for cases prediction, allocation of resources, and lockdown management for managing COVID-19 at different levels of a government authority. Algorithms incorporated in the DSS are based on a data-driven modeling approach and independent of physical parameters of the region, and hence the proposed DSS is applicable to any area. Based on predicted…

    Submitted 12 March, 2022; originally announced March 2022.

    Comments: 5th World Congress on Disaster Management, IIT Delhi, New Delhi, India

  14. arXiv:2112.01921  [pdf]

    cs.LG eess.SY physics.data-an physics.ins-det

    In situ process quality monitoring and defect detection for direct metal laser melting

    Authors: Sarah Felix, Saikat Ray Majumder, H. Kirk Mathews, Michael Lexa, Gabriel Lipsa, Xiaohu Ping, Subhrajit Roychowdhury, Thomas Spears

    Abstract: Quality control and quality assurance are challenges in Direct Metal Laser Melting (DMLM). Intermittent machine diagnostics and downstream part inspections catch problems after undue cost has been incurred processing defective parts. In this paper we demonstrate two methodologies for in-process fault detection and part quality prediction that can be readily deployed on existing commercial DMLM sys…

    Submitted 3 December, 2021; originally announced December 2021.

    Comments: 16 pages, 4 figures

    Journal ref: Sci Rep 12, 8503 (2022)

  15. arXiv:2112.01439  [pdf, other]

    cs.GT

    Game-Theoretic Model Based Resource Allocation During Floods

    Authors: Rudrashis Majumder, Rakesh R Warier, Debasish Ghose

    Abstract: For multiple emergencies caused by natural disasters, it is crucial to allocate resources equitably to each emergency location, especially when the availability of resources is limited in quantity. This paper has developed a multi-event crisis management system using a non-cooperative, complete information, strategic form game model. In the proposed system, each emergency event is assumed to occur…

    Submitted 2 December, 2021; originally announced December 2021.

    Comments: Presented in 4th World Congress on Disaster Management, IIT Bombay, Mumbai, India, 29 Jan-01 Feb 2019 (11 pages)

  16. arXiv:2108.01125  [pdf]

    quant-ph cs.CR cs.LG

    Hybrid Classical-Quantum Deep Learning Models for Autonomous Vehicle Traffic Image Classification Under Adversarial Attack

    Authors: Reek Majumder, Sakib Mahmud Khan, Fahim Ahmed, Zadid Khan, Frank Ngeni, Gurcan Comert, Judith Mwakalonge, Dimitra Michalaka, Mashrur Chowdhury

    Abstract: Image classification must work for autonomous vehicles (AV) operating on public roads, and actions performed based on image misclassification can have serious consequences. Traffic sign images can be misclassified by an adversarial attack on machine learning models used by AVs for traffic sign recognition. To make classification models resilient against adversarial attacks, we used a hybrid deep-l…

    Submitted 2 August, 2021; originally announced August 2021.

    Comments: 16 pages, 7 figures

  17. arXiv:2004.01401  [pdf, ps, other]

    cs.CL

    XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation

    Authors: Yaobo Liang, Nan Duan, Yeyun Gong, Ning Wu, Fenfei Guo, Weizhen Qi, Ming Gong, Linjun Shou, Daxin Jiang, Guihong Cao, Xiaodong Fan, Ruofei Zhang, Rahul Agrawal, Edward Cui, Sining Wei, Taroon Bharti, Ying Qiao, Jiun-Hung Chen, Winnie Wu, Shuguang Liu, Fan Yang, Daniel Campos, Rangan Majumder, Ming Zhou

    Abstract: In this paper, we introduce XGLUE, a new benchmark dataset that can be used to train large-scale cross-lingual pre-trained models using multilingual and bilingual corpora and evaluate their performance across a diverse set of cross-lingual tasks. Comparing to GLUE(Wang et al., 2019), which is labeled in English for natural language understanding tasks only, XGLUE has two main advantages: (1) it pr…

    Submitted 22 May, 2020; v1 submitted 3 April, 2020; originally announced April 2020.

  18. arXiv:1611.09268  [pdf, other]

    cs.CL cs.IR

    MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

    Authors: Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang

    Abstract: We introduce a large scale MAchine Reading COmprehension dataset, which we name MS MARCO. The dataset comprises of 1,010,916 anonymized questions---sampled from Bing's search query logs---each with a human generated answer and 182,669 completely human rewritten generated answers. In addition, the dataset contains 8,841,823 passages---extracted from 3,563,535 web documents retrieved by Bing---that…

    Submitted 31 October, 2018; v1 submitted 28 November, 2016; originally announced November 2016.