-
DEXTER: A Benchmark for open-domain Complex Question Answering using LLMs
Authors:
Venktesh V,
Deepali Prabhu,
Avishek Anand
Abstract:
Open-domain complex Question Answering (QA) is a difficult task with challenges in evidence retrieval and reasoning. The complexity of such questions could stem from questions being compositional, hybrid evidence, or ambiguity in questions. While retrieval performance for classical QA tasks is well explored, the capabilities of retrieval models for heterogeneous complex retrieval tasks, especially in an open-domain setting, and the impact on downstream QA performance, are relatively unexplored. To address this, in this work, we propose a benchmark comprising diverse complex QA tasks and provide a toolkit to evaluate state-of-the-art pre-trained dense and sparse retrieval models in an open-domain setting. We observe that late interaction models and, surprisingly, lexical models like BM25 perform well compared to other pre-trained dense retrieval models. In addition, since context-based reasoning is critical for solving complex QA tasks, we also evaluate the reasoning capabilities of LLMs and the impact of retrieval performance on their reasoning capabilities. Through experiments, we observe that much progress is to be made in retrieval for complex QA to improve downstream QA performance. Our software and related data can be accessed at https://github.com/VenkteshV/DEXTER
Submitted 24 June, 2024;
originally announced June 2024.
-
Keystroke Dynamics Against Academic Dishonesty in the Age of LLMs
Authors:
Debnath Kundu,
Atharva Mehta,
Rajesh Kumar,
Naman Lal,
Avinash Anand,
Apoorv Singh,
Rajiv Ratn Shah
Abstract:
The transition to online examinations and assignments raises significant concerns about academic integrity. Traditional plagiarism detection systems often struggle to identify instances of intelligent cheating, particularly when students utilize advanced generative AI tools to craft their responses. This study proposes a keystroke dynamics-based method to differentiate between bona fide and assisted writing within academic contexts. To facilitate this, a dataset was developed to capture the keystroke patterns of individuals engaged in writing tasks, both with and without the assistance of generative AI. The detector, trained using a modified TypeNet architecture, achieved accuracies ranging from 74.98% to 85.72% in condition-specific scenarios and from 52.24% to 80.54% in condition-agnostic scenarios. The findings highlight significant differences in keystroke dynamics between genuine and assisted writing. The outcomes of this study enhance our understanding of how users interact with generative AI and have implications for improving the reliability of digital educational platforms.
Submitted 21 June, 2024;
originally announced June 2024.
-
Advances in perovskite nanocrystals and nanocomposites for scintillation applications
Authors:
Abhinav Anand,
Matteo L. Zaffalon,
Andrea Erroi,
Francesca Cova,
Francesco Carulli,
Sergio Brovelli
Abstract:
In recent years, the field of radiation detection has witnessed a paradigm shift with the emergence of plastic scintillators incorporating perovskite nanocrystals (PNCs). This innovative class of scintillators not only capitalizes on the superior luminescent properties of PNCs but also harnesses the flexibility and processability of polymers. This review explores the intricate landscape of synthesizing and fabricating scintillating PNCs and nanocomposites, delving into the methods employed in their production. From solution-based methods to innovative solid-state approaches, the synthesis of PNCs for scintillator applications is explored comprehensively. Furthermore, embedding strategies within polymeric matrices are scrutinized, shedding light on the various techniques utilized to achieve optimal dispersion and compatibility. Finally, the evaluation of the resulting nanocomposites is discussed, with a particular emphasis on their scintillating performance and radiation hardness. Through a meticulous exploration of synthesis methodologies, embedding techniques, and performance assessments, this review aims to provide a multilayered understanding of the state-of-the-art in PNC-based nanoscintillators.
Submitted 19 June, 2024;
originally announced June 2024.
-
InterIntent: Investigating Social Intelligence of LLMs via Intention Understanding in an Interactive Game Context
Authors:
Ziyi Liu,
Abhishek Anand,
Pei Zhou,
Jen-tse Huang,
Jieyu Zhao
Abstract:
Large language models (LLMs) have demonstrated the potential to mimic human social intelligence. However, most studies focus on simplistic and static self-report or performance-based tests, which limits the depth and validity of the analysis. In this paper, we developed a novel framework, InterIntent, to assess LLMs' social intelligence by mapping their ability to understand and manage intentions in a game setting. We focus on four dimensions of social intelligence: situational awareness, self-regulation, self-awareness, and theory of mind. Each dimension is linked to a specific game task: intention selection, intention following, intention summarization, and intention guessing. Our findings indicate that while LLMs exhibit high proficiency in selecting intentions, achieving an accuracy of 88%, their ability to infer the intentions of others is significantly weaker, trailing human performance by 20%. Additionally, game performance correlates with intention understanding, highlighting the importance of the four components towards success in this game. These findings underline the crucial role of intention understanding in evaluating LLMs' social intelligence and highlight the potential of using social deduction games as a complex testbed to enhance LLM evaluation. InterIntent contributes a structured approach to bridging the evaluation gap in social intelligence within multiplayer games.
Submitted 17 June, 2024;
originally announced June 2024.
-
A Critical Study of What Code-LLMs (Do Not) Learn
Authors:
Abhinav Anand,
Shweta Verma,
Krishna Narasimhan,
Mira Mezini
Abstract:
Large Language Models trained on code corpora (code-LLMs) have demonstrated impressive performance in various coding assistance tasks. However, despite their increased size and training dataset, code-LLMs still have limitations such as suggesting code with syntactic errors, variable misuse, etc. Some studies argue that code-LLMs perform well on coding tasks because they use self-attention and hidden representations to encode relations among input tokens. However, previous works have not studied what code properties are not encoded by code-LLMs. In this paper, we conduct a fine-grained analysis of attention maps and hidden representations of code-LLMs. Our study indicates that code-LLMs only encode relations among specific subsets of input tokens. Specifically, by categorizing input tokens into syntactic tokens and identifiers, we found that models encode relations among syntactic tokens and among identifiers, but they fail to encode relations between syntactic tokens and identifiers. We also found that fine-tuned models encode these relations poorly compared to their pre-trained counterparts. Additionally, larger models with billions of parameters encode significantly less information about code than models with only a few hundred million parameters.
Submitted 17 June, 2024;
originally announced June 2024.
-
ReMI: A Dataset for Reasoning with Multiple Images
Authors:
Mehran Kazemi,
Nishanth Dikkala,
Ankit Anand,
Petar Devic,
Ishita Dasgupta,
Fangyu Liu,
Bahare Fatemi,
Pranjal Awasthi,
Dee Guo,
Sreenivas Gollapudi,
Ahmed Qureshi
Abstract:
With the continuous advancement of large language models (LLMs), it is essential to create new benchmarks to effectively evaluate their expanding capabilities and identify areas for improvement. This work focuses on multi-image reasoning, an emerging capability in state-of-the-art LLMs. We introduce ReMI, a dataset designed to assess LLMs' ability to Reason with Multiple Images. This dataset encompasses a diverse range of tasks, spanning various reasoning domains such as math, physics, logic, code, table/chart understanding, and spatial and temporal reasoning. It also covers a broad spectrum of characteristics found in multi-image reasoning scenarios. We have benchmarked several cutting-edge LLMs using ReMI and found a substantial gap between their performance and human-level proficiency. This highlights the challenges in multi-image reasoning and the need for further research. Our analysis also reveals the strengths and weaknesses of different models, shedding light on the types of reasoning that are currently attainable and areas where future models require improvement. To foster further research in this area, we are releasing ReMI publicly: https://huggingface.co/datasets/mehrankazemi/ReMI.
Submitted 13 June, 2024;
originally announced June 2024.
-
End-to-End Argument Mining as Augmented Natural Language Generation
Authors:
Nilmadhab Das,
Vishal Choudhary,
V. Vijaya Saradhi,
Ashish Anand
Abstract:
Argument Mining (AM) is a crucial aspect of computational argumentation, which deals with the identification and extraction of Argumentative Components (ACs) and their corresponding Argumentative Relations (ARs). Most prior works have solved these problems by dividing them into multiple subtasks, and the available end-to-end setups are mostly based on the dependency parsing approach. This work proposes a unified end-to-end framework based on a generative paradigm, in which the argumentative structures are framed into label-augmented text, called Augmented Natural Language (ANL). Additionally, we explore the role of different types of markers in solving AM tasks. Through different marker-based fine-tuning strategies, we present an extensive study by integrating marker knowledge into our generative model. The proposed framework achieves results competitive with the state-of-the-art (SoTA) model and outperforms several baselines.
Submitted 12 June, 2024;
originally announced June 2024.
-
The LiteBIRD mission to explore cosmic inflation
Authors:
T. Ghigna,
A. Adler,
K. Aizawa,
H. Akamatsu,
R. Akizawa,
E. Allys,
A. Anand,
J. Aumont,
J. Austermann,
S. Azzoni,
C. Baccigalupi,
M. Ballardini,
A. J. Banday,
R. B. Barreiro,
N. Bartolo,
S. Basak,
A. Basyrov,
S. Beckman,
M. Bersanelli,
M. Bortolami,
F. Bouchet,
T. Brinckmann,
P. Campeti,
E. Carinos,
A. Carones
, et al. (134 additional authors not shown)
Abstract:
LiteBIRD, the next-generation cosmic microwave background (CMB) experiment, aims for a launch in Japan's fiscal year 2032, marking a major advancement in the exploration of primordial cosmology and fundamental physics. Orbiting the Sun-Earth Lagrangian point L2, this JAXA-led strategic L-class mission will conduct a comprehensive mapping of the CMB polarization across the entire sky. During its 3-year mission, LiteBIRD will employ three telescopes within 15 unique frequency bands (ranging from 34 through 448 GHz), targeting a sensitivity of 2.2\,$μ$K-arcmin and a resolution of 0.5$^\circ$ at 100\,GHz. Its primary goal is to measure the tensor-to-scalar ratio $r$ with an uncertainty $δr = 0.001$, including systematic errors and margin. If $r \geq 0.01$, LiteBIRD expects to achieve a $>5σ$ detection in the $\ell=$2-10 and $\ell=$11-200 ranges separately, providing crucial insight into the early Universe. We describe LiteBIRD's scientific objectives, the application of systems engineering to mission requirements, the anticipated scientific impact, and the operations and scanning strategies vital to minimizing systematic effects. We will also highlight LiteBIRD's synergies with concurrent CMB projects.
Submitted 4 June, 2024;
originally announced June 2024.
-
Archetype-Based Redshift Estimation for the Dark Energy Spectroscopic Instrument Survey
Authors:
Abhijeet Anand,
Julien Guy,
Stephen Bailey,
John Moustakas,
J. Aguilar,
S. Ahlen,
A. Bolton,
A. Brodzeller,
D. Brooks,
T. Claybaugh,
S. Cole,
B. Dey,
K. Fanning,
J. Forero-Romero,
E. Gaztañaga,
S. Gontcho A Gontcho,
L. Le Guillou,
G. Gutierrez,
K. Honscheid,
C. Howlett,
S. Juneau,
D. Kirkby,
T. Kisner,
A. Kremin,
A. Lambert
, et al. (24 additional authors not shown)
Abstract:
We present a computationally efficient galaxy archetype-based redshift estimation and spectral classification method for the Dark Energy Spectroscopic Instrument (DESI) survey. The DESI survey currently relies on a redshift fitter and spectral classifier using a linear combination of PCA-derived templates, which is very efficient in processing large volumes of DESI spectra within a short time frame. However, this method occasionally yields unphysical model fits for galaxies and fails to adequately absorb calibration errors that may still be occasionally visible in the reduced spectra. Our proposed approach improves upon this existing method by refitting the spectra with carefully generated physical galaxy archetypes combined with additional terms designed to absorb data reduction defects and provide more physical models to the DESI spectra. We test our method on an extensive dataset derived from the survey validation (SV) and Year 1 (Y1) data of DESI. Our findings indicate that the new method delivers marginally better redshift success for SV tiles while reducing catastrophic redshift failure by $10-30\%$. At the same time, results from millions of targets from the main survey show that our model has relatively higher redshift success and purity rates ($0.5-0.8\%$ higher) for galaxy targets while having similar success for QSOs. These improvements also demonstrate that the main DESI redshift pipeline is generally robust. Additionally, it reduces the false positive redshift estimation by $5-40\%$ for sky fibers. We also discuss the generic nature of our method and how it can be extended to other large spectroscopic surveys, along with possible future improvements.
Submitted 29 May, 2024;
originally announced May 2024.
-
Data-Driven Predictive Control and MPC: Do we achieve optimality?
Authors:
Akhil S Anand,
Shambhuraj Sawant,
Dirk Reinhardt,
Sebastien Gros
Abstract:
In this paper, we explore the interplay between Predictive Control and closed-loop optimality, spanning from Model Predictive Control to Data-Driven Predictive Control. Predictive Control in general relies on some form of prediction scheme on the real system trajectories. However, these predictions may not accurately capture the real system dynamics, e.g., due to stochasticity, resulting in sub-optimal control policies. This lack of optimality is a critical issue for problems with economic objectives. We address this by providing sufficient conditions on the underlying prediction scheme such that a Predictive Controller can achieve closed-loop optimality. However, these conditions do not readily extend to Data-Driven Predictive Control. In this context of closed-loop optimality, we conclude that the factor distinguishing the approaches within Data-Driven Predictive Control is whether they can be cast as a sequential decision-making process, rather than the dichotomy of model-based vs. model-free. Furthermore, we show that the conventional approach of improving the prediction accuracy from data may not guarantee optimality.
Submitted 28 May, 2024;
originally announced May 2024.
-
The Construction of Large-scale Structure Catalogs for the Dark Energy Spectroscopic Instrument
Authors:
A. J. Ross,
J. Aguilar,
S. Ahlen,
S. Alam,
A. Anand,
S. Bailey,
D. Bianchi,
S. Brieden,
D. Brooks,
E. Burtin,
A. Carnero Rosell,
E. Chaussidon,
T. Claybaugh,
S. Cole,
K. Dawson,
A. de la Macorra,
A. de Mattia,
Arjun Dey,
Biprateep Dey,
P. Doel,
K. Fanning,
S. Ferraro,
J. Ereza,
A. Font-Ribera,
J. E. Forero-Romero
, et al. (59 additional authors not shown)
Abstract:
We present the technical details on how large-scale structure (LSS) catalogs are constructed from redshifts measured from spectra observed by the Dark Energy Spectroscopic Instrument (DESI). The LSS catalogs provide the information needed to determine the relative number density of DESI tracers as a function of redshift and celestial coordinates and, e.g., determine clustering statistics. We produce catalogs that are weighted subsamples of the observed data, each matched to a weighted `random' catalog that forms an unclustered sampling of the probability density that DESI could have observed those data at each location.
Precise knowledge of the DESI observing history and associated hardware performance allows for a determination of the DESI footprint and the number of times DESI has covered it at sub-arcsecond level precision. This enables the completeness of any DESI sample to be modeled at this same resolution. The pipeline developed to create LSS catalogs has been designed to easily allow robustness tests and enable future improvements. We describe how it allows ongoing work improving the match between galaxy and random catalogs, such as including further information when assigning redshifts to randoms, accounting for fluctuations in target density, accounting for variation in the redshift success rate, and accommodating blinding schemes.
Submitted 26 May, 2024;
originally announced May 2024.
-
Model-free reinforcement learning with noisy actions for automated experimental control in optics
Authors:
Lea Richtmann,
Viktoria-S. Schmiesing,
Dennis Wilken,
Jan Heine,
Aaron Tranter,
Avishek Anand,
Tobias J. Osborne,
Michèle Heurs
Abstract:
Experimental control involves a lot of manual effort with non-trivial decisions for precise adjustments. Here, we study the automatic experimental alignment for coupling laser light into an optical fiber using reinforcement learning (RL). We face several real-world challenges, such as time-consuming training, partial observability, and noisy actions due to imprecision in the mirror steering motors. We show that we can overcome these challenges: To save time, we use a virtual testbed to tune our environment for dealing with partial observability and use relatively sample-efficient model-free RL algorithms like Soft Actor-Critic (SAC) or Truncated Quantile Critics (TQC). Furthermore, by fully training on the experiment, the agent learns directly to handle the noise present. In our extensive experimentation, we show that we are able to achieve 90% coupling, showcasing the effectiveness of our proposed approaches. We reach this efficiency, which is comparable to that of a human expert, without additional feedback loops despite the motors' inaccuracies. Our result is an example of the readiness of RL for real-world tasks. We consider RL a promising tool for reducing the workload in labs.
Submitted 24 May, 2024;
originally announced May 2024.
-
Verifying Unboundedness via Amalgamation
Authors:
Ashwani Anand,
Sylvain Schmitz,
Lia Schütze,
Georg Zetzsche
Abstract:
Well-structured transition systems (WSTS) are an abstract family of systems that encompasses a vast landscape of infinite-state systems. By requiring a well-quasi-ordering (wqo) on the set of states, a WSTS enables generic algorithms for classic verification tasks such as coverability and termination. However, even for systems that are WSTS like vector addition systems (VAS), the framework is notoriously ill-equipped to analyse reachability (as opposed to coverability). Moreover, some important types of infinite-state systems fall out of WSTS' scope entirely, such as pushdown systems (PDS).
Inspired by recent algorithmic techniques on VAS, we propose an abstract notion of systems where the set of runs is equipped with a wqo and supports amalgamation of runs. We show that it subsumes a large class of infinite-state systems, including (reachability languages of) VAS and PDS, and even all systems from the abstract framework of valence systems, except for those already known to be Turing-complete.
Moreover, this abstract setting enables simple and general algorithmic solutions to unboundedness problems, which have received much attention in recent years. We present algorithms for the (i) simultaneous unboundedness problem (which implies computability of downward closures and decidability of separability by piecewise testable languages), (ii) computing priority downward closures, (iii) deciding whether a language is bounded, meaning included in $w_1^*\cdots w_k^*$ for some words $w_1,\ldots,w_k$, and (iv) effective regularity of unary languages. This leads to either drastically simpler proofs or new decidability results for a rich variety of systems.
Submitted 20 June, 2024; v1 submitted 16 May, 2024;
originally announced May 2024.
-
Probing the impact of radio-mode feedback on the properties of the cool circumgalactic medium
Authors:
Yu-Ling Chang,
Ting-Wen Lan,
J. Xavier Prochaska,
Lucas Napolitano,
Abhijeet Anand,
J. Aguilar,
S. Ahlen,
D. Brooks,
T. Claybaugh,
A. de la Macorra,
Arjun Dey,
P. Doel,
S. Gontcho A Gontcho,
J. Guy,
S. Juneau,
T. Kisner,
A. Lambert,
M. Landriau,
L. Le Guillou,
M. Manera,
P. Martini,
A. Meisner,
R. Miquel,
J. Moustakas,
A. D. Myers
, et al. (11 additional authors not shown)
Abstract:
We explore the influence of radio-mode feedback on the properties of the cool circumgalactic medium (CGM). To this end, we assemble a statistical sample of approximately 30,000 radio galaxies with background quasars by combining optical spectroscopic measurements of luminous red galaxies (LRGs) and quasars from the year 1 dataset of the Dark Energy Spectroscopic Instrument (DESI) and radio sources from the LOw-Frequency ARray Two-metre Sky Survey (LoTSS) DR2 catalog and the Very Large Array Sky Survey (VLASS) quick look catalog. Galaxies with similar optical properties but with no radio counterparts in LoTSS and VLASS are selected as the control group. We measure the cool CGM properties of radio galaxies and their control samples traced by MgII absorption lines, including covering fraction, rest equivalent width, and gas kinematics. Our results show no significant difference in the properties of gas around radio galaxies and their control sample, indicating that the operating radio-mode feedback of massive galaxies does not produce detectable effects on the properties of the cool CGM. Finally, we show that the CGM of radio galaxies contains a non-negligible amount of cool gas, approximately $10^{10}$ solar masses. This abundance can place a stringent constraint on the radio-mode feedback models.
Submitted 14 May, 2024;
originally announced May 2024.
-
Is Interpretable Machine Learning Effective at Feature Selection for Neural Learning-to-Rank?
Authors:
Lijun Lyu,
Nirmal Roy,
Harrie Oosterhuis,
Avishek Anand
Abstract:
Neural ranking models have become increasingly popular for real-world search and recommendation systems in recent years. Unlike their tree-based counterparts, neural models are much less interpretable. That is, it is very difficult to understand their inner workings and answer questions like "how do they make their ranking decisions?" or "what document features do they find important?" This is particularly disadvantageous since interpretability is highly important for real-world systems. In this work, we explore feature selection for neural learning-to-rank (LTR). In particular, we investigate six widely-used methods from the field of interpretable machine learning (ML) and introduce our own modification, to select the input features that are most important to the ranking behavior. To understand whether these methods are useful for practitioners, we further study whether they contribute to efficiency enhancement. Our experimental results reveal a large feature redundancy in several LTR benchmarks: the local selection method TabNet can achieve optimal ranking performance with fewer than 10 features; the global methods, particularly our G-L2X, require slightly more selected features, but exhibit higher potential in improving efficiency. We hope that our analysis of these feature selection methods will bring the fields of interpretable ML and LTR closer together.
Submitted 13 May, 2024;
originally announced May 2024.
-
Context-Enhanced Language Models for Generating Multi-Paper Citations
Authors:
Avinash Anand,
Kritarth Prasad,
Ujjwal Goel,
Mohit Gupta,
Naman Lal,
Astha Verma,
Rajiv Ratn Shah
Abstract:
Citation text plays a pivotal role in elucidating the connection between scientific documents, demanding an in-depth comprehension of the cited paper. Constructing citations is often time-consuming, requiring researchers to delve into extensive literature and grapple with articulating relevant content. To address this challenge, the field of citation text generation (CTG) has emerged. However, while earlier methods have primarily centered on creating single-sentence citations, practical scenarios frequently necessitate citing multiple papers within a single paragraph. To bridge this gap, we propose a method that leverages Large Language Models (LLMs) to generate multi-citation sentences. Our approach involves a single source paper and a collection of target papers, culminating in a coherent paragraph containing multi-sentence citation text. Furthermore, we introduce a curated dataset named MCG-S2ORC, composed of English-language academic research papers in Computer Science, showcasing multiple citation instances. In our experiments, we evaluate three LLMs (LLaMA, Alpaca, and Vicuna) to ascertain the most effective model for this endeavor. Additionally, we exhibit enhanced performance by integrating knowledge graphs from target papers into the prompts for generating citation text. This research underscores the potential of harnessing LLMs for citation generation, opening a compelling avenue for exploring the intricate connections between scientific documents.
Submitted 22 April, 2024;
originally announced April 2024.
-
Mathify: Evaluating Large Language Models on Mathematical Problem Solving Tasks
Authors:
Avinash Anand,
Mohit Gupta,
Kritarth Prasad,
Navya Singla,
Sanjana Sanjeev,
Jatin Kumar,
Adarsh Raj Shivam,
Rajiv Ratn Shah
Abstract:
The rapid progress in the field of natural language processing (NLP) systems and the expansion of large language models (LLMs) have opened up numerous opportunities in the field of education and instructional methods. These advancements offer the potential for tailored learning experiences and immediate feedback, all delivered through accessible and cost-effective services. One notable application area for this technological advancement is in the realm of solving mathematical problems. Mathematical problem-solving not only requires the ability to decipher complex problem statements but also the skill to perform precise arithmetic calculations at each step of the problem-solving process. However, the evaluation of the arithmetic capabilities of large language models remains an area that has received relatively little attention. In response, we introduce an extensive mathematics dataset called "MathQuest" sourced from the 11th and 12th standard Mathematics NCERT textbooks. This dataset encompasses mathematical challenges of varying complexity and covers a wide range of mathematical concepts. Utilizing this dataset, we conduct fine-tuning experiments with three prominent LLMs: LLaMA-2, WizardMath, and MAmmoTH. These fine-tuned models serve as benchmarks for evaluating their performance on our dataset. Our experiments reveal that among the three models, MAmmoTH-13B emerges as the most proficient, achieving the highest level of competence in solving the presented mathematical problems. Consequently, MAmmoTH-13B establishes itself as a robust and dependable benchmark for addressing NCERT mathematics problems.
Submitted 19 April, 2024;
originally announced April 2024.
-
MM-PhyRLHF: Reinforcement Learning Framework for Multimodal Physics Question-Answering
Authors:
Avinash Anand,
Janak Kapuriya,
Chhavi Kirtani,
Apoorv Singh,
Jay Saraf,
Naman Lal,
Jatin Kumar,
Adarsh Raj Shivam,
Astha Verma,
Rajiv Ratn Shah,
Roger Zimmermann
Abstract:
Recent advancements in LLMs have shown their significant potential in tasks like text summarization and generation. Yet, they often encounter difficulty while solving complex physics problems that require arithmetic calculation and a good understanding of concepts. Moreover, many physics problems include images that contain important details required to understand the problem's context. We propose an LMM-based chatbot to answer multimodal physics MCQs. For domain adaptation, we utilize the MM-PhyQA dataset comprising Indian high school-level multimodal physics problems. To improve the LMM's performance, we experiment with two techniques, RLHF (Reinforcement Learning from Human Feedback) and Image Captioning. In image captioning, we add a detailed explanation of the diagram in each image, minimizing hallucinations and image processing errors. We further explore the integration of Reinforcement Learning from Human Feedback (RLHF) methodology inspired by the ranking approach in RLHF to enhance the human-like problem-solving abilities of the models. The RLHF approach incorporates human feedback into the learning process of LLMs, improving the model's problem-solving skills, truthfulness, and reasoning capabilities, minimizing the hallucinations in the answers, and improving the quality instead of using vanilla-supervised fine-tuned models. We employ the LLaVA open-source model to answer multimodal physics MCQs and compare the performance with and without using RLHF.
Submitted 19 April, 2024;
originally announced April 2024.
-
Many-Shot In-Context Learning
Authors:
Rishabh Agarwal,
Avi Singh,
Lei M. Zhang,
Bernd Bohnet,
Luis Rosias,
Stephanie Chan,
Biao Zhang,
Ankesh Anand,
Zaheer Abbas,
Azade Nova,
John D. Co-Reyes,
Eric Chu,
Feryal Behbahani,
Aleksandra Faust,
Hugo Larochelle
Abstract:
Large language models (LLMs) excel at few-shot in-context learning (ICL) -- learning from a few examples provided in context at inference, without any weight updates. Newly expanded context windows allow us to investigate ICL with hundreds or thousands of examples -- the many-shot regime. Going from few-shot to many-shot, we observe significant performance gains across a wide variety of generative and discriminative tasks. While promising, many-shot ICL can be bottlenecked by the available amount of human-generated examples. To mitigate this limitation, we explore two new settings: Reinforced and Unsupervised ICL. Reinforced ICL uses model-generated chain-of-thought rationales in place of human examples. Unsupervised ICL removes rationales from the prompt altogether, and prompts the model only with domain-specific questions. We find that both Reinforced and Unsupervised ICL can be quite effective in the many-shot regime, particularly on complex reasoning tasks. Finally, we demonstrate that, unlike few-shot learning, many-shot learning is effective at overriding pretraining biases, can learn high-dimensional functions with numerical inputs, and performs comparably to fine-tuning. Our analysis also reveals the limitations of next-token prediction loss as an indicator of downstream ICL performance.
Submitted 22 May, 2024; v1 submitted 16 April, 2024;
originally announced April 2024.
-
TC-OCR: TableCraft OCR for Efficient Detection & Recognition of Table Structure & Content
Authors:
Avinash Anand,
Raj Jaiswal,
Pijush Bhuyan,
Mohit Gupta,
Siddhesh Bangar,
Md. Modassir Imam,
Rajiv Ratn Shah,
Shin'ichi Satoh
Abstract:
The automatic recognition of tabular data in document images presents a significant challenge due to the diverse range of table styles and complex structures. Tables offer valuable content representation, enhancing the predictive capabilities of various systems such as search engines and Knowledge Graphs. Addressing the two main problems, namely table detection (TD) and table structure recognition (TSR), has traditionally been approached independently. In this research, we propose an end-to-end pipeline that integrates deep learning models, including DETR, CascadeTabNet, and PP OCR v2, to achieve comprehensive image-based table recognition. This integrated approach effectively handles diverse table styles, complex structures, and image distortions, resulting in improved accuracy and efficiency compared to existing methods like Table Transformers. Our system achieves simultaneous table detection (TD), table structure recognition (TSR), and table content recognition (TCR), preserving table structures and accurately extracting tabular data from document images. The integration of multiple models addresses the intricacies of table recognition, making our approach a promising solution for image-based table understanding, data extraction, and information retrieval applications. Our proposed approach achieves an IOU of 0.96 and an OCR Accuracy of 78%, showcasing a remarkable improvement of approximately 25% in the OCR Accuracy compared to the previous Table Transformer approach.
Submitted 19 April, 2024; v1 submitted 16 April, 2024;
originally announced April 2024.
-
KG-CTG: Citation Generation through Knowledge Graph-guided Large Language Models
Authors:
Avinash Anand,
Mohit Gupta,
Kritarth Prasad,
Ujjwal Goel,
Naman Lal,
Astha Verma,
Rajiv Ratn Shah
Abstract:
Citation Text Generation (CTG) is a task in natural language processing (NLP) that aims to produce text that accurately cites or references a cited document within a source document. In CTG, the generated text draws upon contextual cues from both the source document and the cited paper, ensuring accurate and relevant citation information is provided. Previous work in the field of citation generation is mainly based on the text summarization of documents. Following this, this paper presents a framework and a comparative study to demonstrate the use of Large Language Models (LLMs) for the task of citation generation. Also, we have shown the improvement in the results of citation generation by incorporating the knowledge graph relations of the papers in the prompt for the LLM to better learn the relationship between the papers. To assess how well our model is performing, we have used a subset of the standard S2ORC dataset, which consists only of computer science academic research papers in the English language. Vicuna performs best for this task with 14.15 Meteor, 12.88 Rouge-1, 1.52 Rouge-2, and 10.94 Rouge-L. Also, Alpaca performs best, and improves the performance by 36.98% in Rouge-1, and 33.14% in Meteor by including knowledge graphs.
Submitted 15 April, 2024;
originally announced April 2024.
-
RanLayNet: A Dataset for Document Layout Detection used for Domain Adaptation and Generalization
Authors:
Avinash Anand,
Raj Jaiswal,
Mohit Gupta,
Siddhesh S Bangar,
Pijush Bhuyan,
Naman Lal,
Rajeev Singh,
Ritika Jha,
Rajiv Ratn Shah,
Shin'ichi Satoh
Abstract:
Large ground-truth datasets and recent advances in deep learning techniques have been useful for layout detection. However, because of the restricted layout diversity of these datasets, training on them requires a sizable number of annotated instances, which is both expensive and time-consuming. As a result, differences between the source and target domains may significantly impact how well these models function. To solve this problem, domain adaptation approaches have been developed that use a small quantity of labeled data to adjust the model to the target domain. In this research, we introduced a synthetic document dataset called RanLayNet, enriched with automatically assigned labels denoting spatial positions, ranges, and types of layout elements. The primary aim of this endeavor is to develop a versatile dataset capable of training models with robustness and adaptability to diverse document formats. Through empirical experimentation, we demonstrate that a deep layout identification model trained on our dataset exhibits enhanced performance compared to a model trained solely on actual documents. Moreover, we conduct a comparative analysis by fine-tuning inference models using both PubLayNet and IIIT-AR-13K datasets on the Doclaynet dataset. Our findings emphasize that models enriched with our dataset are optimal for tasks such as achieving 0.398 and 0.588 mAP95 score in the scientific document domain for the TABLE class.
Submitted 19 April, 2024; v1 submitted 15 April, 2024;
originally announced April 2024.
-
MM-PhyQA: Multimodal Physics Question-Answering With Multi-Image CoT Prompting
Authors:
Avinash Anand,
Janak Kapuriya,
Apoorv Singh,
Jay Saraf,
Naman Lal,
Astha Verma,
Rushali Gupta,
Rajiv Shah
Abstract:
While Large Language Models (LLMs) can achieve human-level performance in various tasks, they continue to face challenges when it comes to effectively tackling multi-step physics reasoning tasks. To identify the shortcomings of existing models and facilitate further research in this area, we curated a novel dataset, MM-PhyQA, which comprises well-constructed, high school-level multimodal physics problems. By evaluating the performance of contemporary LLMs that are publicly available, both with and without the incorporation of multimodal elements in these problems, we aim to shed light on their capabilities. For generating answers for questions consisting of multimodal input (in this case, images and text) we employed Zero-shot prediction using GPT-4 and utilized LLaVA (LLaVA and LLaVA-1.5), the latter of which were fine-tuned on our dataset. For evaluating the performance of LLMs consisting solely of textual input, we tested the performance of the base and fine-tuned versions of the Mistral-7B and LLaMA2-7b models. We also showcased the performance of the novel Multi-Image Chain-of-Thought (MI-CoT) Prompting technique, which when used to train LLaVA-1.5 13b yielded the best results when tested on our dataset, with superior scores in most metrics and the highest accuracy of 71.65% on the test set.
Submitted 11 April, 2024;
originally announced April 2024.
-
DESI 2024 VI: Cosmological Constraints from the Measurements of Baryon Acoustic Oscillations
Authors:
DESI Collaboration,
A. G. Adame,
J. Aguilar,
S. Ahlen,
S. Alam,
D. M. Alexander,
M. Alvarez,
O. Alves,
A. Anand,
U. Andrade,
E. Armengaud,
S. Avila,
A. Aviles,
H. Awan,
B. Bahr-Kalus,
S. Bailey,
C. Baltay,
A. Bault,
J. Behera,
S. BenZvi,
A. Bera,
F. Beutler,
D. Bianchi,
C. Blake,
R. Blum,
et al. (178 additional authors not shown)
Abstract:
We present cosmological results from the measurement of baryon acoustic oscillations (BAO) in galaxy, quasar and Lyman-$α$ forest tracers from the first year of observations from the Dark Energy Spectroscopic Instrument (DESI), to be released in the DESI Data Release 1. DESI BAO provide robust measurements of the transverse comoving distance and Hubble rate, or their combination, relative to the sound horizon, in seven redshift bins from over 6 million extragalactic objects in the redshift range $0.1<z<4.2$. DESI BAO data alone are consistent with the standard flat $Λ$CDM cosmological model with a matter density $Ω_\mathrm{m}=0.295\pm 0.015$. Paired with a BBN prior and the robustly measured acoustic angular scale from the CMB, DESI requires $H_0=(68.52\pm0.62)$ km/s/Mpc. In conjunction with CMB anisotropies from Planck and CMB lensing data from Planck and ACT, we find $Ω_\mathrm{m}=0.307\pm 0.005$ and $H_0=(67.97\pm0.38)$ km/s/Mpc. Extending the baseline model with a constant dark energy equation of state parameter $w$, DESI BAO alone require $w=-0.99^{+0.15}_{-0.13}$. In models with a time-varying dark energy equation of state parametrized by $w_0$ and $w_a$, combinations of DESI with CMB or with SN~Ia individually prefer $w_0>-1$ and $w_a<0$. This preference is 2.6$σ$ for the DESI+CMB combination, and persists or grows when SN~Ia are added in, giving results discrepant with the $Λ$CDM model at the $2.5σ$, $3.5σ$ or $3.9σ$ levels for the addition of Pantheon+, Union3, or DES-SN5YR datasets respectively. For the flat $Λ$CDM model with the sum of neutrino mass $\sum m_ν$ free, combining the DESI and CMB data yields an upper limit $\sum m_ν< 0.072$ $(0.113)$ eV at 95% confidence for a $\sum m_ν>0$ $(\sum m_ν>0.059)$ eV prior. These neutrino-mass constraints are substantially relaxed in models beyond $Λ$CDM. [Abridged.]
Submitted 24 April, 2024; v1 submitted 3 April, 2024;
originally announced April 2024.
-
DESI 2024 IV: Baryon Acoustic Oscillations from the Lyman Alpha Forest
Authors:
DESI Collaboration,
A. G. Adame,
J. Aguilar,
S. Ahlen,
S. Alam,
D. M. Alexander,
M. Alvarez,
O. Alves,
A. Anand,
U. Andrade,
E. Armengaud,
S. Avila,
A. Aviles,
H. Awan,
S. Bailey,
C. Baltay,
A. Bault,
J. Bautista,
J. Behera,
S. BenZvi,
F. Beutler,
D. Bianchi,
C. Blake,
R. Blum,
S. Brieden,
et al. (174 additional authors not shown)
Abstract:
We present the measurement of Baryon Acoustic Oscillations (BAO) from the Lyman-$α$ (Ly$α$) forest of high-redshift quasars with the first-year dataset of the Dark Energy Spectroscopic Instrument (DESI). Our analysis uses over $420\,000$ Ly$α$ forest spectra and their correlation with the spatial distribution of more than $700\,000$ quasars. An essential facet of this work is the development of a new analysis methodology on a blinded dataset. We conducted rigorous tests using synthetic data to ensure the reliability of our methodology and findings before unblinding. Additionally, we conducted multiple data splits to assess the consistency of the results and scrutinized various analysis approaches to confirm their robustness. For a given value of the sound horizon ($r_d$), we measure the expansion at $z_{\rm eff}=2.33$ with 2\% precision, $H(z_{\rm eff}) = (239.2 \pm 4.8) (147.09~{\rm Mpc} /r_d)$ km/s/Mpc. Similarly, we present a 2.4\% measurement of the transverse comoving distance to the same redshift, $D_M(z_{\rm eff}) = (5.84 \pm 0.14) (r_d/147.09~{\rm Mpc})$ Gpc. Together with other DESI BAO measurements at lower redshifts, these results are used in a companion paper to constrain cosmological parameters.
Submitted 12 April, 2024; v1 submitted 3 April, 2024;
originally announced April 2024.
-
DESI 2024 III: Baryon Acoustic Oscillations from Galaxies and Quasars
Authors:
DESI Collaboration,
A. G. Adame,
J. Aguilar,
S. Ahlen,
S. Alam,
D. M. Alexander,
M. Alvarez,
O. Alves,
A. Anand,
U. Andrade,
E. Armengaud,
S. Avila,
A. Aviles,
H. Awan,
S. Bailey,
C. Baltay,
A. Bault,
J. Behera,
S. BenZvi,
F. Beutler,
D. Bianchi,
C. Blake,
R. Blum,
S. Brieden,
A. Brodzeller,
et al. (171 additional authors not shown)
Abstract:
We present the DESI 2024 galaxy and quasar baryon acoustic oscillations (BAO) measurements using over 5.7 million unique galaxy and quasar redshifts in the range 0.1<z<2.1. Divided by tracer type, we utilize 300,017 galaxies from the magnitude-limited Bright Galaxy Survey with 0.1<z<0.4, 2,138,600 Luminous Red Galaxies with 0.4<z<1.1, 2,432,022 Emission Line Galaxies with 0.8<z<1.6, and 856,652 quasars with 0.8<z<2.1, over a ~7,500 square degree footprint. The analysis was blinded at the catalog-level to avoid confirmation bias. All fiducial choices of the BAO fitting and reconstruction methodology, as well as the size of the systematic errors, were determined on the basis of the tests with mock catalogs and the blinded data catalogs. We present several improvements to the BAO analysis pipeline, including enhancing the BAO fitting and reconstruction methods in a more physically-motivated direction, and also present results using combinations of tracers. We present a re-analysis of SDSS BOSS and eBOSS results applying the improved DESI methodology and find scatter consistent with the level of the quoted SDSS theoretical systematic uncertainties. With the total effective survey volume of ~ 18 Gpc$^3$, the combined precision of the BAO measurements across the six different redshift bins is ~0.52%, marking a 1.2-fold improvement over the previous state-of-the-art results using only first-year data. We detect the BAO in all of these six redshift bins. The highest significance of BAO detection is $9.1σ$ at the effective redshift of 0.93, with a constraint of 0.86% placed on the BAO scale. We find our measurements are systematically larger than the prediction of Planck-2018 LCDM model at z<0.8. We translate the results into transverse comoving distance and radial Hubble distance measurements, which are used to constrain cosmological models in our companion paper [abridged].
Submitted 3 April, 2024;
originally announced April 2024.
-
The Surprising Effectiveness of Rankers Trained on Expanded Queries
Authors:
Abhijit Anand,
Venktesh V,
Vinay Setty,
Avishek Anand
Abstract:
An important problem in text-ranking systems is handling the hard queries that form the tail end of the query distribution. The difficulty may arise due to the presence of uncommon, underspecified, or incomplete queries. In this work, we improve the ranking performance of hard or difficult queries without compromising the performance of other queries. Firstly, we do LLM-based query enrichment for training queries using relevant documents. Next, a specialized ranker is fine-tuned only on the enriched hard queries instead of the original queries. We combine the relevance scores from the specialized ranker and the base ranker, along with a query performance score estimated for each query. Our approach departs from existing methods that usually employ a single ranker for all queries, which is biased towards easy queries, which form the majority of the query distribution. In our extensive experiments on the DL-Hard dataset, we find that a principled query-performance-based scoring method using the base and specialized rankers offers a significant improvement of up to 25% on the passage ranking task and up to 48.4% on the document ranking task when compared to the baseline performance of using original queries, even outperforming the SOTA model.
Submitted 12 June, 2024; v1 submitted 3 April, 2024;
originally announced April 2024.
-
QuanTemp: A real-world open-domain benchmark for fact-checking numerical claims
Authors:
Venktesh V,
Abhijit Anand,
Avishek Anand,
Vinay Setty
Abstract:
Automated fact checking has gained immense interest to tackle the growing misinformation in the digital era. Existing systems primarily focus on synthetic claims on Wikipedia, and noteworthy progress has also been made on real-world claims. In this work, we release QuanTemp, a diverse, multi-domain dataset focused exclusively on numerical claims, encompassing temporal, statistical and diverse aspects with fine-grained metadata and an evidence collection without leakage. This addresses the challenge of verifying real-world numerical claims, which are complex and often lack precise information, not addressed by existing works that mainly focus on synthetic claims. We evaluate and quantify the limitations of existing solutions for the task of verifying numerical claims. We also evaluate claim-decomposition-based methods and numerical-understanding-based models; our best baseline achieves a macro-F1 of 58.32. This demonstrates that QuanTemp serves as a challenging evaluation set for numerical claim verification.
Submitted 1 May, 2024; v1 submitted 25 March, 2024;
originally announced March 2024.
-
LiteBIRD Science Goals and Forecasts: Primordial Magnetic Fields
Authors:
D. Paoletti,
J. Rubino-Martin,
M. Shiraishi,
D. Molinari,
J. Chluba,
F. Finelli,
C. Baccigalupi,
J. Errard,
A. Gruppuso,
A. I. Lonappan,
A. Tartari,
E. Allys,
A. Anand,
J. Aumont,
M. Ballardini,
A. J. Banday,
R. B. Barreiro,
N. Bartolo,
M. Bersanelli,
M. Bortolami,
T. Brinckmann,
E. Calabrese,
P. Campeti,
A. Carones,
F. J. Casas,
et al. (75 additional authors not shown)
Abstract:
We present detailed forecasts for the constraints on primordial magnetic fields (PMFs) that will be obtained with the LiteBIRD satellite. The constraints are driven by the effects of PMFs on the CMB anisotropies: the gravitational effects of magnetically-induced perturbations; the effects on the thermal and ionization history of the Universe; the Faraday rotation imprint on the CMB polarization; and the non-Gaussianities induced in polarization anisotropies. LiteBIRD represents a sensitive probe for PMFs and by exploiting all the physical effects, it will be able to improve the current limit coming from Planck. In particular, thanks to its accurate $B$-mode polarization measurement, LiteBIRD will improve the constraints on infrared configurations for the gravitational effect, giving $B_{\rm 1\,Mpc}^{n_{\rm B} =-2.9} < 0.8$ nG at 95% C.L., potentially opening the possibility to detect nanogauss fields with high significance. We also observe a significant improvement in the limits when marginalized over the spectral index, $B_{1\,{\rm Mpc}}^{\rm marg}< 2.2$ nG at 95% C.L. From the thermal history effect, which relies mainly on $E$-mode polarization data, we obtain a significant improvement for all PMF configurations, with the marginalized case, $\sqrt{\langle B^2\rangle}^{\rm marg}<0.50$ nG at 95% C.L. Faraday rotation constraints will take advantage of the wide frequency coverage of LiteBIRD and the high sensitivity in $B$ modes, improving the limits by orders of magnitude with respect to current results, $B_{1\,{\rm Mpc}}^{n_{\rm B} =-2.9} < 3.2$ nG at 95% C.L. Finally, non-Gaussianities of the $B$-mode polarization can probe PMFs at the level of 1 nG, again significantly improving the current bounds from Planck. Altogether our forecasts represent a broad collection of complementary probes, providing conservative limits on PMF characteristics that will be achieved with LiteBIRD.
Submitted 25 March, 2024;
originally announced March 2024.
-
RankingSHAP -- Listwise Feature Attribution Explanations for Ranking Models
Authors:
Maria Heuss,
Maarten de Rijke,
Avishek Anand
Abstract:
Feature attributions are a commonly used explanation type when we want to explain the prediction of a trained model post hoc. Yet, they are not very well explored in IR. Importantly, feature attribution has rarely been rigorously defined, beyond attributing the most important feature the highest value. What it means for a feature to be more important than others is often left vague. Consequently, most approaches focus on just selecting the most important features and underutilize or even ignore the relative importance within features. In this work, we rigorously define the notion of feature attribution for ranking models, and list essential properties that a valid attribution should have. We then propose RankingSHAP as a concrete instantiation of a list-wise ranking attribution method. Contrary to current explanation evaluation schemes that focus on selections, we propose two novel evaluation paradigms for evaluating attributions over learning-to-rank models. We evaluate RankingSHAP for commonly used learning-to-rank datasets to showcase the more nuanced use of an attribution method while highlighting the limitations of selection-based explanations. In a simulated experiment we design an interpretable model to demonstrate how list-wise ranking attributes can be used to investigate model decisions and evaluate the explanations qualitatively. Because of the contrastive nature of the ranking task, our understanding of ranking model decisions can substantially benefit from feature attribution explanations like RankingSHAP.
Submitted 24 March, 2024;
originally announced March 2024.
-
Modified gravity theories from the Barrow hypothesis
Authors:
Ankit Anand,
Ruben Campos Delgado
Abstract:
Barrow proposed that quantum gravity effects might introduce fractal corrections to the area of the event horizon of black holes. The area law gets modified as $S \propto A^{1+Δ/2}$, with $0\leqΔ\leq 1$. It was so far unclear whether this assumption could lead to meaningful quantum gravity theories beyond general relativity. In this paper, we argue that this is indeed the case. In particular, assuming $Δ$ to be a radial function, we show that the Barrow hypothesis, together with Jacobson's approach, can generate non-trivial modified gravity theories.
Submitted 15 May, 2024; v1 submitted 20 March, 2024;
originally announced March 2024.
-
Approximating Small Sparse Cuts
Authors:
Aditya Anand,
Euiwoong Lee,
Jason Li,
Thatchaphol Saranurak
Abstract:
We study polynomial-time approximation algorithms for (edge/vertex) Sparsest Cut and Small Set Expansion in terms of $k$, the number of edges or vertices cut in the optimal solution. Our main results are $\mathcal{O}(\text{polylog}\, k)$-approximation algorithms for various versions in this setting.
Our techniques involve an extension of the notion of sample sets (Feige and Mahdian STOC'06), originally developed for small balanced cuts, to sparse cuts in general. We then show how to combine this notion of sample sets with two algorithms, one based on an existing framework of LP rounding and another new algorithm based on the cut-matching game, to get such approximation algorithms. Our cut-matching game algorithm can be viewed as a local version of the cut-matching game by Khandekar, Khot, Orecchia and Vishnoi and certifies an expansion of every vertex set of size $s$ in $\mathcal{O}(\log s)$ rounds. These techniques may be of independent interest.
As corollaries of our results, we also obtain an $\mathcal{O}(\log opt)$-approximation for min-max graph partitioning, where $opt$ is the min-max value of the optimal cut, and improve the bound on the size of multicut mimicking networks computable in polynomial time.
Submitted 13 March, 2024;
originally announced March 2024.
-
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Authors:
Gemini Team,
Petko Georgiev,
Ving Ian Lei,
Ryan Burnell,
Libin Bai,
Anmol Gulati,
Garrett Tanzer,
Damien Vincent,
Zhufeng Pan,
Shibo Wang,
Soroosh Mariooryad,
Yifan Ding,
Xinyang Geng,
Fred Alcober,
Roy Frostig,
Mark Omernick,
Lexi Walker,
Cosmin Paduraru,
Christina Sorokin,
Andrea Tacchetti,
Colin Gaffney,
Samira Daruki,
Olcan Sercinoglu,
Zach Gleicher,
Juliette Love
, et al. (1092 additional authors not shown)
Abstract:
In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.
Submitted 14 June, 2024; v1 submitted 8 March, 2024;
originally announced March 2024.
-
Don't Blame the Data, Blame the Model: Understanding Noise and Bias When Learning from Subjective Annotations
Authors:
Abhishek Anand,
Negar Mokhberian,
Prathyusha Naresh Kumar,
Anweasha Saha,
Zihao He,
Ashwin Rao,
Fred Morstatter,
Kristina Lerman
Abstract:
Researchers have raised awareness about the harms of aggregating labels especially in subjective tasks that naturally contain disagreements among human annotators. In this work we show that models that are only provided aggregated labels show low confidence on high-disagreement data instances. While previous studies consider such instances as mislabeled, we argue that the reason the high-disagreement text instances have been hard-to-learn is that the conventional aggregated models underperform in extracting useful signals from subjective tasks. Inspired by recent studies demonstrating the effectiveness of learning from raw annotations, we investigate classifying using Multiple Ground Truth (Multi-GT) approaches. Our experiments show an improvement of confidence for the high-disagreement instances.
Submitted 6 March, 2024;
originally announced March 2024.
-
Frailty or Frailties: Exploring Frailty Index Subdimensions in the English Longitudinal Study of Ageing
Authors:
Lara Johnson,
Bruce Guthrie,
Paul A T Kelly,
Atul Anand,
Alan Marshall,
Sohan Seth
Abstract:
Background: Frailty, a state of increased vulnerability to adverse health outcomes, has garnered significant attention in research and clinical practice. Existing constructs aggregate clinical features or health deficits into a single score. While simple and interpretable, this approach may overlook the complexity of frailty and not capture the full range of variation between individuals.
Methods: Exploratory factor analysis was used to infer latent dimensions of a frailty index constructed using survey data from the English Longitudinal Study of Ageing (ELSA), wave 9. The dataset included 58 self-reported health deficits in a representative sample of community-dwelling adults aged 65+ (N = 4971). Deficits encompassed chronic disease, general health status, mobility, independence with activities of daily living, psychological wellbeing, memory and cognition. Multiple linear regression examined associations with CASP-19 quality of life scores.
Results: Factor analysis revealed four frailty subdimensions. Based on the component deficits with the highest loading values, these factors were labelled "Mobility Impairment and Physical Morbidity", "Difficulties in Daily Activities", "Mental Health" and "Disorientation in Time". The four subdimensions were a better predictor of quality of life than frailty index scores.
Conclusions: Distinct subdimensions of frailty can be identified from standard index scores. A decomposed approach to understanding frailty has potential to provide a more nuanced understanding of an individual's state of health across multiple deficits.
Submitted 1 March, 2024;
originally announced March 2024.
-
Code as Reward: Empowering Reinforcement Learning with VLMs
Authors:
David Venuto,
Sami Nur Islam,
Martin Klissarov,
Doina Precup,
Sherry Yang,
Ankit Anand
Abstract:
Pre-trained Vision-Language Models (VLMs) are able to understand visual concepts, describe and decompose complex tasks into sub-tasks, and provide feedback on task completion. In this paper, we aim to leverage these capabilities to support the training of reinforcement learning (RL) agents. In principle, VLMs are well suited for this purpose, as they can naturally analyze image-based observations and provide feedback (reward) on learning progress. However, inference in VLMs is computationally expensive, so querying them frequently to compute rewards would significantly slow down the training of an RL agent. To address this challenge, we propose a framework named Code as Reward (VLM-CaR). VLM-CaR produces dense reward functions from VLMs through code generation, thereby significantly reducing the computational burden of querying the VLM directly. We show that the dense rewards generated through our approach are very accurate across a diverse set of discrete and continuous environments, and can be more effective in training RL policies than the original sparse environment rewards.
Submitted 7 February, 2024;
originally announced February 2024.
-
Transfer Learning for the Prediction of Entity Modifiers in Clinical Text: Application to Opioid Use Disorder Case Detection
Authors:
Abdullateef I. Almudaifer,
Whitney Covington,
JaMor Hairston,
Zachary Deitch,
Ankit Anand,
Caleb M. Carroll,
Estera Crisan,
William Bradford,
Lauren Walter,
Eaton Ellen,
Sue S. Feldman,
John D. Osborne
Abstract:
Background: The semantics of entities extracted from a clinical text can be dramatically altered by modifiers, including entity negation, uncertainty, conditionality, severity, and subject. Existing models for determining modifiers of clinical entities involve regular expressions or feature weights that are trained independently for each modifier.
Methods: We develop and evaluate a multi-task transformer architecture design where modifiers are learned and predicted jointly using the publicly available SemEval 2015 Task 14 corpus and a new Opioid Use Disorder (OUD) data set that contains modifiers shared with SemEval as well as novel modifiers specific for OUD. We evaluate the effectiveness of our multi-task learning approach versus previously published systems and assess the feasibility of transfer learning for clinical entity modifiers when only a portion of clinical modifiers are shared.
Results: Our approach achieved state-of-the-art results on the ShARe corpus from SemEval 2015 Task 14, showing an increase of 1.1% on weighted accuracy, 1.7% on unweighted accuracy, and 10% on micro F1 scores.
Conclusions: We show that learned weights from our shared model can be effectively transferred to a new partially matched data set, validating the use of transfer learning for clinical text modifiers.
Submitted 5 February, 2024; v1 submitted 26 January, 2024;
originally announced January 2024.
-
Separating $k$-Median from the Supplier Version
Authors:
Aditya Anand,
Euiwoong Lee
Abstract:
Given a metric space $(V, d)$ along with an integer $k$, the $k$-Median problem asks to open $k$ centers $C \subseteq V$ to minimize $\sum_{v \in V} d(v, C)$, where $d(v, C) := \min_{c \in C} d(v, c)$. While the best-known approximation ratio of $2.613$ holds for the more general supplier version where an additional set $F \subseteq V$ is given with the restriction $C \subseteq F$, the best-known hardness results for these two versions are $1+1/e \approx 1.36$ and $1+2/e \approx 1.73$ respectively, using the same reduction from Max $k$-Coverage. We prove the following two results separating them.
First, we show a $1.546$-parameterized approximation algorithm that runs in time $f(k) n^{O(1)}$. Since $1+2/e$ is proved to be the optimal approximation ratio for the supplier version in the parameterized setting, this result separates the original $k$-Median from the supplier version.
Next, we prove a $1.416$-hardness for polynomial-time algorithms assuming the Unique Games Conjecture. This is achieved via a new fine-grained hardness of Max-$k$-Coverage for small set sizes.
Our upper bound and lower bound are derived from almost the same expression, with the only difference coming from the well-known separation between the powers of LP and SDP on (hypergraph) vertex cover.
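As a concrete reading of the objective defined in this abstract, the following minimal sketch evaluates $\sum_{v \in V} d(v, C)$ for a given center set and brute-forces the best $k$ centers on a toy instance. The function names and the toy line metric are illustrative assumptions, not taken from the paper, and the exhaustive search is exponential in $k$; it is not the paper's parameterized algorithm.

```python
from itertools import combinations

def kmedian_cost(points, centers, dist):
    """Sum over all points of the distance to the nearest open center."""
    return sum(min(dist(v, c) for c in centers) for v in points)

def brute_force_kmedian(points, k, dist, facilities=None):
    """Try every k-subset of candidate centers (exponential; toy use only).

    facilities=None gives the original version (C may be any subset of V);
    passing a set F restricts C to F, as in the supplier version.
    """
    candidates = points if facilities is None else facilities
    best = min(combinations(candidates, k),
               key=lambda C: kmedian_cost(points, C, dist))
    return list(best), kmedian_cost(points, best, dist)

# Hypothetical toy instance: points on a line with d(u, v) = |u - v|.
V = [0, 1, 2, 10, 11, 12]
line = lambda u, v: abs(u - v)
centers, cost = brute_force_kmedian(V, 2, line)
restricted, restricted_cost = brute_force_kmedian(V, 2, line, facilities=[0, 12])
```

On this toy instance, restricting the centers to a facility set can only keep the optimal cost the same or raise it, which is the sense in which the supplier version is the more general (and harder) of the two.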
Submitted 24 January, 2024;
originally announced January 2024.
-
Temporal Blind Spots in Large Language Models
Authors:
Jonas Wallat,
Adam Jatowt,
Avishek Anand
Abstract:
Large language models (LLMs) have recently gained significant attention due to their unparalleled ability to perform various natural language processing tasks. These models, benefiting from their advanced natural language understanding capabilities, have demonstrated impressive zero-shot performance. However, the pre-training data utilized in LLMs is often confined to a specific corpus, resulting in inherent freshness and temporal scope limitations. Consequently, this raises concerns regarding the effectiveness of LLMs for tasks involving temporal intents. In this study, we aim to investigate the underlying limitations of general-purpose LLMs when deployed for tasks that require a temporal understanding. We pay particular attention to handling factual temporal knowledge through three popular temporal QA datasets. Specifically, we observe low performance on detailed questions about the past and, surprisingly, for rather new information. In manual and automatic testing, we find multiple temporal errors and characterize the conditions under which QA performance deteriorates. Our analysis contributes to understanding LLM limitations and offers valuable insights into developing future models that can better cater to the demands of temporally-oriented tasks. The code is available at https://github.com/jwallat/temporalblindspots.
Submitted 22 January, 2024;
originally announced January 2024.
-
Hamiltonians, groups, graphs and ansätze
Authors:
Abhinav Anand,
Kenneth R. Brown
Abstract:
One promising application of near-term quantum devices is to prepare trial wavefunctions using short circuits for solving different problems via variational algorithms. For this purpose, we introduce a new circuit design that combines graph-based diagonalization circuits with arbitrary single-qubit rotation gates to get Hamiltonian-based graph states ansätze (H-GSA). We test the accuracy of the proposed ansatz in estimating ground state energies of various molecules of size up to 12 qubits. Additionally, we compare the gate count and parameter number complexity of the proposed ansatz against previously proposed schemes and find an order-of-magnitude reduction in gate count complexity with a slight increase in the number of parameters. Our work represents a significant step towards constructing compact quantum circuits with good trainability and convergence properties and applications in solving chemistry and physics problems.
Submitted 28 December, 2023;
originally announced December 2023.
-
GeomVerse: A Systematic Evaluation of Large Models for Geometric Reasoning
Authors:
Mehran Kazemi,
Hamidreza Alvari,
Ankit Anand,
Jialin Wu,
Xi Chen,
Radu Soricut
Abstract:
Large language models have shown impressive results for multi-hop mathematical reasoning when the input question is only textual. Many mathematical reasoning problems, however, contain both text and image. With the ever-increasing adoption of vision language models (VLMs), understanding their reasoning abilities for such problems is crucial. In this paper, we evaluate the reasoning capabilities of VLMs along various axes through the lens of geometry problems. We procedurally create a synthetic dataset of geometry questions with controllable difficulty levels along multiple axes, thus enabling a systematic evaluation. The empirical results obtained using our benchmark for state-of-the-art VLMs indicate that these models are not as capable in subjects like geometry (and, by generalization, other topics requiring similar reasoning) as suggested by previous benchmarks. This is made especially clear by the construction of our benchmark at various depth levels, since solving higher-depth problems requires long chains of reasoning rather than additional memorized knowledge. We release the dataset for further research in this area.
Submitted 19 December, 2023;
originally announced December 2023.
-
Flow cross-overs under surface fluctuations in cylindrical nano-channel
Authors:
Aakash Anand,
A. Bhattacharyay
Abstract:
We analyse surface-fluctuations-driven fluid flow through nano-channels to investigate the interplay between boundary layer flow structures and the bulk flow of fluid under a pressure-head. Surface fluctuations of a wide range of frequencies (up to several thousands of Hertz) in a nano-channel keep the flow in the low Reynolds number regime. Using this advantage of low Reynolds number flow, we develop a perturbation analysis of the fluid flow that clearly distinguishes the bulk flow under a pressure head around the axis of a nano-tube from its surface flow structure induced by fluctuations. In terms of particle transport under such flow conditions, there exists the opportunity to drag particles near the periphery of the nano-tube in a direction opposite to the bulk flow near the axis. This can potentially find applications in the separation, trapping, and filtration of particles under surface-driven flow through nano-tubes under widely varying conditions.
Submitted 6 April, 2024; v1 submitted 19 December, 2023;
originally announced December 2023.
-
Gemini: A Family of Highly Capable Multimodal Models
Authors:
Gemini Team,
Rohan Anil,
Sebastian Borgeaud,
Jean-Baptiste Alayrac,
Jiahui Yu,
Radu Soricut,
Johan Schalkwyk,
Andrew M. Dai,
Anja Hauth,
Katie Millican,
David Silver,
Melvin Johnson,
Ioannis Antonoglou,
Julian Schrittwieser,
Amelia Glaese,
Jilin Chen,
Emily Pitler,
Timothy Lillicrap,
Angeliki Lazaridou,
Orhan Firat,
James Molloy,
Michael Isard,
Paul R. Barham,
Tom Hennigan,
Benjamin Lee
, et al. (1325 additional authors not shown)
Abstract:
This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.
Submitted 17 June, 2024; v1 submitted 18 December, 2023;
originally announced December 2023.
-
Leveraging commuting groups for an efficient variational Hamiltonian ansatz
Authors:
Abhinav Anand,
Kenneth R. Brown
Abstract:
Efficiently calculating the low-lying eigenvalues of Hamiltonians, written as sums of Pauli operators, is a fundamental challenge in quantum computing. While various methods have been proposed to reduce the complexity of quantum circuits for this task, there remains room for further improvement. In this article, we introduce a new circuit design using commuting groups within the Hamiltonian to further reduce the circuit complexity of Hamiltonian-based quantum circuits. Our approach involves partitioning the Pauli operators into mutually commuting clusters and finding Clifford unitaries that diagonalize each cluster. We then design an ansatz that uses these Clifford unitaries for efficient switching between the clusters, complemented by a layer of parameterized single qubit rotations for each individual cluster. By conducting numerical simulations, we demonstrate the effectiveness of our method in accurately determining the ground state energy of different quantum chemistry Hamiltonians. Our results highlight the applicability and potential of our approach for designing problem-inspired ansatz for various quantum computing applications.
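The partitioning step described in this abstract can be illustrated with a minimal sketch. It relies on the standard fact that two Pauli strings commute exactly when they differ (with both factors non-identity) on an even number of qubits. The greedy first-fit grouping and the function names below are assumptions made for illustration; the abstract does not say which clustering heuristic the authors use, and the Clifford diagonalization step is not shown.

```python
def commute(p, q):
    """Two Pauli strings commute iff they clash on an even number of qubits.

    A clash is a position where both factors are non-identity and different.
    """
    clashes = sum(1 for a, b in zip(p, q) if a != "I" and b != "I" and a != b)
    return clashes % 2 == 0

def greedy_commuting_clusters(paulis):
    """First-fit grouping: put each string in the first cluster it fully commutes with."""
    clusters = []
    for p in paulis:
        for cluster in clusters:
            if all(commute(p, q) for q in cluster):
                cluster.append(p)
                break
        else:
            # No existing cluster is compatible, so open a new one.
            clusters.append([p])
    return clusters

# Hypothetical two-qubit Hamiltonian terms (coefficients omitted).
terms = ["ZZ", "XX", "ZI", "IZ", "XI"]
clusters = greedy_commuting_clusters(terms)
```

First-fit grouping depends on the order of the input terms and does not guarantee the minimum number of clusters; it is only meant to show what "mutually commuting clusters" means in practice.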
Submitted 13 December, 2023;
originally announced December 2023.
-
Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models
Authors:
Avi Singh,
John D. Co-Reyes,
Rishabh Agarwal,
Ankesh Anand,
Piyush Patil,
Xavier Garcia,
Peter J. Liu,
James Harrison,
Jaehoon Lee,
Kelvin Xu,
Aaron Parisi,
Abhishek Kumar,
Alex Alemi,
Alex Rizkowsky,
Azade Nova,
Ben Adlam,
Bernd Bohnet,
Gamaleldin Elsayed,
Hanie Sedghi,
Igor Mordatch,
Isabelle Simpson,
Izzeddin Gur,
Jasper Snoek,
Jeffrey Pennington,
Jiri Hron
, et al. (16 additional authors not shown)
Abstract:
Fine-tuning language models (LMs) on human-generated data remains a prevalent practice. However, the performance of such models is often limited by the quantity and diversity of high-quality human data. In this paper, we explore whether we can go beyond human data on tasks where we have access to scalar feedback, for example, on math problems where one can verify correctness. To do so, we investigate a simple self-training method based on expectation-maximization, which we call ReST$^{EM}$, where we (1) generate samples from the model and filter them using binary feedback, (2) fine-tune the model on these samples, and (3) repeat this process a few times. Testing on advanced MATH reasoning and APPS coding benchmarks using PaLM-2 models, we find that ReST$^{EM}$ scales favorably with model size and significantly surpasses fine-tuning only on human data. Overall, our findings suggest self-training with feedback can substantially reduce dependence on human-generated data.
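The three-step loop in this abstract can be sketched as below. This is only a schematic reading of the abstract: the `rest_em` function, its callback parameters, and the toy "model" (a lookup table over addition problems) are hypothetical stand-ins, and details such as which checkpoint is fine-tuned in each round are left to the `fine_tune` callback rather than taken from the paper.

```python
import random

def rest_em(model, problems, sample, is_correct, fine_tune,
            iterations=3, samples_per_problem=4):
    """Generate, filter with binary feedback, fine-tune; repeat a few times."""
    for _ in range(iterations):
        kept = []
        for problem in problems:
            for _ in range(samples_per_problem):
                answer = sample(model, problem)      # (1) generate a sample
                if is_correct(problem, answer):      # (1) keep only verified samples
                    kept.append((problem, answer))
        model = fine_tune(model, kept)               # (2) fine-tune on kept samples
    return model                                     # (3) result after the repeats

# Toy stand-ins (assumptions): the "model" is a dict of memorized answers.
rng = random.Random(0)
problems = [(1, 2), (3, 4), (5, 6)]

def sample(model, problem):
    """Return the memorized answer if any, otherwise a noisy guess."""
    if problem in model:
        return model[problem]
    a, b = problem
    return a + b + rng.choice([-1, 0, 1])

def is_correct(problem, answer):
    """Binary feedback: is the proposed sum right?"""
    return answer == sum(problem)

def fine_tune(model, kept):
    """'Training' here just memorizes the verified (problem, answer) pairs."""
    return {**model, **dict(kept)}

trained = rest_em({}, problems, sample, is_correct, fine_tune)
```

Because only verified samples are ever kept, everything the toy model memorizes is correct by construction, which mirrors the role of the binary filter in the loop.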
Submitted 17 April, 2024; v1 submitted 11 December, 2023;
originally announced December 2023.
-
LiteBIRD Science Goals and Forecasts: Improving Sensitivity to Inflationary Gravitational Waves with Multitracer Delensing
Authors:
T. Namikawa,
A. I. Lonappan,
C. Baccigalupi,
N. Bartolo,
D. Beck,
K. Benabed,
A. Challinor,
P. Diego-Palazuelos,
J. Errard,
S. Farrens,
A. Gruppuso,
N. Krachmalnicoff,
M. Migliaccio,
E. Martínez-González,
V. Pettorino,
G. Piccirilli,
M. Ruiz-Granda,
B. Sherwin,
J. Starck,
P. Vielva,
R. Akizawa,
A. Anand,
J. Aumont,
R. Aurlien,
S. Azzoni
, et al. (97 additional authors not shown)
Abstract:
We estimate the efficiency of mitigating the lensing $B$-mode polarization, the so-called delensing, for the $LiteBIRD$ experiment with multiple external data sets of lensing-mass tracers. The current best bound on the tensor-to-scalar ratio, $r$, is limited by lensing rather than Galactic foregrounds. Delensing will be a critical step to improve sensitivity to $r$ as measurements of $r$ become more and more limited by lensing. In this paper, we extend the analysis of the recent $LiteBIRD$ forecast paper to include multiple mass tracers, i.e., the CMB lensing maps from $LiteBIRD$ and a CMB-S4-like experiment, the cosmic infrared background, and galaxy number density from $Euclid$- and LSST-like surveys. We find that multi-tracer delensing will further improve the constraint on $r$ by about $20\%$. In $LiteBIRD$, the residual Galactic foregrounds also significantly contribute to uncertainties of the $B$-modes, and delensing becomes more important if the residual foregrounds are further reduced by an improved component separation method.
Submitted 8 December, 2023;
originally announced December 2023.
-
LiteBIRD Science Goals and Forecasts: A full-sky measurement of gravitational lensing of the CMB
Authors:
A. I. Lonappan,
T. Namikawa,
G. Piccirilli,
P. Diego-Palazuelos,
M. Ruiz-Granda,
M. Migliaccio,
C. Baccigalupi,
N. Bartolo,
D. Beck,
K. Benabed,
A. Challinor,
J. Errard,
S. Farrens,
A. Gruppuso,
N. Krachmalnicoff,
E. Martínez-González,
V. Pettorino,
B. Sherwin,
J. Starck,
P. Vielva,
R. Akizawa,
A. Anand,
J. Aumont,
R. Aurlien,
S. Azzoni
, et al. (97 additional authors not shown)
Abstract:
We explore the capability of measuring lensing signals in $LiteBIRD$ full-sky polarization maps. With a $30$ arcmin beam width and an impressively low polarization noise of $2.16\,μ$K-arcmin, $LiteBIRD$ will be able to measure the full-sky polarization of the cosmic microwave background (CMB) very precisely. This unique sensitivity also enables the reconstruction of a nearly full-sky lensing map using only polarization data, even considering its limited capability to capture small-scale CMB anisotropies. In this paper, we investigate the ability to construct a full-sky lensing measurement in the presence of Galactic foregrounds, finding that several possible biases from Galactic foregrounds should be negligible after component separation by harmonic-space internal linear combination. We find that the signal-to-noise ratio of the lensing is approximately $40$ using only polarization data measured over $90\%$ of the sky. This achievement is comparable to $Planck$'s recent lensing measurement with both temperature and polarization and represents a four-fold improvement over $Planck$'s polarization-only lensing measurement. The $LiteBIRD$ lensing map will complement the $Planck$ lensing map and provide several opportunities for cross-correlation science, especially in the northern hemisphere.
Submitted 8 December, 2023;
originally announced December 2023.
-
LiteBIRD Science Goals and Forecasts. A Case Study of the Origin of Primordial Gravitational Waves using Large-Scale CMB Polarization
Authors:
P. Campeti,
E. Komatsu,
C. Baccigalupi,
M. Ballardini,
N. Bartolo,
A. Carones,
J. Errard,
F. Finelli,
R. Flauger,
S. Galli,
G. Galloni,
S. Giardiello,
M. Hazumi,
S. Henrot-Versillé,
L. T. Hergt,
K. Kohri,
C. Leloup,
J. Lesgourgues,
J. Macias-Perez,
E. Martínez-González,
S. Matarrese,
T. Matsumura,
L. Montier,
T. Namikawa,
D. Paoletti
, et al. (85 additional authors not shown)
Abstract:
We study the possibility of using the $LiteBIRD$ satellite $B$-mode survey to constrain models of inflation producing specific features in CMB angular power spectra. We explore a particular model example, i.e. spectator axion-SU(2) gauge field inflation. This model can source parity-violating gravitational waves from the amplification of gauge field fluctuations driven by a pseudoscalar "axionlike" field, rolling for a few e-folds during inflation. The sourced gravitational waves can exceed the vacuum contribution at reionization bump scales by about an order of magnitude and can be comparable to the vacuum contribution at recombination bump scales. We argue that a satellite mission with full sky coverage and access to the reionization bump scales is necessary to understand the origin of the primordial gravitational wave signal and distinguish between two production mechanisms: quantum vacuum fluctuations of spacetime and matter sources during inflation. We present the expected constraints on model parameters from $LiteBIRD$ satellite simulations, which complement and expand previous studies in the literature. We find that $LiteBIRD$ will be able to exclude with high significance standard single-field slow-roll models, such as the Starobinsky model, if the true model is the axion-SU(2) model with a feature at CMB scales. We further investigate the possibility of using the parity-violating signature of the model, such as the $TB$ and $EB$ angular power spectra, to disentangle it from the standard single-field slow-roll scenario. We find that most of the discriminating power of $LiteBIRD$ will reside in $BB$ angular power spectra rather than in $TB$ and $EB$ correlations.
Submitted 1 December, 2023;
originally announced December 2023.
-
Data Augmentation for Sample Efficient and Robust Document Ranking
Authors:
Abhijit Anand,
Jurek Leonhardt,
Jaspreet Singh,
Koustav Rudra,
Avishek Anand
Abstract:
Contextual ranking models have delivered impressive performance improvements over classical models in the document ranking task. However, these highly over-parameterized models tend to be data-hungry and require large amounts of data even for fine-tuning. In this paper, we propose data-augmentation methods for effective and robust ranking performance. One of the key benefits of using data augmentation is in achieving sample efficiency or learning effectively when we have only a small amount of training data. We propose supervised and unsupervised data augmentation schemes by creating training data using parts of the relevant documents in the query-document pairs. We then adapt a family of contrastive losses for the document ranking task that can exploit the augmented data to learn an effective ranking model. Our extensive experiments on subsets of the MS MARCO and TREC-DL test sets show that data augmentation, along with the ranking-adapted contrastive losses, results in performance improvements under most dataset sizes. Apart from sample efficiency, we conclusively show that data augmentation results in robust models when transferred to out-of-domain benchmarks. Our performance improvements in in-domain and more prominently in out-of-domain benchmarks show that augmentation regularizes the ranking model and improves its robustness and generalization capability.
Submitted 26 November, 2023;
originally announced November 2023.
-
Noise in Relation Classification Dataset TACRED: Characterization and Reduction
Authors:
Akshay Parekh,
Ashish Anand,
Amit Awekar
Abstract:
The overarching objective of this paper is two-fold: first, to explore model-based approaches to characterize the primary cause of the noise in the RE dataset TACRED; second, to identify the potentially noisy instances. Towards the first objective, we analyze predictions and performance of state-of-the-art (SOTA) models to identify the root cause of noise in the dataset. Our analysis of TACRED shows that the majority of the noise in the dataset originates from the instances labeled as no-relation, which are negative examples. For the second objective, we explore two nearest-neighbor-based strategies to automatically identify potentially noisy examples for elimination and reannotation. Our first strategy, referred to as Intrinsic Strategy (IS), is based on the assumption that positive examples are clean. Thus, we use false-negative predictions to identify noisy negative examples. Our second approach, referred to as Extrinsic Strategy (ES), is based on using a clean subset of the dataset to identify potentially noisy negative examples. Finally, we retrained the SOTA models on the eliminated and reannotated datasets. Our empirical results based on two SOTA models trained on TACRED-E following the IS show an average 4% F1-score improvement, whereas reannotation (TACRED-R) does not improve the original results. However, following ES, SOTA models show average F1-score improvements of 3.8% and 4.4% when trained on the eliminated (TACRED-EN) and reannotated (TACRED-RN) datasets, respectively. We further extended the ES to clean positive examples as well, which resulted in average performance improvements of 5.8% and 5.6% for the eliminated (TACRED-ENP) and reannotated (TACRED-RNP) datasets, respectively.
Submitted 20 November, 2023;
originally announced November 2023.