Search | arXiv e-print repository

An Interdisciplinary Outlook on Large Language Models for Scientific Research

Authors: James Boyko, Joseph Cohen, Nathan Fox, Maria Han Veiga, Jennifer I-Hsiu Li, Jing Liu, Bernardo Modenesi, Andreas H. Rauch, Kenneth N. Reid, Soumi Tribedi, Anastasia Visheratina, Xin Xie

Abstract: In this paper, we describe the capabilities and constraints of Large Language Models (LLMs) within disparate academic disciplines, aiming to delineate their strengths and limitations with precision. We examine how LLMs augment scientific inquiry, offering concrete examples such as accelerating literature review by summarizing vast numbers of publications, enhancing code development through automated syntax correction, and refining the scientific writing process. Simultaneously, we articulate the challenges LLMs face, including their reliance on extensive and sometimes biased datasets, and the potential ethical dilemmas stemming from their use. Our critical discussion extends to the varying impacts of LLMs across fields, from the natural sciences, where they help model complex biological sequences, to the social sciences, where they can parse large-scale qualitative data. We conclude by offering a nuanced perspective on how LLMs can be both a boon and a boundary to scientific progress.

Submitted 3 November, 2023; originally announced November 2023.

arXiv:2303.12785 [pdf, other]

Matryoshka Policy Gradient for Entropy-Regularized RL: Convergence and Global Optimality

Authors: François Ged, Maria Han Veiga

Abstract: A novel Policy Gradient (PG) algorithm, called Matryoshka Policy Gradient (MPG), is introduced and studied in the context of max-entropy reinforcement learning, where an agent aims at maximising entropy bonuses in addition to its cumulative rewards. MPG differs from standard PG in that it trains a sequence of policies to learn finite horizon tasks simultaneously, instead of a single policy for the single standard objective. For softmax policies, we prove convergence of MPG and global optimality of the limit by showing that the only critical point of the MPG objective is the optimal policy; these results hold true even in the case of continuous compact state space. MPG is intuitive and theoretically sound, and we furthermore show that the optimal policy of the standard max-entropy objective can be approximated arbitrarily well by the optimal policy of the MPG framework. Finally, we justify that MPG is well suited when the policies are parametrized with neural networks, and we provide a simple criterion to verify the global optimality of the policy at convergence. As a proof of concept, we numerically evaluate MPG on standard test benchmarks.

Submitted 25 June, 2023; v1 submitted 22 March, 2023; originally announced March 2023.

MSC Class: 68T07 ACM Class: I.2.0; I.2.6

arXiv:2107.09082 [pdf, other]

Reconstruction of the Density Power Spectrum from Quasar Spectra using Machine Learning

Authors: Maria Han Veiga, Xi Meng, Oleg Y. Gnedin, Nickolay Y. Gnedin, Xun Huan

Abstract: We describe a novel end-to-end approach using Machine Learning to reconstruct the power spectrum of cosmological density perturbations at high redshift from observed quasar spectra. State-of-the-art cosmological simulations of structure formation are used to generate a large synthetic dataset of line-of-sight absorption spectra paired with 1-dimensional fluid quantities along the same line-of-sight, such as the total density of matter and the density of neutral atomic hydrogen. With this dataset, we build a series of data-driven models to predict the power spectrum of total matter density. We are able to produce models which yield reconstructions accurate to about 1% for wavenumbers $k \leq 2\,h\,\mathrm{Mpc}^{-1}$, while the error increases at larger $k$. We show the size of the data sample required to reach a particular error rate, giving a sense of how much data is necessary to reach a desired accuracy. This work provides a foundation for developing methods to analyse very large upcoming datasets with the next-generation observational facilities.

Submitted 19 July, 2021; originally announced July 2021.

Comments: 10 pages, 9 figures

arXiv:1901.02646 [pdf, other]

What do Language Representations Really Represent?

Authors: Johannes Bjerva, Robert Östling, Maria Han Veiga, Jörg Tiedemann, Isabelle Augenstein

Abstract: A neural language model trained on a text corpus can be used to induce distributed representations of words, such that similar words end up with similar representations. If the corpus is multilingual, the same model can be used to learn distributed representations of languages, such that similar languages end up with similar representations. We show that this holds even when the multilingual corpus has been translated into English, by picking up the faint signal left by the source languages. However, just as it is a thorny problem to separate semantic from syntactic similarity in word representations, it is not obvious what type of similarity is captured by language representations. We investigate correlations and causal relationships between language representations learned from translations on one hand, and genetic, geographical, and several levels of structural similarity between languages on the other. Of these, structural similarity is found to correlate most strongly with language representation similarity, while genetic relationships (a convenient benchmark used for evaluation in previous work) appear to be a confounding factor. Apart from implications about translation effects, we see this more generally as a case where NLP and linguistic typology can interact and benefit one another.

Submitted 9 January, 2019; originally announced January 2019.

Comments: 8 pages, accepted for publication in Computational Linguistics (squib)

arXiv:1607.03274 [pdf, other]

doi 10.1145/2911451.2914666

A Cross-Platform Collection of Social Network Profiles

Authors: Maria Han Veiga, Carsten Eickhoff

Abstract: The proliferation of Internet-enabled devices and services has led to a shifting balance between digital and analogue aspects of our everyday lives. In the face of this development there is a growing demand for the study of privacy hazards, the potential for unique user de-anonymization and information leakage between the various social media profiles many of us maintain. To enable the structured study of such adversarial effects, this paper presents a dedicated dataset of cross-platform social network personas (i.e., the same person has accounts on multiple platforms). The corpus comprises 850 users who generate predominantly English content. Each user object contains the online footprint of the same person in three distinct social networks: Twitter, Instagram and Foursquare. In total, it encompasses over 2.5M tweets, 340k check-ins and 42k Instagram posts. We describe the collection methodology, characteristics of the dataset, and how to obtain it. Finally, we discuss a common use case, cross-platform user identification.

Submitted 12 July, 2016; originally announced July 2016.

Comments: 4 pages, 5 figures, SIGIR 2016, short paper. SIGIR 2016 Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval

arXiv:1607.02714 [pdf, other]

Privacy Leakage through Innocent Content Sharing in Online Social Networks

Authors: Maria Han Veiga, Carsten Eickhoff

Abstract: The increased popularity and ubiquitous availability of online social networks and globalised Internet access have affected the way in which people share content. The information that users willingly disclose on these platforms can be used for various purposes, from building consumer models for advertising, to inferring personal, potentially invasive, information. In this work, we use Twitter, Instagram and Foursquare data to convey the idea that the content shared by users, especially when aggregated across platforms, can potentially disclose more information than was originally intended. We perform two case studies: First, we perform user de-anonymization by mimicking the scenario of finding the identity of a user making anonymous posts within a group of users. Empirical evaluation on a sample of real-world social network profiles suggests that cross-platform aggregation introduces significant performance gains in user identification. In the second task, we show that it is possible to infer physical location visits of a user on the basis of shared Twitter and Instagram content. We present an informativeness scoring function which estimates the relevance and novelty of a shared piece of information with respect to an inference task. This measure is validated using an active learning framework which chooses the most informative content at each given point in time. Based on a large-scale data sample, we show that by doing this, we can attain an improved inference performance. In some cases this performance exceeds even the use of the user's full timeline.

Submitted 10 July, 2016; originally announced July 2016.

Comments: 8 pages, 10 figures, submitted to Privacy Preserving Workshop, SIGIR

Showing 1–6 of 6 results for author: Veiga, M H