Search | arXiv e-print repository

How to Data in Datathons

Authors: Carlos Mougan, Richard Plant, Clare Teng, Marya Bazzi, Alvaro Cabrejas-Egea, Ryan Sze-Yin Chan, David Salvador Jasin, Martin Stoffel, Kirstie Jane Whitaker, Jules Manser

Abstract: The rise of datathons, also known as data or data science hackathons, has provided a platform to collaborate, learn, and innovate in a short timeframe. Despite their significant potential benefits, organizations often struggle to effectively work with data due to a lack of clear guidelines and best practices for potential issues that might arise. Drawing on our own experiences and insights from organizing >80 datathon challenges with >60 partnership organizations since 2016, we provide guidelines and recommendations that serve as a resource for organizers to navigate the data-related complexities of datathons. We apply our proposed framework to 10 case studies.

Submitted 25 October, 2023; v1 submitted 18 September, 2023; originally announced September 2023.

Comments: 37th Conference on Neural Information Processing Systems (NeurIPS 2023) Track on Datasets and Benchmarks

arXiv:2204.09391 [pdf, other]

You Are What You Write: Preserving Privacy in the Era of Large Language Models

Authors: Richard Plant, Valerio Giuffrida, Dimitra Gkatzia

Abstract: Large-scale adoption of large language models has introduced a new era of convenient knowledge transfer for a slew of natural language processing tasks. However, these models also run the risk of undermining user trust by exposing unwanted information about the data subjects, which may be extracted by a malicious party, e.g. through adversarial attacks. We present an empirical investigation into the extent of the personal information encoded into pre-trained representations by a range of popular models, and we show a positive correlation between the complexity of a model, the amount of data used in pre-training, and data leakage. In this paper, we present the first wide-coverage evaluation and comparison of some of the most popular privacy-preserving algorithms, on a large, multi-lingual dataset on sentiment analysis annotated with demographic information (location, age and gender). The results show that, since larger and more complex models are more prone to leaking private information, the use of privacy-preserving methods is highly desirable. We also find that highly privacy-preserving technologies like differential privacy (DP) can have serious model utility effects, which can be ameliorated using hybrid or metric-DP techniques.

Submitted 20 April, 2022; originally announced April 2022.

arXiv:2112.02721 [pdf, other]

NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation

Authors: Kaustubh D. Dhole, Varun Gangal, Sebastian Gehrmann, Aadesh Gupta, Zhenhao Li, Saad Mahamood, Abinaya Mahendiran, Simon Mille, Ashish Shrivastava, Samson Tan, Tongshuang Wu, Jascha Sohl-Dickstein, Jinho D. Choi, Eduard Hovy, Ondrej Dusek, Sebastian Ruder, Sajant Anand, Nagender Aneja, Rabin Banjade, Lisa Barthe, Hanna Behnke, Ian Berlot-Attwell, Connor Boyle, Caroline Brun, Marco Antonio Sobrevilla Cabezudo , et al. (101 additional authors not shown)

Abstract: Data augmentation is an important component in the robustness evaluation of models in natural language processing (NLP) and in enhancing the diversity of the data they are trained on. In this paper, we present NL-Augmenter, a new participatory Python-based natural language augmentation framework which supports the creation of both transformations (modifications to the data) and filters (data splits according to specific features). We describe the framework and an initial set of 117 transformations and 23 filters for a variety of natural language tasks. We demonstrate the efficacy of NL-Augmenter by using several of its transformations to analyze the robustness of popular natural language models. The infrastructure, datacards and robustness analysis results are available publicly on the NL-Augmenter repository (https://github.com/GEM-benchmark/NL-Augmenter).

Submitted 11 October, 2022; v1 submitted 5 December, 2021; originally announced December 2021.

Comments: 39 pages, repository at https://github.com/GEM-benchmark/NL-Augmenter

arXiv:2108.12318 [pdf, other]

CAPE: Context-Aware Private Embeddings for Private Language Learning

Authors: Richard Plant, Dimitra Gkatzia, Valerio Giuffrida

Abstract: Deep learning-based language models have achieved state-of-the-art results in a number of applications including sentiment analysis, topic labelling, intent classification and others. Obtaining text representations or embeddings using these models presents the possibility of encoding personally identifiable information learned from language and context cues that may present a risk to reputation or privacy. To ameliorate these issues, we propose Context-Aware Private Embeddings (CAPE), a novel approach which preserves privacy during training of embeddings. To maintain the privacy of text representations, CAPE applies calibrated noise through differential privacy, preserving the encoded semantic links while obscuring sensitive information. In addition, CAPE employs an adversarial training regime that obscures identified private variables. Experimental results demonstrate that the proposed approach reduces private information leakage better than either single intervention.

Submitted 27 August, 2021; originally announced August 2021.

Comments: Accepted into EMNLP21 main conference

arXiv:2103.16446 [pdf, other]

CovidTracker: A comprehensive Covid-related social media dataset for NLP tasks

Authors: Richard Plant, Amir Hussain

Abstract: The Covid-19 pandemic presented an unprecedented global public health emergency, and concomitantly an unparalleled opportunity to investigate public responses to adverse social conditions. The widespread ability to post messages to social media platforms provided an invaluable outlet for such an outpouring of public sentiment, including not only expressions of social solidarity, but also the spread of misinformation and misconceptions around the effect and potential risks of the pandemic. This archive of message content therefore represents a key resource in understanding public responses to health crises, analysis of which could help to inform public policy interventions to better respond to similar events in future. We present a benchmark database of public social media postings from the United Kingdom related to the Covid-19 pandemic for academic research purposes, along with some initial analysis, including a taxonomy of key themes organised by keyword. This release supports the findings of a research study funded by the Scottish Government Chief Scientists' Office that aims to investigate social sentiment in order to understand the response to public health measures implemented during the pandemic.

Submitted 17 June, 2022; v1 submitted 30 March, 2021; originally announced March 2021.

arXiv:1102.2636 [pdf, ps, other]

doi 10.1098/rsta.2011.0377

A new modelling framework for statistical cumulus dynamics

Authors: R. S. Plant

Abstract: We propose a new modelling framework suitable for the description of atmospheric convective systems as a collection of distinct plumes. The literature contains many examples of models for collections of plumes in which strong simplifying assumptions are made, a diagnostic dependence of convection on the large-scale environment and the limit of many plumes often being imposed from the outset. Some recent studies have sought to remove one or the other of those assumptions. The proposed framework removes both, and is explicitly time-dependent and stochastic in its basic character. The statistical dynamics of the plume collection are defined through simple probabilistic rules applied at the level of individual plumes, and van Kampen's system size expansion is then used to construct the macroscopic limit of the microscopic model. Through suitable choices of the microscopic rules, the model is shown to encompass previous studies in the appropriate limits, and to allow their natural extensions beyond those limits.

Submitted 17 August, 2011; v1 submitted 13 February, 2011; originally announced February 2011.

Comments: edits to text in response to referee reports

arXiv:0801.4648 [pdf, ps, other]

doi 10.1002/qj.179

A note on boundary-layer friction in baroclinic cyclones

Authors: I. A. Boutle, R. J. Beare, S. E. Belcher, R. S. Plant

Abstract: The interaction between extratropical cyclones and the underlying boundary layer has been a topic of recent discussion in papers by Adamson et al. (2006) and Beare (2007). Their results emphasise different mechanisms through which the boundary layer dynamics may modify the growth of a baroclinic cyclone. By using different sea-surface temperature distributions and comparing the low-level winds, the differences are exposed and both of the proposed mechanisms appear to be acting within a single simulation.

Submitted 30 January, 2008; originally announced January 2008.

Comments: 5 pages, 3 figures

Journal ref: Quarterly Journal of the Royal Meteorological Society, 133, 2137-2141 (2007)

arXiv:hep-ph/0007340 [pdf, ps, other]

doi 10.1016/S0375-9474(01)01669-4

Mesonic fluctuations in a nonlocal Nambu-Jona-Lasinio model

Authors: Robert S. Plant, Michael C. Birse

Abstract: The effects of meson fluctuations are studied in a nonlocal generalization of the Nambu-Jona-Lasinio model, by including terms of next-to-leading order (NLO) in $1/N_c$. In the model with only scalar and pseudoscalar interactions NLO contributions to the quark condensate are found to be very small. This is a result of cancellation between virtual mesons and Fock terms, which occurs for the parameter sets of most interest. In the quark self-energy, similar cancellations arise in the tadpole diagrams, although not in other NLO pieces which contribute at the $\sim 25\%$ level. The effects on pion properties are also found to be small. NLO contributions from real $ππ$ intermediate states increase the sigma meson mass by $\sim 30\%$. In an extended model with vector and axial interactions, there are indications that NLO effects could be larger.

Submitted 13 November, 2001; v1 submitted 28 July, 2000; originally announced July 2000.

Comments: 22 pages (RevTeX), 12 figures (using graphicx.sty), v3 has improved numerics

Report number: TH/00/05

Journal ref: Nucl.Phys. A703 (2002) 717-744

arXiv:hep-ph/9705372 [pdf, ps, other]

doi 10.1016/S0375-9474(97)00635-0

Meson properties in an extended nonlocal NJL model

Authors: Robert S. Plant, Michael C. Birse

Abstract: We consider a nonlocal version of the NJL model, based on a separable quark-quark interaction. The interaction is extended to include terms that bind vector and axial-vector mesons. The nonlocality means that no further regulator is required. Moreover the model is able to confine the quarks by generating a quark propagator without poles at real energies. Working in the ladder approximation, we calculate amplitudes in Euclidean space and discuss features of their continuation to Minkowski energies. Conserved currents are constructed and we demonstrate their consistency with various Ward identities. Various meson masses are calculated, along with their strong and electromagnetic decay amplitudes. We also calculate the electromagnetic form factor of the pion, as well as form factors associated with the processes $\gamma\gamma^* \to \pi^0$ and $\omega \to \pi^0\gamma^*$. The results are found to lead to a satisfactory phenomenology and lend some dynamical support to the idea of vector-meson dominance.

Submitted 17 November, 1997; v1 submitted 21 May, 1997; originally announced May 1997.

Comments: 56 pages (RevTeX), 13 figures (using epsfig.sty and axodraw.sty); revised version to appear in Nuclear Physics A: further discussion of several points, Fig. 12 updated to include recent CLEO data, misprints corrected

Report number: MC/TH 97/06

Journal ref: Nucl.Phys. A628 (1998) 607-644

arXiv:hep-ph/9508356 [pdf, ps, other]

doi 10.1016/0370-2693(95)01273-7

$ρ\to 4π$ in chirally symmetric models

Authors: Robert S. Plant, Michael C. Birse

Abstract: The decays $ρ^0\to 2π^+2π^-$ and $ρ^0\to 2π^0π^+π^-$ are studied using various effective Lagrangians for $π$ and $ρ$ (and in some cases $a_1$) mesons, all of which respect the approximate chiral symmetry of the strong interaction. Partial widths of the order of 1 keV or less are found in all cases. These are an order of magnitude smaller than recent predictions based on non-chiral models.

Submitted 23 August, 1995; originally announced August 1995.

Comments: 12 pages (RevTeX)

Report number: MC/TH 95/14

Journal ref: Phys.Lett. B365 (1996) 292

Showing 1–10 of 10 results for author: Plant, R