Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

people standing in front of a screen with images and a chipboard

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
1 - 15 of 10028 publications
    Taming Self-Training for Open-Vocabulary Object Detection
    Shiyu Zhao
    Samuel Schulter
    Zhixing Zhang
    Vijay Kumar B G
    Yumin Suh
    Manmohan Chandraker
    Dimitris N. Metaxas
    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)
    Preview abstract Recent studies have shown promising performance in open-vocabulary object detection (OVD) by utilizing pseudo labels (PLs) from pretrained vision and language models (VLMs). However, teacher-student self-training, a powerful and widely used paradigm to leverage PLs, is rarely explored for OVD. This work identifies two challenges of using self-training in OVD: noisy PLs from VLMs and frequent distribution changes of PLs. To address these challenges, we propose SAS-Det that tames self-training for OVD from two key perspectives. First, we present a split-and-fusion (SAF) head that splits a standard detection into an open-branch and a closed-branch. This design can reduce noisy supervision from pseudo boxes. Moreover, the two branches learn complementary knowledge from different training data, significantly enhancing performance when fused together. Second, in our view, unlike in closed-set tasks, the PL distributions in OVD are solely determined by the teacher model. We introduce a periodic update strategy to decrease the number of updates to the teacher, thereby decreasing the frequency of changes in PL distributions, which stabilizes the training process. Extensive experiments demonstrate SAS-Det is both efficient and effective. SAS-Det outperforms recent models of the same scale by a clear margin and achieves 37.4 AP50 and 29.1 APr on novel categories of the COCO and LVIS benchmarks, respectively. View details
    Connecting Language Technologies with Rich, Diverse Data Sources Covering Thousands of Languages
    Sebastian Ruder
    Julia Kreutzer
    Clara Rivera
    Ishank Saxena
    Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
    Preview abstract Contrary to common belief, there are rich and diverse data sources available for many thousands of languages, which can be used to develop technologies for these languages. In this paper, we provide an overview of some of the major online data sources, the types of data that they provide access to, potential applications of this data, and the number of languages that they cover. Even this covers only a small fraction of the data that exists; for example, printed books are published in many languages but few online aggregators exist. View details
    Preview abstract This document specifies how to augment the Routing Policy Specification Language (RPSL) inetnum: class to refer specifically to geofeed comma-separated values (CSV) data files and describes an optional scheme that uses the Resource Public Key Infrastructure (RPKI) to authenticate the geofeed data files. This document obsoletes RFC 9092. View details
    The Scaling Law of Synthetic Images for Model Training, for Now
    Lijie Fan
    Dilip Krishnan
    Dina Katabi
    Phillip Isola
    Yonglong Tian
    CVPR (2024)
    Preview abstract Recent significant advances in text-to-image models unlock the possibility of training vision systems using synthetic images, potentially overcoming the difficulty of collecting curated data at scale. It is unclear, however, how these models behave at scale, as more synthetic data is added to the training set. In this paper we study the scaling laws of synthetic images generated by state of the art text-to-image models, for the training of supervised models: image classifiers with label supervision, and CLIP with language supervision. We identify several factors, including text prompts, classifier-free guidance scale, and types of text-to-image models, that significantly affect scaling behavior. After tuning these factors, we observe that synthetic images demonstrate a scaling trend similar to, but slightly less effective than, real images in CLIP training, while they significantly underperform in scaling when training supervised image classifiers. Our analysis indicates that the main reason for this underperformance is the inability of off-the-shelf text-to-image models to generate certain concepts, a limitation that significantly impairs the training of image classifiers. Our findings also suggest that scaling synthetic data can be particularly effective in scenarios such as: (1) when there is a limited supply of real images for a supervised problem (e.g., fewer than 0.5 million images in ImageNet), (2) when the evaluation dataset diverges significantly from the training data, indicating the out-of-distribution scenario, or (3) when synthetic data is used in conjunction with real images, as demonstrated in the training of CLIP models. View details
    Preview abstract Recent developments in large language models (LLMs) have shown promise in their ability to generate synthetic query-document pairs by prompting LLMs with as few as 8 demonstrations \cite{dai2022promptagator}. This has enabled building better IR models especially for tasks which have no training data readily available. Typically, such synthetic query generation (QGen) approaches condition on an input context (e.g. document) and generate a query that is relevant to that context or condition the QGen model additionally on the relevance label (e.g. relevant vs irrelevant) to generate queries across relevance buckets. However, we find that such QGen approaches are sub-optimal as it requires the model to reason about the desired label and the input from only a handful of examples, which is not trivial, especially when the relevance buckets are nuanced. In this work, we propose to reduce this burden of LLMs by generating queries simultaneously for different labels (e.g. relevance buckets). We hypothesize that instead of asking the model to generate, say, an irrelevant query given an input context, asking the model to generate an irrelevant query with respect to a relevant query is a much simpler task setup for the model to reason about. Extensive experimentation across seven IR datasets shows that synthetic queries generated in such a fashion translates to a better downstream performance, suggesting that the generated queries are indeed of higher quality. View details
    Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines
    Yuchen Li
    Alexandre Kirchmeyer
    Aashay Mehta
    Yilong Qin
    Andrej Risteski
    International Conference on Machine Learning (2024) (to appear)
    Preview abstract Autoregressive language models are the currently dominant paradigm for text generation, however they have some fundamental limitations that cannot be remedied by scale ---for example inherently sequential and unidirectional generation. While alternate classes of models have been explored, we have limited mathematical understanding of their fundamental power and limitations. In this paper we focus on Generative Masked Language Models (GMLMs), a non-autoregressive paradigm in which we train a model to fit conditional probabilities of the data distribution via masking, which are subsequently used as inputs to a Markov Chain to draw samples from the model. These models empirically strike a promising speed-quality trade-off as each step can be typically parallelized by decoding the entire sequence in parallel. We develop a mathematical framework for analyzing and improving such models which sheds light on questions of sample complexity and inference speed and quality. Empirically, we adapt the T5 model for iteratively-refined parallel decoding, achieving 2-3x speedup in machine translation with minimal sacrifice in quality compared with autoregressive models. We run careful ablation experiments to give recommendations on key design choices, and make fine-grained observations on the common error modes in connection with our theory. Our mathematical analyses and empirical observations characterize both potentials and limitations of this approach, and can be applied to future works on improving understanding and performance of GMLMs. View details
    A data-centric perspective on the information needed for hydrological uncertainty predictions
    Andreas Auer
    Martin Gauch
    Frederik Kratzert
    Sepp Hochreiter
    Daniel Klotz
    Hydrology and Earth System Sciences (2024)
    Preview abstract Uncertainty estimates are fundamental to assess the reliability of predictive models in hydrology. We use the framework of conformal prediction to investigate the impact of temporal and spatial information on uncertainty estimates within hydrological predictions. Integrating recent information significantly enhances overall uncertainty predictions, even with substantial gaps between updates. While local information yields good results on average, it proves to be insufficient for peak-flow predictions. Incorporating global information improves the accuracy of peak-flow bounds, corroborating findings from related studies. Overall, the study underscores the importance of continuous data updates and the integration of global information for robust and efficient uncertainty estimation. View details
    SAC126 - DNSSEC Delegation Signer (DS) Record Automation
    Internet Corporation for Assigned Names and Numbers (ICANN), ICANN Security and Stability Advisory Committee (SSAC) Reports and Advisories (2024), pp. 39
    Preview abstract The deployment of Domain Name System (DNS) Security Extensions (DNSSEC) has been hindered by a number of obstacles. This report focuses on one: the management of Delegation Signer (DS) records, which connect a child zone’s DNSSEC public key and signatures to the chain of trust provided by its parent zone (e.g., a zone corresponding to a top-level domain). DNSSEC is not simply enabled by signing a delegated domain’s DNS zone with DNSSEC signatures. It is also necessary to configure (and later maintain) appropriate DS records, which involves coordinated actions by the DNS operator, registrant, registrar, and registry. In the case where the domain’s DNS service is operated by the registrar, this process can be reduced to a simple internal operation by the registrar. If the functions are separated, this is not possible. This report is therefore focused on when the domain’s DNS service is not operated by the registrar, but by a third-party DNS operator. In such a scenario, current practice holds the registrant responsible for coordinating DS maintenance. The registrant (or someone appointed by them) needs to first obtain DNSSEC public key parameters from the DNS operator, and convey these parameters to the registrar (potentially via a reseller). The registrar will then need to relay these DNSSEC public key parameters to the registry, who will use them to create and publish the DS record in the parent zone. This process often involves idiosyncratic interfaces for each combination of DNS operator and registrar, requiring a level of engagement and time investment, awareness, and understanding that often do not match with what the registrant knows or expects. The complexity of the process further introduces opportunity for error. This can be alleviated by employing automation for the data exchanges required for DS maintenance so that, when the domain’s DNS service is operated by a third party, registries or registrars can, without human involvement, obtain all information needed for keeping DS records up to date. Various approaches to achieve this are possible, such as a scheme where the registry or registrar actively contacts the Child DNS operator, or vice versa. The different approaches come with different challenges with respect to authentication, timing, and efficiency. The IETF has standardized specifications around the first approach, where the parent pulls information from the Child DNS operator, and operational experience has been gained over recent years. However, some standardization gaps remain (such as to improve efficiency and error handling). In addition, the industry could benefit from further development of best practices in deploying the technology. The SSAC believes that automated DS maintenance should be a goal for the domain name industry. To make this a reality, the SSAC makes several recommendations with the goal to spur industry players and ICANN towards an industry best practice for DNSSEC DS automation. View details
    Preview abstract This paper reflects on work at Google over the past decade to address common types of software safety and security defects. Our experience has shown that software safety is an emergent property of the software and tooling ecosystem it is developed in and the production environment into which it is deployed. Thus, to effectively prevent common weaknesses at scale, we need to shift-left the responsibility for ensuring safety and security invariants to the end-to-end developer ecosystem, that is, programming languages, software libraries, application frameworks, build and deployment tooling, the production platform and its configuration surfaces, and so forth. Doing so is practical and cost effective when developer ecosystems are designed with application archetypes in mind, such as web or mobile apps: The design of the developer ecosystem can address threat model aspects that apply commonly to all applications of the respective archetype, and investments to ensure safety invariants at the ecosystem level amortize across many applications. Applying secure-by-design principles to developer ecosystems at Google has achieved drastic reduction and in some cases near-zero residual rates of common classes of defects, across hundreds of applications being developed by thousands of developers. View details
    Preview abstract Japanese text-to-pronunciation modelling is a notoriously data-intensive problem. Japanese data sources are often only partially annotated, and use different annotation standards for pronunciation and word segmentation. This talk introduces a set of techniques that enable ingesting data that may be partially annotated, use arbitrary word segmentations, and use a variety of pronunciation annotation standards. View details
    Understanding metric-related pitfalls in image analysis validation
    Annika Reinke
    Lena Maier-Hein
    Paul Jager
    Shravya Shetty
    Understanding Metrics Workgroup
    Nature Methods (2024)
    Preview abstract Validation metrics are key for the reliable tracking of scientific progress and for bridging the current chasm between artificial intelligence (AI) research and its translation into practice. However, increasing evidence shows that particularly in image analysis, metrics are often chosen inadequately in relation to the underlying research problem. This could be attributed to a lack of accessibility of metric-related knowledge: While taking into account the individual strengths, weaknesses, and limitations of validation metrics is a critical prerequisite to making educated choices, the relevant knowledge is currently scattered and poorly accessible to individual researchers. Based on a multi-stage Delphi process conducted by a multidisciplinary expert consortium as well as extensive community feedback, the present work provides the first reliable and comprehensive common point of access to information on pitfalls related to validation metrics in image analysis. Focusing on biomedical image analysis but with the potential of transfer to other fields, the addressed pitfalls generalize across application domains and are categorized according to a newly created, domain-agnostic taxonomy. To facilitate comprehension, illustrations and specific examples accompany each pitfall. As a structured body of information accessible to researchers of all levels of expertise, this work enhances global comprehension of a key topic in image analysis validation. View details
    Preview abstract Sequence labeling is a core task in text understanding for IE/IR systems. Text generation models have increasingly become the go-to solution for such tasks (e.g., entity extraction and dialog slot filling). While most research has focused on the labeling accuracy, a key aspect -- of vital practical importance -- has slipped through the cracks: understanding model confidence. More specifically, we lack a principled understanding of how to reliably gauge the confidence of a model in its predictions for each labeled span. This paper aims to provide some empirical insights on estimating model confidence for generative sequence labeling. Most notably, we find that simply using the decoder's output probabilities is not the best in realizing well-calibrated confidence estimates. As verified over six public datasets of different tasks, we show that our proposed approach -- which leverages statistics from top-k predictions by a beam search -- significantly reduces calibration errors of the predictions of a generative sequence labeling model. View details
    Preview abstract The emergence of synthetic data represents a pivotal shift in modern machine learning, offering a solution to satisfy the need for large volumes of data in domains where real data is scarce, highly private, or difficult to obtain. We investigate the feasibility of creating realistic, large-scale synthetic datasets of user-generated content, noting that such content is increasingly prevalent and a source of frequently sought information. Large language models (LLMs) offer a starting point for generating synthetic social media discussion threads, due to their ability to produce diverse responses that typify online interactions. However, as we demonstrate, straightforward application of LLMs yields limited success in capturing the complex structure of online discussions, and standard prompting mechanisms lack sufficient control. We therefore propose a multi-step generation process, predicated on the idea of creating compact representations of discussion threads, referred to as scaffolds. Our framework is generic yet adaptable to the unique characteristics of specific social media platforms. We demonstrate its feasibility using data from two distinct online discussion platforms. To address the fundamental challenge of ensuring the representativeness and realism of synthetic data, we propose a portfolio of evaluation measures to compare various instantiations of our framework. View details
    Meta-Manager: A Tool for Collecting and Exploring Meta Information about Code
    Amber Horvath
    Brad A. Myers
    CHI '24: Proceedings of the CHI Conference on Human Factors in Computing Systems (2024)
    Preview abstract Modern software engineering is in a state of flux. With more development utilizing AI code generation tools and the continued reliance on online programming resources, understanding code and the original intent behind it is becoming more important than it ever has been. To this end, we have developed the “Meta-Manager”, a Visual Studio Code extension, with a supplementary browser extension, that automatically collects and organizes changes made to code while keeping track of the provenance of each part of the code, including code that has been copy-pasted from popular programming resources online. These sources and subsequent changes are represented in the editor and may be explored using searching and filtering mechanisms to help developers answer historically hard-to-answer questions about code, its provenance, and its design rationale. In our evaluation of Meta-Manager, we found developers were successfully able to use it to answer otherwise unanswerable questions about an unfamiliar code base. View details
    50 Shades of Support: A Device-Centric Analysis of Android Security Updates
    Abbas Acar
    Esteban Luques
    Harun Oz
    Ahmet Aris
    Selcuk Uluagac
    Network and Distributed System Security (NDSS) Symposium (2024)
    Preview abstract Android is by far the most popular OS with over three billion active mobile devices. As in any software, uncovering vulnerabilities on Android devices and applying timely patches are both critical. Android Open Source Project (AOSP) has initiated efforts to improve the traceability of security updates through Security Patch Levels (SPLs) assigned to devices. While this initiative provided better traceability for the vulnerabilities, it has not entirely resolved the issues related to the timeliness and availability of security updates for end users. Recent studies on Android security updates have focused on the issue of delay during the security update roll-out, largely attributing this to factors related to fragmentation. However, these studies fail to capture the entire Android ecosystem as they primarily examine flagship devices or do not paint a comprehensive picture of the Android devices’ lifecycle due to the datasets spanning over a short timeframe. To address this gap in the literature, we utilize a device-centric approach to analyze the security update behavior of Android devices. Our approach aims to understand the security update distribution behavior of OEMs (e.g., Samsung) by using a representative set of devices from each OEM and characterize the complete lifecycle of an average Android device. We obtained 367K official security update records from public sources, span- ning from 2014 to 2023. Our dataset contains 599 unique devices from four major OEMs that are used in 97 countries and are associated with 109 carriers. We identify significant differences in the roll-out of security updates across different OEMs, device models/types, and geographical regions across the world. Our findings show that the reasons for the delay in the roll-out of security updates are not limited to fragmentation but also involve OEM-specific factors. Our analysis also uncovers certain key issues that can be readily addressed as well as exemplary practices that can be immediately adopted by OEMs in practice. View details