Search | arXiv e-print repository

Allocation Requires Prediction Only if Inequality Is Low

Authors: Ali Shirali, Rediet Abebe, Moritz Hardt

Abstract: Algorithmic predictions are emerging as a promising solution concept for efficiently allocating societal resources. Fueling their use is an underlying assumption that such systems are necessary to identify individuals for interventions. We propose a principled framework for assessing this assumption: Using a simple mathematical model, we evaluate the efficacy of prediction-based allocations in set… ▽ More Algorithmic predictions are emerging as a promising solution concept for efficiently allocating societal resources. Fueling their use is an underlying assumption that such systems are necessary to identify individuals for interventions. We propose a principled framework for assessing this assumption: Using a simple mathematical model, we evaluate the efficacy of prediction-based allocations in settings where individuals belong to larger units such as hospitals, neighborhoods, or schools. We find that prediction-based allocations outperform baseline methods using aggregate unit-level statistics only when between-unit inequality is low and the intervention budget is high. Our results hold for a wide range of settings for the price of prediction, treatment effect heterogeneity, and unit-level statistics' learnability. Combined, we highlight the potential limits to improving the efficacy of interventions through prediction. △ Less

Submitted 19 June, 2024; originally announced June 2024.

Comments: Appeared in Forty-first International Conference on Machine Learning (ICML), 2024

arXiv:2406.03422 [pdf, other]

Causal Inference from Competing Treatments

Authors: Ana-Andreea Stoica, Vivian Y. Nastl, Moritz Hardt

Abstract: Many applications of RCTs involve the presence of multiple treatment administrators -- from field experiments to online advertising -- that compete for the subjects' attention. In the face of competition, estimating a causal effect becomes difficult, as the position at which a subject sees a treatment influences their response, and thus the treatment effect. In this paper, we build a game-theoreti… ▽ More Many applications of RCTs involve the presence of multiple treatment administrators -- from field experiments to online advertising -- that compete for the subjects' attention. In the face of competition, estimating a causal effect becomes difficult, as the position at which a subject sees a treatment influences their response, and thus the treatment effect. In this paper, we build a game-theoretic model of agents who wish to estimate causal effects in the presence of competition, through a bidding system and a utility function that minimizes estimation error. Our main technical result establishes an approximation with a tractable objective that maximizes the sample value obtained through strategically allocating budget on subjects. This allows us to find an equilibrium in our model: we show that the tractable objective has a pure Nash equilibrium, and that any Nash equilibrium is an approximate equilibrium for our general objective that minimizes estimation error under broad conditions. Conceptually, our work successfully combines elements from causal inference and game theory to shed light on the equilibrium behavior of experimentation under competition. △ Less

Submitted 5 June, 2024; originally announced June 2024.

Comments: 37 pages, 3 figures, accepted at ICML'24

arXiv:2405.19073 [pdf, other]

An engine not a camera: Measuring performative power of online search

Authors: Celestine Mendler-Dünner, Gabriele Carovano, Moritz Hardt

Abstract: The power of digital platforms is at the center of major ongoing policy and regulatory efforts. To advance existing debates, we designed and executed an experiment to measure the power of online search providers, building on the recent definition of performative power. Instantiated in our setting, performative power quantifies the ability of a search engine to steer web traffic by rearranging resu… ▽ More The power of digital platforms is at the center of major ongoing policy and regulatory efforts. To advance existing debates, we designed and executed an experiment to measure the power of online search providers, building on the recent definition of performative power. Instantiated in our setting, performative power quantifies the ability of a search engine to steer web traffic by rearranging results. To operationalize this definition we developed a browser extension that performs unassuming randomized experiments in the background. These randomized experiments emulate updates to the search algorithm and identify the causal effect of different content arrangements on clicks. We formally relate these causal effects to performative power. Analyzing tens of thousands of clicks, we discuss what our robust quantitative findings say about the power of online search engines. More broadly, we envision our work to serve as a blueprint for how performative power and online experiments can be integrated with future investigations into the economic power of digital platforms. △ Less

Submitted 29 May, 2024; originally announced May 2024.

arXiv:2405.01719 [pdf, other]

Inherent Trade-Offs between Diversity and Stability in Multi-Task Benchmarks

Authors: Guanhua Zhang, Moritz Hardt

Abstract: We examine multi-task benchmarks in machine learning through the lens of social choice theory. We draw an analogy between benchmarks and electoral systems, where models are candidates and tasks are voters. This suggests a distinction between cardinal and ordinal benchmark systems. The former aggregate numerical scores into one model ranking; the latter aggregate rankings for each task. We apply Ar… ▽ More We examine multi-task benchmarks in machine learning through the lens of social choice theory. We draw an analogy between benchmarks and electoral systems, where models are candidates and tasks are voters. This suggests a distinction between cardinal and ordinal benchmark systems. The former aggregate numerical scores into one model ranking; the latter aggregate rankings for each task. We apply Arrow's impossibility theorem to ordinal benchmarks to highlight the inherent limitations of ordinal systems, particularly their sensitivity to the inclusion of irrelevant models. Inspired by Arrow's theorem, we empirically demonstrate a strong trade-off between diversity and sensitivity to irrelevant changes in existing multi-task benchmarks. Our result is based on new quantitative measures of diversity and sensitivity that we introduce. Sensitivity quantifies the impact that irrelevant changes to tasks have on a benchmark. Diversity captures the degree of disagreement in model rankings across tasks. We develop efficient approximation algorithms for both measures, as exact computation is computationally challenging. Through extensive experiments on seven cardinal benchmarks and eleven ordinal benchmarks, we demonstrate a clear trade-off between diversity and stability: The more diverse a multi-task benchmark, the more sensitive to trivial changes it is. Additionally, we show that the aggregated rankings of existing benchmarks are highly unstable under irrelevant changes. The codes and data are available at https://socialfoundations.github.io/benchbench/. △ Less

Submitted 6 May, 2024; v1 submitted 2 May, 2024; originally announced May 2024.

Comments: To be published in ICML 2024

arXiv:2404.02112 [pdf, other]

ImageNot: A contrast with ImageNet preserves model rankings

Authors: Olawale Salaudeen, Moritz Hardt

Abstract: We introduce ImageNot, a dataset designed to match the scale of ImageNet while differing drastically in other aspects. We show that key model architectures developed for ImageNet over the years rank identically when trained and evaluated on ImageNot to how they rank on ImageNet. This is true when training models from scratch or fine-tuning them. Moreover, the relative improvements of each model ov… ▽ More We introduce ImageNot, a dataset designed to match the scale of ImageNet while differing drastically in other aspects. We show that key model architectures developed for ImageNet over the years rank identically when trained and evaluated on ImageNot to how they rank on ImageNet. This is true when training models from scratch or fine-tuning them. Moreover, the relative improvements of each model over earlier models strongly correlate in both datasets. We further give evidence that ImageNot has a similar utility as ImageNet for transfer learning purposes. Our work demonstrates a surprising degree of external validity in the relative performance of image classification models. This stands in contrast with absolute accuracy numbers that typically drop sharply even under small changes to a dataset. △ Less

Submitted 2 April, 2024; originally announced April 2024.

arXiv:2402.09891 [pdf, other]

Predictors from causal features do not generalize better to new domains

Authors: Vivian Y. Nastl, Moritz Hardt

Abstract: We study how well machine learning models trained on causal features generalize across domains. We consider 16 prediction tasks on tabular datasets covering applications in health, employment, education, social benefits, and politics. Each dataset comes with multiple domains, allowing us to test how well a model trained in one domain performs in another. For each prediction task, we select feature… ▽ More We study how well machine learning models trained on causal features generalize across domains. We consider 16 prediction tasks on tabular datasets covering applications in health, employment, education, social benefits, and politics. Each dataset comes with multiple domains, allowing us to test how well a model trained in one domain performs in another. For each prediction task, we select features that have a causal influence on the target of prediction. Our goal is to test the hypothesis that models trained on causal features generalize better across domains. Without exception, we find that predictors using all available features, regardless of causality, have better in-domain and out-of-domain accuracy than predictors using causal features. Moreover, even the absolute drop in accuracy from one domain to the other is no better for causal predictors than for models that use all features. If the goal is to generalize to new domains, practitioners might as well train the best possible model on all available features. △ Less

Submitted 15 February, 2024; originally announced February 2024.

Comments: 13 pages, 7 figures

arXiv:2402.02249 [pdf, other]

Don't Label Twice: Quantity Beats Quality when Comparing Binary Classifiers on a Budget

Authors: Florian E. Dorner, Moritz Hardt

Abstract: We study how to best spend a budget of noisy labels to compare the accuracy of two binary classifiers. It's common practice to collect and aggregate multiple noisy labels for a given data point into a less noisy label via a majority vote. We prove a theorem that runs counter to conventional wisdom. If the goal is to identify the better of two classifiers, we show it's best to spend the budget on c… ▽ More We study how to best spend a budget of noisy labels to compare the accuracy of two binary classifiers. It's common practice to collect and aggregate multiple noisy labels for a given data point into a less noisy label via a majority vote. We prove a theorem that runs counter to conventional wisdom. If the goal is to identify the better of two classifiers, we show it's best to spend the budget on collecting a single label for more samples. Our result follows from a non-trivial application of Cramér's theorem, a staple in the theory of large deviations. We discuss the implications of our work for the design of machine learning benchmarks, where they overturn some time-honored recommendations. In addition, our results provide sample size bounds superior to what follows from Hoeffding's bound. △ Less

Submitted 3 February, 2024; originally announced February 2024.

Comments: 34 pages, 3 Figures

arXiv:2311.04806 [pdf, other]

The PetShop Dataset -- Finding Causes of Performance Issues across Microservices

Authors: Michaela Hardt, William R. Orchard, Patrick Blöbaum, Shiva Kasiviswanathan, Elke Kirschbaum

Abstract: Identifying root causes for unexpected or undesirable behavior in complex systems is a prevalent challenge. This issue becomes especially crucial in modern cloud applications that employ numerous microservices. Although the machine learning and systems research communities have proposed various techniques to tackle this problem, there is currently a lack of standardized datasets for quantitative b… ▽ More Identifying root causes for unexpected or undesirable behavior in complex systems is a prevalent challenge. This issue becomes especially crucial in modern cloud applications that employ numerous microservices. Although the machine learning and systems research communities have proposed various techniques to tackle this problem, there is currently a lack of standardized datasets for quantitative benchmarking. Consequently, research groups are compelled to create their own datasets for experimentation. This paper introduces a dataset specifically designed for evaluating root cause analyses in microservice-based applications. The dataset encompasses latency, requests, and availability metrics emitted in 5-minute intervals from a distributed application. In addition to normal operation metrics, the dataset includes 68 injected performance issues, which increase latency and reduce availability throughout the system. We showcase how this dataset can be used to evaluate the accuracy of a variety of methods spanning different causal and non-causal characterisations of the root cause analysis problem. We hope the new dataset, available at https://github.com/amazon-science/petshop-root-cause-analysis/ enables further development of techniques in this important area. △ Less

Submitted 8 April, 2024; v1 submitted 8 November, 2023; originally announced November 2023.

Comments: 22 pages, 6 figures, 10 tables, for associated git repo see https://github.com/amazon-science/petshop-root-cause-analysis/, to be published in Proceedings of Machine Learning Research vol 236, 2024, 3rd Conference on Causal Learning and Reasoning

ACM Class: E.0

arXiv:2310.16608 [pdf, other]

Performative Prediction: Past and Future

Authors: Moritz Hardt, Celestine Mendler-Dünner

Abstract: Predictions in the social world generally influence the target of prediction, a phenomenon known as performativity. Self-fulfilling and self-negating predictions are examples of performativity. Of fundamental importance to economics, finance, and the social sciences, the notion has been absent from the development of machine learning. In machine learning applications, performativity often surfaces… ▽ More Predictions in the social world generally influence the target of prediction, a phenomenon known as performativity. Self-fulfilling and self-negating predictions are examples of performativity. Of fundamental importance to economics, finance, and the social sciences, the notion has been absent from the development of machine learning. In machine learning applications, performativity often surfaces as distribution shift. A predictive model deployed on a digital platform, for example, influences consumption and thereby changes the data-generating distribution. We survey the recently founded area of performative prediction that provides a definition and conceptual framework to study performativity in machine learning. A consequence of performative prediction is a natural equilibrium notion that gives rise to new optimization challenges. Another consequence is a distinction between learning and steering, two mechanisms at play in performative prediction. The notion of steering is in turn intimately related to questions of power in digital markets. We review the notion of performative power that gives an answer to the question how much a platform can steer participants through its predictions. We end on a discussion of future directions, such as the role that performativity plays in contesting algorithmic systems. △ Less

Submitted 25 October, 2023; originally announced October 2023.

arXiv:2306.15769 [pdf, other]

What Makes ImageNet Look Unlike LAION

Authors: Ali Shirali, Moritz Hardt

Abstract: ImageNet was famously created from Flickr image search results. What if we recreated ImageNet instead by searching the massive LAION dataset based on image captions alone? In this work, we carry out this counterfactual investigation. We find that the resulting ImageNet recreation, which we call LAIONet, looks distinctly unlike the original. Specifically, the intra-class similarity of images in the… ▽ More ImageNet was famously created from Flickr image search results. What if we recreated ImageNet instead by searching the massive LAION dataset based on image captions alone? In this work, we carry out this counterfactual investigation. We find that the resulting ImageNet recreation, which we call LAIONet, looks distinctly unlike the original. Specifically, the intra-class similarity of images in the original ImageNet is dramatically higher than it is for LAIONet. Consequently, models trained on ImageNet perform significantly worse on LAIONet. We propose a rigorous explanation for the discrepancy in terms of a subtle, yet important, difference in two plausible causal data-generating processes for the respective datasets, that we support with systematic experimentation. In a nutshell, searching based on an image caption alone creates an information bottleneck that mitigates the selection bias otherwise present in image-based filtering. Our explanation formalizes a long-held intuition in the community that ImageNet images are stereotypical, unnatural, and overly simple representations of the class category. At the same time, it provides a simple and actionable takeaway for future dataset creation efforts. △ Less

Submitted 27 June, 2023; originally announced June 2023.

arXiv:2306.07951 [pdf, other]

Questioning the Survey Responses of Large Language Models

Authors: Ricardo Dominguez-Olmedo, Moritz Hardt, Celestine Mendler-Dünner

Abstract: As large language models increase in capability, researchers have started to conduct surveys of all kinds on these models in order to investigate the population represented by their responses. In this work, we critically examine language models' survey responses on the basis of the well-established American Community Survey by the U.S. Census Bureau and investigate whether they elicit a faithful r… ▽ More As large language models increase in capability, researchers have started to conduct surveys of all kinds on these models in order to investigate the population represented by their responses. In this work, we critically examine language models' survey responses on the basis of the well-established American Community Survey by the U.S. Census Bureau and investigate whether they elicit a faithful representations of any human population. Using a de-facto standard multiple-choice prompting technique and evaluating 39 different language models using systematic experiments, we establish two dominant patterns: First, models' responses are governed by ordering and labeling biases, leading to variations across models that do not persist after adjusting for systematic biases. Second, models' responses do not contain the entropy variations and statistical signals typically found in human populations. As a result, a binary classifier can almost perfectly differentiate model-generated data from the responses of the U.S. census. At the same time, models' relative alignment with different demographic subgroups can be predicted from the subgroups' entropy, irrespective of the model's training data or training strategy. Taken together, our findings suggest caution in treating models' survey responses as equivalent to those of human populations. △ Less

Submitted 28 February, 2024; v1 submitted 13 June, 2023; originally announced June 2023.

arXiv:2306.07261 [pdf, other]

Unprocessing Seven Years of Algorithmic Fairness

Authors: André F. Cruz, Moritz Hardt

Abstract: Seven years ago, researchers proposed a postprocessing method to equalize the error rates of a model across different demographic groups. The work launched hundreds of papers purporting to improve over the postprocessing baseline. We empirically evaluate these claims through thousands of model evaluations on several tabular datasets. We find that the fairness-accuracy Pareto frontier achieved by p… ▽ More Seven years ago, researchers proposed a postprocessing method to equalize the error rates of a model across different demographic groups. The work launched hundreds of papers purporting to improve over the postprocessing baseline. We empirically evaluate these claims through thousands of model evaluations on several tabular datasets. We find that the fairness-accuracy Pareto frontier achieved by postprocessing contains all other methods we were feasibly able to evaluate. In doing so, we address two common methodological errors that have confounded previous observations. One relates to the comparison of methods with different unconstrained base models. The other concerns methods achieving different levels of constraint relaxation. At the heart of our study is a simple idea we call unprocessing that roughly corresponds to the inverse of postprocessing. Unprocessing allows for a direct comparison of methods using different underlying models and levels of relaxation. △ Less

Submitted 15 March, 2024; v1 submitted 12 June, 2023; originally announced June 2023.

Journal ref: ICLR 2024

arXiv:2305.18466 [pdf, other]

Test-Time Training on Nearest Neighbors for Large Language Models

Authors: Moritz Hardt, Yu Sun

Abstract: Many recent efforts augment language models with retrieval, by adding retrieved data to the input context. For this approach to succeed, the retrieved data must be added at both training and test time. Moreover, as input length grows linearly with the size of retrieved data, cost in computation and memory grows quadratically for modern Transformers. To avoid these complications, we simply fine-tun… ▽ More Many recent efforts augment language models with retrieval, by adding retrieved data to the input context. For this approach to succeed, the retrieved data must be added at both training and test time. Moreover, as input length grows linearly with the size of retrieved data, cost in computation and memory grows quadratically for modern Transformers. To avoid these complications, we simply fine-tune the model on retrieved data at test time, using its standard training setup. We build a large-scale distributed index based on text embeddings of the Pile dataset. For each test input, our system retrieves its neighbors and fine-tunes the model on their text. Surprisingly, retrieving and training on as few as 20 neighbors, each for only one gradient iteration, drastically improves performance across more than 20 language modeling tasks in the Pile. For example, test-time training with nearest neighbors significantly narrows the performance gap between a small GPT-2 and a GPT-Neo model more than 10 times larger. Sufficient index quality and size, however, are necessary. Our work establishes a first baseline of test-time training for language modeling. △ Less

Submitted 2 February, 2024; v1 submitted 29 May, 2023; originally announced May 2023.

Comments: ICLR final version

arXiv:2305.09565 [pdf, other]

Toward Falsifying Causal Graphs Using a Permutation-Based Test

Authors: Elias Eulig, Atalanti A. Mastakouri, Patrick Blöbaum, Michaela Hardt, Dominik Janzing

Abstract: Understanding the causal relationships among the variables of a system is paramount to explain and control its behaviour. Inferring the causal graph from observational data without interventions, however, requires a lot of strong assumptions that are not always realistic. Even for domain experts it can be challenging to express the causal graph. Therefore, metrics that quantitatively assess the go… ▽ More Understanding the causal relationships among the variables of a system is paramount to explain and control its behaviour. Inferring the causal graph from observational data without interventions, however, requires a lot of strong assumptions that are not always realistic. Even for domain experts it can be challenging to express the causal graph. Therefore, metrics that quantitatively assess the goodness of a causal graph provide helpful checks before using it in downstream tasks. Existing metrics provide an absolute number of inconsistencies between the graph and the observed data, and without a baseline, practitioners are left to answer the hard question of how many such inconsistencies are acceptable or expected. Here, we propose a novel consistency metric by constructing a surrogate baseline through node permutations. By comparing the number of inconsistencies with those on the surrogate baseline, we derive an interpretable metric that captures whether the DAG fits significantly better than random. Evaluating on both simulated and real data sets from various domains, including biology and cloud monitoring, we demonstrate that the true DAG is not falsified by our metric, whereas the wrong graphs given by a hypothetical user are likely to be falsified. △ Less

Submitted 16 May, 2023; originally announced May 2023.

Comments: 23 pages, 9 figures

arXiv:2305.05832 [pdf, other]

Causal Information Splitting: Engineering Proxy Features for Robustness to Distribution Shifts

Authors: Bijan Mazaheri, Atalanti Mastakouri, Dominik Janzing, Michaela Hardt

Abstract: Statistical prediction models are often trained on data from different probability distributions than their eventual use cases. One approach to proactively prepare for these shifts harnesses the intuition that causal mechanisms should remain invariant between environments. Here we focus on a challenging setting in which the causal and anticausal variables of the target are unobserved. Leaning on i… ▽ More Statistical prediction models are often trained on data from different probability distributions than their eventual use cases. One approach to proactively prepare for these shifts harnesses the intuition that causal mechanisms should remain invariant between environments. Here we focus on a challenging setting in which the causal and anticausal variables of the target are unobserved. Leaning on information theory, we develop feature selection and engineering techniques for the observed downstream variables that act as proxies. We identify proxies that help to build stable models and moreover utilize auxiliary training tasks to answer counterfactual questions that extract stability-enhancing information from proxies. We demonstrate the effectiveness of our techniques on synthetic and real data. △ Less

Submitted 31 July, 2023; v1 submitted 9 May, 2023; originally announced May 2023.

Comments: 29th Conference on Uncertainty in Artificial Intelligence (2023)

arXiv:2304.06205 [pdf, other]

Difficult Lessons on Social Prediction from Wisconsin Public Schools

Authors: Juan C. Perdomo, Tolani Britton, Moritz Hardt, Rediet Abebe

Abstract: Early warning systems (EWS) are predictive tools at the center of recent efforts to improve graduation rates in public schools across the United States. These systems assist in targeting interventions to individual students by predicting which students are at risk of dropping out. Despite significant investments in their widespread adoption, there remain large gaps in our understanding of the effi… ▽ More Early warning systems (EWS) are predictive tools at the center of recent efforts to improve graduation rates in public schools across the United States. These systems assist in targeting interventions to individual students by predicting which students are at risk of dropping out. Despite significant investments in their widespread adoption, there remain large gaps in our understanding of the efficacy of EWS, and the role of statistical risk scores in education. In this work, we draw on nearly a decade's worth of data from a system used throughout Wisconsin to provide the first large-scale evaluation of the long-term impact of EWS on graduation outcomes. We present empirical evidence that the prediction system accurately sorts students by their dropout risk. We also find that it may have caused a single-digit percentage increase in graduation rates, though our empirical analyses cannot reliably rule out that there has been no positive treatment effect. Going beyond a retrospective evaluation of DEWS, we draw attention to a central question at the heart of the use of EWS: Are individual risk scores necessary for effectively targeting interventions? We propose a simple mechanism that only uses information about students' environments -- such as their schools, and districts -- and argue that this mechanism can target interventions just as efficiently as the individual risk score-based mechanism. Our argument holds even if individual predictions are highly accurate and effective interventions exist. In addition to motivating this simple targeting mechanism, our work provides a novel empirical backbone for the robust qualitative understanding among education researchers that dropout is structurally determined. Combined, our insights call into question the marginal value of individual predictions in settings where outcomes are driven by high levels of inequality. △ Less

Submitted 18 September, 2023; v1 submitted 12 April, 2023; originally announced April 2023.

arXiv:2302.04989 [pdf, other]

Causal Inference out of Control: Estimating the Steerability of Consumption

Authors: Gary Cheng, Moritz Hardt, Celestine Mendler-Dünner

Abstract: Regulators and academics are increasingly interested in the causal effect that algorithmic actions of a digital platform have on consumption. We introduce a general causal inference problem we call the steerability of consumption that abstracts many settings of interest. Focusing on observational designs and exploiting the structure of the problem, we exhibit a set of assumptions for causal identi… ▽ More Regulators and academics are increasingly interested in the causal effect that algorithmic actions of a digital platform have on consumption. We introduce a general causal inference problem we call the steerability of consumption that abstracts many settings of interest. Focusing on observational designs and exploiting the structure of the problem, we exhibit a set of assumptions for causal identifiability that significantly weaken the often unrealistic overlap assumptions of standard designs. The key novelty of our approach is to explicitly model the dynamics of consumption over time, viewing the platform as a controller acting on a dynamical system. From this dynamical systems perspective, we are able to show that exogenous variation in consumption and appropriately responsive algorithmic control actions are sufficient for identifying steerability of consumption. Our results illustrate the fruitful interplay of control theory and causal inference, which we illustrate with examples from econometrics, macroeconomics, and machine learning. △ Less

Submitted 9 February, 2023; originally announced February 2023.

arXiv:2302.04262 [pdf, other]

Algorithmic Collective Action in Machine Learning

Authors: Moritz Hardt, Eric Mazumdar, Celestine Mendler-Dünner, Tijana Zrnic

Abstract: We initiate a principled study of algorithmic collective action on digital platforms that deploy machine learning algorithms. We propose a simple theoretical model of a collective interacting with a firm's learning algorithm. The collective pools the data of participating individuals and executes an algorithmic strategy by instructing participants how to modify their own data to achieve a collecti… ▽ More We initiate a principled study of algorithmic collective action on digital platforms that deploy machine learning algorithms. We propose a simple theoretical model of a collective interacting with a firm's learning algorithm. The collective pools the data of participating individuals and executes an algorithmic strategy by instructing participants how to modify their own data to achieve a collective goal. We investigate the consequences of this model in three fundamental learning-theoretic settings: the case of a nonparametric optimal learning algorithm, a parametric risk minimizer, and gradient-based optimization. In each setting, we come up with coordinated algorithmic strategies and characterize natural success criteria as a function of the collective's size. Complementing our theory, we conduct systematic experiments on a skill classification task involving tens of thousands of resumes from a gig platform for freelancers. Through more than two thousand model training runs of a BERT-like language model, we see a striking correspondence emerge between our empirical observations and the predictions made by our theory. Taken together, our theory and experiments broadly support the conclusion that algorithmic collectives of exceedingly small fractional size can exert significant control over a platform's learning algorithm. △ Less

Submitted 21 June, 2023; v1 submitted 8 February, 2023; originally announced February 2023.

Comments: accepted at ICML 2023, camera-ready updates

arXiv:2211.08667 [pdf, other]

County-level Algorithmic Audit of Racial Bias in Twitter's Home Timeline

Authors: Luca Belli, Kyra Yee, Uthaipon Tantipongpipat, Aaron Gonzales, Kristian Lum, Moritz Hardt

Abstract: We report on the outcome of an audit of Twitter's Home Timeline ranking system. The goal of the audit was to determine if authors from some racial groups experience systematically higher impression counts for their Tweets than others. A central obstacle for any such audit is that Twitter does not ordinarily collect or associate racial information with its users, thus prohibiting an analysis at the… ▽ More We report on the outcome of an audit of Twitter's Home Timeline ranking system. The goal of the audit was to determine if authors from some racial groups experience systematically higher impression counts for their Tweets than others. A central obstacle for any such audit is that Twitter does not ordinarily collect or associate racial information with its users, thus prohibiting an analysis at the level of individual authors. Working around this obstacle, we take US counties as our unit of analysis. We associate each user in the United States on the Twitter platform to a county based on available location data. The US Census Bureau provides information about the racial decomposition of the population in each county. The question we investigate then is if the racial decomposition of a county is associated with the visibility of Tweets originating from within the county. Focusing on two racial groups, the Black or African American population and the White population as defined by the US Census Bureau, we evaluate two statistical measures of bias. Our investigation represents the first large-scale algorithmic audit into racial bias on the Twitter platform. Additionally, it illustrates the challenges of measuring racial bias in online platforms without having such information on the users. △ Less

Submitted 10 February, 2023; v1 submitted 15 November, 2022; originally announced November 2022.

arXiv:2210.03165 [pdf, other]

A Theory of Dynamic Benchmarks

Authors: Ali Shirali, Rediet Abebe, Moritz Hardt

Abstract: Dynamic benchmarks interweave model fitting and data collection in an attempt to mitigate the limitations of static benchmarks. In contrast to an extensive theoretical and empirical study of the static setting, the dynamic counterpart lags behind due to limited empirical studies and no apparent theoretical foundation to date. Responding to this deficit, we initiate a theoretical study of dynamic b… ▽ More Dynamic benchmarks interweave model fitting and data collection in an attempt to mitigate the limitations of static benchmarks. In contrast to an extensive theoretical and empirical study of the static setting, the dynamic counterpart lags behind due to limited empirical studies and no apparent theoretical foundation to date. Responding to this deficit, we initiate a theoretical study of dynamic benchmarking. We examine two realizations, one capturing current practice and the other modeling more complex settings. In the first model, where data collection and model fitting alternate sequentially, we prove that model performance improves initially but can stall after only three rounds. Label noise arising from, for instance, annotator disagreement leads to even stronger negative results. Our second model generalizes the first to the case where data collection and model fitting have a hierarchical dependency structure. We show that this design guarantees strictly more progress than the first, albeit at a significant increase in complexity. We support our theoretical analysis by simulating dynamic benchmarks on two popular datasets. These results illuminate the benefits and practical limitations of dynamic benchmarking, providing both a theoretical foundation and a causal explanation for observed bottlenecks in empirical work. △ Less

Submitted 1 March, 2023; v1 submitted 6 October, 2022; originally announced October 2022.

Comments: ICLR 2023 Version

arXiv:2206.11673 [pdf, other]

doi 10.1145/3617694.3623225

Is your model predicting the past?

Authors: Moritz Hardt, Michael P. Kim

Abstract: When does a machine learning model predict the future of individuals and when does it recite patterns that predate the individuals? In this work, we propose a distinction between these two pathways of prediction, supported by theoretical, empirical, and normative arguments. At the center of our proposal is a family of simple and efficient statistical tests, called backward baselines, that demonstr… ▽ More When does a machine learning model predict the future of individuals and when does it recite patterns that predate the individuals? In this work, we propose a distinction between these two pathways of prediction, supported by theoretical, empirical, and normative arguments. At the center of our proposal is a family of simple and efficient statistical tests, called backward baselines, that demonstrate if, and to what extent, a model recounts the past. Our statistical theory provides guidance for interpreting backward baselines, establishing equivalences between different baselines and familiar statistical concepts. Concretely, we derive a meaningful backward baseline for auditing a prediction system as a black box, given only background variables and the system's predictions. Empirically, we evaluate the framework on different prediction tasks derived from longitudinal panel surveys, demonstrating the ease and effectiveness of incorporating backward baselines into the practice of machine learning. △ Less

Submitted 10 March, 2024; v1 submitted 23 June, 2022; originally announced June 2022.

Comments: Code available at: https://github.com/socialfoundations/backward_baselines

arXiv:2206.09305 [pdf, other]

doi 10.1145/3531146.3533228

Adversarial Scrutiny of Evidentiary Statistical Software

Authors: Rediet Abebe, Moritz Hardt, Angela Jin, John Miller, Ludwig Schmidt, Rebecca Wexler

Abstract: The U.S. criminal legal system increasingly relies on software output to convict and incarcerate people. In a large number of cases each year, the government makes these consequential decisions based on evidence from statistical software -- such as probabilistic genotyping, environmental audio detection, and toolmark analysis tools -- that defense counsel cannot fully cross-examine or scrutinize.… ▽ More The U.S. criminal legal system increasingly relies on software output to convict and incarcerate people. In a large number of cases each year, the government makes these consequential decisions based on evidence from statistical software -- such as probabilistic genotyping, environmental audio detection, and toolmark analysis tools -- that defense counsel cannot fully cross-examine or scrutinize. This undermines the commitments of the adversarial criminal legal system, which relies on the defense's ability to probe and test the prosecution's case to safeguard individual rights. Responding to this need to adversarially scrutinize output from such software, we propose robust adversarial testing as an audit framework to examine the validity of evidentiary statistical software. We define and operationalize this notion of robust adversarial testing for defense use by drawing on a large body of recent work in robust machine learning and algorithmic fairness. We demonstrate how this framework both standardizes the process for scrutinizing such tools and empowers defense lawyers to examine their validity for instances most relevant to the case at hand. We further discuss existing structural and institutional challenges within the U.S. criminal legal system that may create barriers for implementing this and other such audit frameworks and close with a discussion on policy changes that could help address these concerns. △ Less

Submitted 30 September, 2022; v1 submitted 18 June, 2022; originally announced June 2022.

Comments: Typos corrected, appendix B removed

ACM Class: K.4.1; I.2.1; G.3; D.2.5

arXiv:2203.17232 [pdf, other]

Performative Power

Authors: Moritz Hardt, Meena Jagadeesan, Celestine Mendler-Dünner

Abstract: We introduce the notion of performative power, which measures the ability of a firm operating an algorithmic system, such as a digital content recommendation platform, to cause change in a population of participants. We relate performative power to the economic study of competition in digital economies. Traditional economic concepts struggle with identifying anti-competitive patterns in digital pl… ▽ More We introduce the notion of performative power, which measures the ability of a firm operating an algorithmic system, such as a digital content recommendation platform, to cause change in a population of participants. We relate performative power to the economic study of competition in digital economies. Traditional economic concepts struggle with identifying anti-competitive patterns in digital platforms not least due to the complexity of market definition. In contrast, performative power is a causal notion that is identifiable with minimal knowledge of the market, its internals, participants, products, or prices. Low performative power implies that a firm can do no better than to optimize their objective on current data. In contrast, firms of high performative power stand to benefit from steering the population towards more profitable behavior. We confirm in a simple theoretical model that monopolies maximize performative power. A firm's ability to personalize increases performative power, while competition and outside options decrease performative power. On the empirical side, we propose an observational causal design to identify performative power from discontinuities in how digital platforms display content. This allows to repurpose causal effects from various studies about digital platforms as lower bounds on performative power. Finally, we speculate about the role that performative power might play in competition policy and antitrust enforcement in digital marketplaces. △ Less

Submitted 3 November, 2022; v1 submitted 31 March, 2022; originally announced March 2022.

Comments: to appear at NeurIPS 2022

arXiv:2203.08074 [pdf, other]

Combining AI/ML and PHY Layer Rule Based Inference -- Some First Results

Authors: Brenda Vilas Boas, Wolfgang Zirwas, Martin Haardt

Abstract: In 3GPP New Radio (NR) Release 18 we see the first study item starting in May 2022, which will evaluate the potential of AI/ML methods for Radio Access Network (RAN) 1, i.e., for mobile radio PHY and MAC layer applications. We use the profiling method for accurate iterative estimation of multipath component parameters for PHY layer reference, as it promises a large channel prediction horizon. We i… ▽ More In 3GPP New Radio (NR) Release 18 we see the first study item starting in May 2022, which will evaluate the potential of AI/ML methods for Radio Access Network (RAN) 1, i.e., for mobile radio PHY and MAC layer applications. We use the profiling method for accurate iterative estimation of multipath component parameters for PHY layer reference, as it promises a large channel prediction horizon. We investigate options to partly or fully replace some functionalities of this rule based PHY layer method by AI/ML inferences, with the goal to achieve either a higher performance, lower latency, or, reduced processing complexity. We provide first results for noise reduction, then a combined scheme for model order selection, compare options to infer multipath component start parameters, and, provide an outlook on a possible channel prediction framework. △ Less

Submitted 14 March, 2022; originally announced March 2022.

Comments: submitted to SPAWC 2022

arXiv:2111.09831 [pdf, other]

Causal Forecasting:Generalization Bounds for Autoregressive Models

Authors: Leena Chennuru Vankadara, Philipp Michael Faller, Michaela Hardt, Lenon Minorics, Debarghya Ghoshdastidar, Dominik Janzing

Abstract: Despite the increasing relevance of forecasting methods, causal implications of these algorithms remain largely unexplored. This is concerning considering that, even under simplifying assumptions such as causal sufficiency, the statistical risk of a model can differ significantly from its \textit{causal risk}. Here, we study the problem of \textit{causal generalization} -- generalizing from the ob… ▽ More Despite the increasing relevance of forecasting methods, causal implications of these algorithms remain largely unexplored. This is concerning considering that, even under simplifying assumptions such as causal sufficiency, the statistical risk of a model can differ significantly from its \textit{causal risk}. Here, we study the problem of \textit{causal generalization} -- generalizing from the observational to interventional distributions -- in forecasting. Our goal is to find answers to the question: How does the efficacy of an autoregressive (VAR) model in predicting statistical associations compare with its ability to predict under interventions? To this end, we introduce the framework of \textit{causal learning theory} for forecasting. Using this framework, we obtain a characterization of the difference between statistical and causal risks, which helps identify sources of divergence between them. Under causal sufficiency, the problem of causal generalization amounts to learning under covariate shifts, albeit with additional structure (restriction to interventional distributions under the VAR model). This structure allows us to obtain uniform convergence bounds on causal generalizability for the class of VAR models. To the best of our knowledge, this is the first work that provides theoretical guarantees for causal generalization in the time-series setting. △ Less

Submitted 8 September, 2022; v1 submitted 18 November, 2021; originally announced November 2021.

arXiv:2111.07858 [pdf, other]

Transfer Learning Capabilities of Untrained Neural Networks for MIMO CSI Recreation

Authors: Brenda Vilas Boas, Wolfgang Zirwas, Martin Haardt

Abstract: Machine learning (ML) applications for wireless communications have gained momentum on the standardization discussions for 5G advanced and beyond. One of the biggest challenges for real world ML deployment is the need for labeled signals and big measurement campaigns. To overcome those problems, we propose the use of untrained neural networks (UNNs) for MIMO channel recreation/estimation and low o… ▽ More Machine learning (ML) applications for wireless communications have gained momentum on the standardization discussions for 5G advanced and beyond. One of the biggest challenges for real world ML deployment is the need for labeled signals and big measurement campaigns. To overcome those problems, we propose the use of untrained neural networks (UNNs) for MIMO channel recreation/estimation and low overhead reporting. The UNNs learn the propagation environment by fitting a few channel measurements and we exploit their learned prior to provide higher channel estimation gains. Moreover, we present a UNN for simultaneous channel recreation for multiple users, or multiple user equipment (UE) positions, in which we have a trade-off between the estimated channel gain and the number of parameters. Our results show that transfer learning techniques are effective in accessing the learned prior on the environment structure as they provide higher channel gain for neighbouring users. Moreover, we indicate how the under-parameterization of UNNs can further enable low-overhead channel state information (CSI) reporting. △ Less

Submitted 15 November, 2021; originally announced November 2021.

Comments: to be published

arXiv:2111.07854 [pdf, other]

Machine Learning for CSI Recreation Based on Prior Knowledge

Authors: Brenda Vilas Boas, Wolfgang Zirwas, Martin Haardt

Abstract: Knowledge of channel state information (CSI) is fundamental to many functionalities within the mobile wireless communications systems. With the advance of machine learning (ML) and digital maps, i.e., digital twins, we have a big opportunity to learn the propagation environment and design novel methods to derive and report CSI. In this work, we propose to combine untrained neural networks (UNNs) a… ▽ More Knowledge of channel state information (CSI) is fundamental to many functionalities within the mobile wireless communications systems. With the advance of machine learning (ML) and digital maps, i.e., digital twins, we have a big opportunity to learn the propagation environment and design novel methods to derive and report CSI. In this work, we propose to combine untrained neural networks (UNNs) and conditional generative adversarial networks (cGANs) for MIMO channel recreation based on prior knowledge. The UNNs learn the prior-CSI for some locations which are used to build the input to a cGAN. Based on the prior-CSIs, their locations and the location of the desired channel, the cGAN is trained to output the channel expected at the desired location. This combined approach can be used for low overhead CSI reporting as, after training, we only need to report the desired location. Our results show that our method is successful in modelling the wireless channel and robust to location quantization errors in line of sight conditions. △ Less

Submitted 15 November, 2021; originally announced November 2021.

Comments: submitted for publication

arXiv:2110.11010 [pdf, other]

doi 10.1073/pnas.2025334119

Algorithmic Amplification of Politics on Twitter

Authors: Ferenc Huszár, Sofia Ira Ktena, Conor O'Brien, Luca Belli, Andrew Schlaikjer, Moritz Hardt

Abstract: Content on Twitter's home timeline is selected and ordered by personalization algorithms. By consistently ranking certain content higher, these algorithms may amplify some messages while reducing the visibility of others. There's been intense public and scholarly debate about the possibility that some political groups benefit more from algorithmic amplification than others. We provide quantitative… ▽ More Content on Twitter's home timeline is selected and ordered by personalization algorithms. By consistently ranking certain content higher, these algorithms may amplify some messages while reducing the visibility of others. There's been intense public and scholarly debate about the possibility that some political groups benefit more from algorithmic amplification than others. We provide quantitative evidence from a long-running, massive-scale randomized experiment on the Twitter platform that committed a randomized control group including nearly 2M daily active accounts to a reverse-chronological content feed free of algorithmic personalization. We present two sets of findings. First, we studied Tweets by elected legislators from major political parties in 7 countries. Our results reveal a remarkably consistent trend: In 6 out of 7 countries studied, the mainstream political right enjoys higher algorithmic amplification than the mainstream political left. Consistent with this overall trend, our second set of findings studying the U.S. media landscape revealed that algorithmic amplification favours right-leaning news sources. We further looked at whether algorithms amplify far-left and far-right political groups more than moderate ones: contrary to prevailing public belief, we did not find evidence to support this hypothesis. We hope our findings will contribute to an evidence-based debate on the role personalization algorithms play in shaping political content consumption. △ Less

Submitted 21 October, 2021; originally announced October 2021.

arXiv:2109.03285 [pdf, other]

doi 10.1145/3447548.3467177

Amazon SageMaker Clarify: Machine Learning Bias Detection and Explainability in the Cloud

Authors: Michaela Hardt, Xiaoguang Chen, Xiaoyi Cheng, Michele Donini, Jason Gelman, Satish Gollaprolu, John He, Pedro Larroy, Xinyu Liu, Nick McCarthy, Ashish Rathi, Scott Rees, Ankit Siva, ErhYuan Tsai, Keerthan Vasist, Pinar Yilmaz, Muhammad Bilal Zafar, Sanjiv Das, Kevin Haas, Tyler Hill, Krishnaram Kenthapadi

Abstract: Understanding the predictions made by machine learning (ML) models and their potential biases remains a challenging and labor-intensive task that depends on the application, the dataset, and the specific model. We present Amazon SageMaker Clarify, an explainability feature for Amazon SageMaker that launched in December 2020, providing insights into data and ML models by identifying biases and expl… ▽ More Understanding the predictions made by machine learning (ML) models and their potential biases remains a challenging and labor-intensive task that depends on the application, the dataset, and the specific model. We present Amazon SageMaker Clarify, an explainability feature for Amazon SageMaker that launched in December 2020, providing insights into data and ML models by identifying biases and explaining predictions. It is deeply integrated into Amazon SageMaker, a fully managed service that enables data scientists and developers to build, train, and deploy ML models at any scale. Clarify supports bias detection and feature importance computation across the ML lifecycle, during data preparation, model evaluation, and post-deployment monitoring. We outline the desiderata derived from customer input, the modular architecture, and the methodology for bias and explanation computations. Further, we describe the technical challenges encountered and the tradeoffs we had to make. For illustration, we discuss two customer use cases. We present our deployment results including qualitative customer feedback and a quantitative evaluation. Finally, we summarize lessons learned, and discuss best practices for the successful adoption of fairness and explanation tools in practice. △ Less

Submitted 7 September, 2021; originally announced September 2021.

Journal ref: In Proc. ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2974-2983 (2021)

arXiv:2108.04884 [pdf, other]

Retiring Adult: New Datasets for Fair Machine Learning

Authors: Frances Ding, Moritz Hardt, John Miller, Ludwig Schmidt

Abstract: Although the fairness community has recognized the importance of data, researchers in the area primarily rely on UCI Adult when it comes to tabular data. Derived from a 1994 US Census survey, this dataset has appeared in hundreds of research papers where it served as the basis for the development and comparison of many algorithmic fairness interventions. We reconstruct a superset of the UCI Adult… ▽ More Although the fairness community has recognized the importance of data, researchers in the area primarily rely on UCI Adult when it comes to tabular data. Derived from a 1994 US Census survey, this dataset has appeared in hundreds of research papers where it served as the basis for the development and comparison of many algorithmic fairness interventions. We reconstruct a superset of the UCI Adult data from available US Census sources and reveal idiosyncrasies of the UCI Adult dataset that limit its external validity. Our primary contribution is a suite of new datasets derived from US Census surveys that extend the existing data ecosystem for research on fair machine learning. We create prediction tasks relating to income, employment, health, transportation, and housing. The data span multiple years and all states of the United States, allowing researchers to study temporal shift and geographic variation. We highlight a broad initial sweep of new empirical insights relating to trade-offs between fairness criteria, performance of algorithmic interventions, and the role of distribution shift based on our new datasets. Our findings inform ongoing debates, challenge some existing narratives, and point to future research directions. Our datasets are available at https://github.com/zykls/folktables. △ Less

Submitted 9 January, 2022; v1 submitted 10 August, 2021; originally announced August 2021.

arXiv:2107.08995 [pdf, other]

doi 10.1145/3531146.3533103

Causal Inference Struggles with Agency on Online Platforms

Authors: Smitha Milli, Luca Belli, Moritz Hardt

Abstract: Online platforms regularly conduct randomized experiments to understand how changes to the platform causally affect various outcomes of interest. However, experimentation on online platforms has been criticized for having, among other issues, a lack of meaningful oversight and user consent. As platforms give users greater agency, it becomes possible to conduct observational studies in which users… ▽ More Online platforms regularly conduct randomized experiments to understand how changes to the platform causally affect various outcomes of interest. However, experimentation on online platforms has been criticized for having, among other issues, a lack of meaningful oversight and user consent. As platforms give users greater agency, it becomes possible to conduct observational studies in which users self-select into the treatment of interest as an alternative to experiments in which the platform controls whether the user receives treatment or not. In this paper, we conduct four large-scale within-study comparisons on Twitter aimed at assessing the effectiveness of observational studies derived from user self-selection on online platforms. In a within-study comparison, treatment effects from an observational study are assessed based on how effectively they replicate results from a randomized experiment with the same target population. We test the naive difference in group means estimator, exact matching, regression adjustment, and inverse probability of treatment weighting while controlling for plausible confounding variables. In all cases, all observational estimates perform poorly at recovering the ground-truth estimate from the analogous randomized experiments. In all cases except one, the observational estimates have the opposite sign of the randomized estimate. Our results suggest that observational studies derived from user self-selection are a poor alternative to randomized experimentation on online platforms. In discussing our results, we postulate a "Catch-22" that suggests that the success of causal inference in these settings may be at odds with the original motivations for providing users with greater agency. △ Less

Submitted 10 May, 2022; v1 submitted 19 July, 2021; originally announced July 2021.

Comments: Accepted to FaccT'22

arXiv:2106.12705 [pdf, other]

Alternative Microfoundations for Strategic Classification

Authors: Meena Jagadeesan, Celestine Mendler-Dünner, Moritz Hardt

Abstract: When reasoning about strategic behavior in a machine learning context it is tempting to combine standard microfoundations of rational agents with the statistical decision theory underlying classification. In this work, we argue that a direct combination of these standard ingredients leads to brittle solution concepts of limited descriptive and prescriptive value. First, we show that rational agent… ▽ More When reasoning about strategic behavior in a machine learning context it is tempting to combine standard microfoundations of rational agents with the statistical decision theory underlying classification. In this work, we argue that a direct combination of these standard ingredients leads to brittle solution concepts of limited descriptive and prescriptive value. First, we show that rational agents with perfect information produce discontinuities in the aggregate response to a decision rule that we often do not observe empirically. Second, when any positive fraction of agents is not perfectly strategic, desirable stable points -- where the classifier is optimal for the data it entails -- cease to exist. Third, optimal decision rules under standard microfoundations maximize a measure of negative externality known as social burden within a broad class of possible assumptions about agent behavior. Recognizing these limitations we explore alternatives to standard microfoundations for binary classification. We start by describing a set of desiderata that help navigate the space of possible assumptions about how agents respond to a decision rule. In particular, we analyze a natural constraint on feature manipulations, and discuss properties that are sufficient to guarantee the robust existence of stable points. Building on these insights, we then propose the noisy response model. Inspired by smoothed analysis and empirical observations, noisy response incorporates imperfection in the agent responses, which we show mitigates the limitations of standard microfoundations. Our model retains analytical tractability, leads to more robust insights about stable points, and imposes a lower social burden at optimality. △ Less

Submitted 23 June, 2021; originally announced June 2021.

Comments: Accepted for publication at ICML 2021

arXiv:2106.11633 [pdf, other]

Machine Learning for Model Order Selection in MIMO OFDM Systems

Authors: Brenda Vilas Boas, Wolfgang Zirwas, Martin Haardt

Abstract: A variety of wireless channel estimation methods, e.g., MUSIC and ESPRIT, rely on prior knowledge of the model order. Therefore, it is important to correctly estimate the number of multipath components (MPCs) which compose such channels. However, environments with many scatterers may generate MPCs which are closely spaced. This clustering of MPCs in addition to noise makes the model order selectio… ▽ More A variety of wireless channel estimation methods, e.g., MUSIC and ESPRIT, rely on prior knowledge of the model order. Therefore, it is important to correctly estimate the number of multipath components (MPCs) which compose such channels. However, environments with many scatterers may generate MPCs which are closely spaced. This clustering of MPCs in addition to noise makes the model order selection task difficult in practice to currently known algorithms. In this paper, we exploit the multidimensional characteristics of MIMO orthogonal frequency division multiplexing (OFDM) systems and propose a machine learning (ML) method capable of determining the number of MPCs with a higher accuracy than state of the art methods in almost coherent scenarios. Moreover, our results show that our proposed ML method has an enhanced reliability. △ Less

Submitted 22 June, 2021; originally announced June 2021.

Comments: to be published

arXiv:2106.09375 [pdf, other]

Recovery under Side Constraints

Authors: Khaled Ardah, Martin Haardt, Tianyi Liu, Frederic Matter, Marius Pesavento, Marc E. Pfetsch

Abstract: This paper addresses sparse signal reconstruction under various types of structural side constraints with applications in multi-antenna systems. Side constraints may result from prior information on the measurement system and the sparse signal structure. They may involve the structure of the sensing matrix, the structure of the non-zero support values, the temporal structure of the sparse represen… ▽ More This paper addresses sparse signal reconstruction under various types of structural side constraints with applications in multi-antenna systems. Side constraints may result from prior information on the measurement system and the sparse signal structure. They may involve the structure of the sensing matrix, the structure of the non-zero support values, the temporal structure of the sparse representationvector, and the nonlinear measurement structure. First, we demonstrate how a priori information in form of structural side constraints influence recovery guarantees (null space properties) using L1-minimization. Furthermore, for constant modulus signals, signals with row-, block- and rank-sparsity, as well as non-circular signals, we illustrate how structural prior information can be used to devise efficient algorithms with improved recovery performance and reduced computational complexity. Finally, we address the measurement system design for linear and nonlinear measurements of sparse signals. Moreover, we discuss the linear mixing matrix design based on coherence minimization. Then we extend our focus to nonlinear measurement systems where we design parallel optimization algorithms to efficiently compute stationary points in the sparse phase retrieval problem with and without dictionary learning. △ Less

Submitted 17 June, 2021; originally announced June 2021.

arXiv:2103.00971 [pdf, other]

Low-Complexity Zero-Forcing Precoding for XL-MIMO Transmissions

Authors: Lucas N. Ribeiro, Stefan Schwarz, Martin Haardt

Abstract: Deploying antenna arrays with an asymptotically large aperture will be central to achieving the theoretical gains of massive MIMO in beyond-5G systems. Such extra-large MIMO (XL-MIMO) systems experience propagation conditions which are not typically observed in conventional massive MIMO systems, such as spatial non-stationarities and near-field propagation. Moreover, standard precoding schemes, su… ▽ More Deploying antenna arrays with an asymptotically large aperture will be central to achieving the theoretical gains of massive MIMO in beyond-5G systems. Such extra-large MIMO (XL-MIMO) systems experience propagation conditions which are not typically observed in conventional massive MIMO systems, such as spatial non-stationarities and near-field propagation. Moreover, standard precoding schemes, such as zero-forcing (ZF), may not apply to XL-MIMO transmissions due to the prohibitive complexity associated with such a large-scale scenario. We propose two novel precoding schemes that aim at reducing the complexity without losing much performance. The proposed schemes leverage a plane-wave approximation and user grouping to obtain a low-complexity approximation of the ZF precoder. Our simulation results show that the proposed schemes offer a possibility for a performance and complexity trade-off compared to the benchmark schemes. △ Less

Submitted 1 March, 2021; originally announced March 2021.

Comments: Submitted to Eusipco 2021

arXiv:2102.05242 [pdf, other]

Patterns, predictions, and actions: A story about machine learning

Authors: Moritz Hardt, Benjamin Recht

Abstract: This graduate textbook on machine learning tells a story of how patterns in data support predictions and consequential actions. Starting with the foundations of decision making, we cover representation, optimization, and generalization as the constituents of supervised learning. A chapter on datasets as benchmarks examines their histories and scientific bases. Self-contained introductions to causa… ▽ More This graduate textbook on machine learning tells a story of how patterns in data support predictions and consequential actions. Starting with the foundations of decision making, we cover representation, optimization, and generalization as the constituents of supervised learning. A chapter on datasets as benchmarks examines their histories and scientific bases. Self-contained introductions to causality, the practice of causal inference, sequential decision making, and reinforcement learning equip the reader with concepts and tools to reason about actions and their consequences. Throughout, the text discusses historical context and societal impact. We invite readers from all backgrounds; some experience with probability, calculus, and linear algebra suffices. △ Less

Submitted 26 October, 2021; v1 submitted 9 February, 2021; originally announced February 2021.

Comments: Manuscript submitted to publisher for copy editing

arXiv:2101.09705 [pdf, other]

doi 10.1109/ICCWorkshops50388.2021.9473491

Two-step Machine Learning Approach for Channel Estimation with Mixed Resolution RF Chains

Authors: Brenda Vilas Boas, Wolfgang Zirwas, Martin Haardt

Abstract: Massive MIMO is one of the main features of 5G mobile radio systems. However, it often leads to high cost, size and power consumption. To overcome these issues, the use of constrained radio frequency (RF) frontends has been proposed, as well as novel precoders, e.g., a multi-antenna, greedy, iterative and quantized precoding algorithm (MAGIQ). Nevertheless, the best performance of MAGIQ assumes ac… ▽ More Massive MIMO is one of the main features of 5G mobile radio systems. However, it often leads to high cost, size and power consumption. To overcome these issues, the use of constrained radio frequency (RF) frontends has been proposed, as well as novel precoders, e.g., a multi-antenna, greedy, iterative and quantized precoding algorithm (MAGIQ). Nevertheless, the best performance of MAGIQ assumes accurate channel knowledge per antenna element, for example, from uplink sounding reference signals. In this context, we propose an efficient uplink channel estimator by applying machine learning (ML) algorithms. In a first step a conditional generative adversarial network (cGAN) predicts the radio channels from a limited set of full resolution RF chains to the rest of the low resolution RF chain antenna elements. A long-short term memory (LSTM) neural network extracts further phase information from the low resolution RF chain antenna elements. Our results indicate that our proposed approach is competitive with traditional Unitary tensor-ESPRIT in scenarios with various closely spaced multipath components (MPCs). △ Less

Submitted 24 January, 2021; originally announced January 2021.

Comments: to be published

arXiv:2009.10897 [pdf, other]

Revisiting Design Choices in Proximal Policy Optimization

Authors: Chloe Ching-Yun Hsu, Celestine Mendler-Dünner, Moritz Hardt

Abstract: Proximal Policy Optimization (PPO) is a popular deep policy gradient algorithm. In standard implementations, PPO regularizes policy updates with clipped probability ratios, and parameterizes policies with either continuous Gaussian distributions or discrete Softmax distributions. These design choices are widely accepted, and motivated by empirical performance comparisons on MuJoCo and Atari benchm… ▽ More Proximal Policy Optimization (PPO) is a popular deep policy gradient algorithm. In standard implementations, PPO regularizes policy updates with clipped probability ratios, and parameterizes policies with either continuous Gaussian distributions or discrete Softmax distributions. These design choices are widely accepted, and motivated by empirical performance comparisons on MuJoCo and Atari benchmarks. We revisit these practices outside the regime of current benchmarks, and expose three failure modes of standard PPO. We explain why standard design choices are problematic in these cases, and show that alternative choices of surrogate objectives and policy parameterizations can prevent the failure modes. We hope that our work serves as a reminder that many algorithmic design choices in reinforcement learning are tied to specific simulation environments. We should not implicitly accept these choices as a standard part of a more general algorithm. △ Less

Submitted 22 September, 2020; originally announced September 2020.

arXiv:2008.12623 [pdf, other]

doi 10.1145/3442188.3445933

From Optimizing Engagement to Measuring Value

Authors: Smitha Milli, Luca Belli, Moritz Hardt

Abstract: Most recommendation engines today are based on predicting user engagement, e.g. predicting whether a user will click on an item or not. However, there is potentially a large gap between engagement signals and a desired notion of "value" that is worth optimizing for. We use the framework of measurement theory to (a) confront the designer with a normative question about what the designer values, (b)… ▽ More Most recommendation engines today are based on predicting user engagement, e.g. predicting whether a user will click on an item or not. However, there is potentially a large gap between engagement signals and a desired notion of "value" that is worth optimizing for. We use the framework of measurement theory to (a) confront the designer with a normative question about what the designer values, (b) provide a general latent variable model approach that can be used to operationalize the target construct and directly optimize for it, and (c) guide the designer in evaluating and revising their operationalization. We implement our approach on the Twitter platform on millions of users. In line with established approaches to assessing the validity of measurements, we perform a qualitative evaluation of how well our model captures a desired notion of "value". △ Less

Submitted 19 July, 2021; v1 submitted 20 August, 2020; originally announced August 2020.

Comments: Published at FAccT'21

arXiv:2006.06887 [pdf, other]

Stochastic Optimization for Performative Prediction

Authors: Celestine Mendler-Dünner, Juan C. Perdomo, Tijana Zrnic, Moritz Hardt

Abstract: In performative prediction, the choice of a model influences the distribution of future data, typically through actions taken based on the model's predictions. We initiate the study of stochastic optimization for performative prediction. What sets this setting apart from traditional stochastic optimization is the difference between merely updating model parameters and deploying the new model. Th… ▽ More In performative prediction, the choice of a model influences the distribution of future data, typically through actions taken based on the model's predictions. We initiate the study of stochastic optimization for performative prediction. What sets this setting apart from traditional stochastic optimization is the difference between merely updating model parameters and deploying the new model. The latter triggers a shift in the distribution that affects future data, while the former keeps the distribution as is. Assuming smoothness and strong convexity, we prove rates of convergence for both greedily deploying models after each stochastic update (greedy deploy) as well as for taking several updates before redeploying (lazy deploy). In both cases, our bounds smoothly recover the optimal $O(1/k)$ rate as the strength of performativity decreases. Furthermore, they illustrate how depending on the strength of performative effects, there exists a regime where either approach outperforms the other. We experimentally explore the trade-off on both synthetic data and a strategic classification simulator. △ Less

Submitted 19 February, 2021; v1 submitted 11 June, 2020; originally announced June 2020.

Comments: published at NeurIPS 2020

arXiv:2003.06740 [pdf, other]

Balancing Competing Objectives with Noisy Data: Score-Based Classifiers for Welfare-Aware Machine Learning

Authors: Esther Rolf, Max Simchowitz, Sarah Dean, Lydia T. Liu, Daniel Björkegren, Moritz Hardt, Joshua Blumenstock

Abstract: While real-world decisions involve many competing objectives, algorithmic decisions are often evaluated with a single objective function. In this paper, we study algorithmic policies which explicitly trade off between a private objective (such as profit) and a public objective (such as social welfare). We analyze a natural class of policies which trace an empirical Pareto frontier based on learned… ▽ More While real-world decisions involve many competing objectives, algorithmic decisions are often evaluated with a single objective function. In this paper, we study algorithmic policies which explicitly trade off between a private objective (such as profit) and a public objective (such as social welfare). We analyze a natural class of policies which trace an empirical Pareto frontier based on learned scores, and focus on how such decisions can be made in noisy or data-limited regimes. Our theoretical results characterize the optimal strategies in this class, bound the Pareto errors due to inaccuracies in the scores, and show an equivalence between optimal strategies and a rich class of fairness-constrained profit-maximizing policies. We then present empirical results in two different contexts -- online content recommendation and sustainable abalone fisheries -- to underscore the applicability of our approach to a wide range of practical decisions. Taken together, these results shed light on inherent trade-offs in using machine learning for decisions that impact social welfare. △ Less

Submitted 15 July, 2020; v1 submitted 14 March, 2020; originally announced March 2020.

arXiv:2002.06673 [pdf, other]

Performative Prediction

Authors: Juan C. Perdomo, Tijana Zrnic, Celestine Mendler-Dünner, Moritz Hardt

Abstract: When predictions support decisions they may influence the outcome they aim to predict. We call such predictions performative; the prediction influences the target. Performativity is a well-studied phenomenon in policy-making that has so far been neglected in supervised learning. When ignored, performativity surfaces as undesirable distribution shift, routinely addressed with retraining. We devel… ▽ More When predictions support decisions they may influence the outcome they aim to predict. We call such predictions performative; the prediction influences the target. Performativity is a well-studied phenomenon in policy-making that has so far been neglected in supervised learning. When ignored, performativity surfaces as undesirable distribution shift, routinely addressed with retraining. We develop a risk minimization framework for performative prediction bringing together concepts from statistics, game theory, and causality. A conceptual novelty is an equilibrium notion we call performative stability. Performative stability implies that the predictions are calibrated not against past outcomes, but against the future outcomes that manifest from acting on the prediction. Our main results are necessary and sufficient conditions for the convergence of retraining to a performatively stable point of nearly minimal loss. In full generality, performative prediction strictly subsumes the setting known as strategic classification. We thus also give the first sufficient conditions for retraining to overcome strategic feedback effects. △ Less

Submitted 26 February, 2021; v1 submitted 16 February, 2020; originally announced February 2020.

Comments: published at ICML'20; fixed some typos

arXiv:1910.10362 [pdf, other]

Strategic Classification is Causal Modeling in Disguise

Authors: John Miller, Smitha Milli, Moritz Hardt

Abstract: Consequential decision-making incentivizes individuals to strategically adapt their behavior to the specifics of the decision rule. While a long line of work has viewed strategic adaptation as gaming and attempted to mitigate its effects, recent work has instead sought to design classifiers that incentivize individuals to improve a desired quality. Key to both accounts is a cost function that dict… ▽ More Consequential decision-making incentivizes individuals to strategically adapt their behavior to the specifics of the decision rule. While a long line of work has viewed strategic adaptation as gaming and attempted to mitigate its effects, recent work has instead sought to design classifiers that incentivize individuals to improve a desired quality. Key to both accounts is a cost function that dictates which adaptations are rational to undertake. In this work, we develop a causal framework for strategic adaptation. Our causal perspective clearly distinguishes between gaming and improvement and reveals an important obstacle to incentive design. We prove any procedure for designing classifiers that incentivize improvement must inevitably solve a non-trivial causal inference problem. Moreover, we show a similar result holds for designing cost functions that satisfy the requirements of previous work. With the benefit of hindsight, our results show much of the prior work on strategic classification is causal modeling in disguise. △ Less

Submitted 17 February, 2020; v1 submitted 23 October, 2019; originally announced October 2019.

Comments: This paper was previously titled "Strategic Adaptation to Classifiers: A Causal Perspective." The current version subsumes all previous versions

arXiv:1909.13231 [pdf, other]

Test-Time Training with Self-Supervision for Generalization under Distribution Shifts

Authors: Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei A. Efros, Moritz Hardt

Abstract: In this paper, we propose Test-Time Training, a general approach for improving the performance of predictive models when training and test data come from different distributions. We turn a single unlabeled test sample into a self-supervised learning problem, on which we update the model parameters before making a prediction. This also extends naturally to data in an online stream. Our simple appro… ▽ More In this paper, we propose Test-Time Training, a general approach for improving the performance of predictive models when training and test data come from different distributions. We turn a single unlabeled test sample into a self-supervised learning problem, on which we update the model parameters before making a prediction. This also extends naturally to data in an online stream. Our simple approach leads to improvements on diverse image classification benchmarks aimed at evaluating robustness to distribution shifts. △ Less

Submitted 1 July, 2020; v1 submitted 29 September, 2019; originally announced September 2019.

Comments: ICML 2020

arXiv:1908.01039 [pdf, other]

Linear Dynamics: Clustering without identification

Authors: Chloe Ching-Yun Hsu, Michaela Hardt, Moritz Hardt

Abstract: Linear dynamical systems are a fundamental and powerful parametric model class. However, identifying the parameters of a linear dynamical system is a venerable task, permitting provably efficient solutions only in special cases. This work shows that the eigenspectrum of unknown linear dynamics can be identified without full system identification. We analyze a computationally efficient and provably… ▽ More Linear dynamical systems are a fundamental and powerful parametric model class. However, identifying the parameters of a linear dynamical system is a venerable task, permitting provably efficient solutions only in special cases. This work shows that the eigenspectrum of unknown linear dynamics can be identified without full system identification. We analyze a computationally efficient and provably convergent algorithm to estimate the eigenvalues of the state-transition matrix in a linear dynamical system. When applied to time series clustering, our algorithm can efficiently cluster multi-dimensional time series with temporal offsets and varying lengths, under the assumption that the time series are generated from linear dynamical systems. Evaluating our algorithm on both synthetic data and real electrocardiogram (ECG) signals, we see improvements in clustering quality over existing baselines. △ Less

Submitted 29 February, 2020; v1 submitted 2 August, 2019; originally announced August 2019.

arXiv:1907.04911 [pdf, other]

Explaining an increase in predicted risk for clinical alerts

Authors: Michaela Hardt, Alvin Rajkomar, Gerardo Flores, Andrew Dai, Michael Howell, Greg Corrado, Claire Cui, Moritz Hardt

Abstract: Much work aims to explain a model's prediction on a static input. We consider explanations in a temporal setting where a stateful dynamical model produces a sequence of risk estimates given an input at each time step. When the estimated risk increases, the goal of the explanation is to attribute the increase to a few relevant inputs from the past. While our formal setup and techniques are general,… ▽ More Much work aims to explain a model's prediction on a static input. We consider explanations in a temporal setting where a stateful dynamical model produces a sequence of risk estimates given an input at each time step. When the estimated risk increases, the goal of the explanation is to attribute the increase to a few relevant inputs from the past. While our formal setup and techniques are general, we carry out an in-depth case study in a clinical setting. The goal here is to alert a clinician when a patient's risk of deterioration rises. The clinician then has to decide whether to intervene and adjust the treatment. Given a potentially long sequence of new events since she last saw the patient, a concise explanation helps her to quickly triage the alert. We develop methods to lift static attribution techniques to the dynamical setting, where we identify and address challenges specific to dynamics. We then experimentally assess the utility of different explanations of clinical alerts through expert evaluation. △ Less

Submitted 10 July, 2019; originally announced July 2019.

arXiv:1905.12580 [pdf, other]

Model Similarity Mitigates Test Set Overuse

Authors: Horia Mania, John Miller, Ludwig Schmidt, Moritz Hardt, Benjamin Recht

Abstract: Excessive reuse of test data has become commonplace in today's machine learning workflows. Popular benchmarks, competitions, industrial scale tuning, among other applications, all involve test data reuse beyond guidance by statistical confidence bounds. Nonetheless, recent replication studies give evidence that popular benchmarks continue to support progress despite years of extensive reuse. We pr… ▽ More Excessive reuse of test data has become commonplace in today's machine learning workflows. Popular benchmarks, competitions, industrial scale tuning, among other applications, all involve test data reuse beyond guidance by statistical confidence bounds. Nonetheless, recent replication studies give evidence that popular benchmarks continue to support progress despite years of extensive reuse. We proffer a new explanation for the apparent longevity of test data: Many proposed models are similar in their predictions and we prove that this similarity mitigates overfitting. Specifically, we show empirically that models proposed for the ImageNet ILSVRC benchmark agree in their predictions well beyond what we can conclude from their accuracy levels alone. Likewise, models created by large scale hyperparameter search enjoy high levels of similarity. Motivated by these empirical observations, we give a non-asymptotic generalization bound that takes similarity into account, leading to meaningful confidence bounds in practical settings. △ Less

Submitted 29 May, 2019; originally announced May 2019.

Comments: 18 pages, 7 figures

arXiv:1905.10360 [pdf, other]

The advantages of multiple classes for reducing overfitting from test set reuse

Authors: Vitaly Feldman, Roy Frostig, Moritz Hardt

Abstract: Excessive reuse of holdout data can lead to overfitting. However, there is little concrete evidence of significant overfitting due to holdout reuse in popular multiclass benchmarks today. Known results show that, in the worst-case, revealing the accuracy of $k$ adaptively chosen classifiers on a data set of size $n$ allows to create a classifier with bias of $Θ(\sqrt{k/n})$ for any binary predicti… ▽ More Excessive reuse of holdout data can lead to overfitting. However, there is little concrete evidence of significant overfitting due to holdout reuse in popular multiclass benchmarks today. Known results show that, in the worst-case, revealing the accuracy of $k$ adaptively chosen classifiers on a data set of size $n$ allows to create a classifier with bias of $Θ(\sqrt{k/n})$ for any binary prediction problem. We show a new upper bound of $\tilde O(\max\{\sqrt{k\log(n)/(mn)},k/n\})$ on the worst-case bias that any attack can achieve in a prediction problem with $m$ classes. Moreover, we present an efficient attack that achieve a bias of $Ω(\sqrt{k/(m^2 n)})$ and improves on previous work for the binary setting ($m=2$). We also present an inefficient attack that achieves a bias of $\tildeΩ(k/n)$. Complementing our theoretical work, we give new practical attacks to stress-test multiclass benchmarks by aiming to create as large a bias as possible with a given number of queries. Our experiments show that the additional uncertainty of prediction with a large number of classes indeed mitigates the effect of our best attacks. Our work extends developments in understanding overfitting due to adaptive data analysis to multiclass prediction problems. It also bears out the surprising fact that multiclass prediction problems are significantly more robust to overfitting when reusing a test (or holdout) dataset. This offers an explanation as to why popular multiclass prediction benchmarks, such as ImageNet, may enjoy a longer lifespan than what intuition from literature on binary classification suggests. △ Less

Submitted 24 May, 2019; originally announced May 2019.

arXiv:1902.04698 [pdf, other]

Identity Crisis: Memorization and Generalization under Extreme Overparameterization

Authors: Chiyuan Zhang, Samy Bengio, Moritz Hardt, Michael C. Mozer, Yoram Singer

Abstract: We study the interplay between memorization and generalization of overparameterized networks in the extreme case of a single training example and an identity-mapping task. We examine fully-connected and convolutional networks (FCN and CNN), both linear and nonlinear, initialized randomly and then trained to minimize the reconstruction error. The trained networks stereotypically take one of two for… ▽ More We study the interplay between memorization and generalization of overparameterized networks in the extreme case of a single training example and an identity-mapping task. We examine fully-connected and convolutional networks (FCN and CNN), both linear and nonlinear, initialized randomly and then trained to minimize the reconstruction error. The trained networks stereotypically take one of two forms: the constant function (memorization) and the identity function (generalization). We formally characterize generalization in single-layer FCNs and CNNs. We show empirically that different architectures exhibit strikingly different inductive biases. For example, CNNs of up to 10 layers are able to generalize from a single example, whereas FCNs cannot learn the identity function reliably from 60k examples. Deeper CNNs often fail, but nonetheless do astonishing work to memorize the training output: because CNN biases are location invariant, the model must progressively grow an output pattern from the image boundaries via the coordination of many layers. Our work helps to quantify and visualize the sensitivity of inductive biases to architectural choices such as depth, kernel width, and number of channels. △ Less

Submitted 8 January, 2020; v1 submitted 12 February, 2019; originally announced February 2019.

Comments: ICLR 2020

arXiv:1901.11143 [pdf, ps, other]

Natural Analysts in Adaptive Data Analysis

Authors: Tijana Zrnic, Moritz Hardt

Abstract: Adaptive data analysis is frequently criticized for its pessimistic generalization guarantees. The source of these pessimistic bounds is a model that permits arbitrary, possibly adversarial analysts that optimally use information to bias results. While being a central issue in the field, still lacking are notions of natural analysts that allow for more optimistic bounds faithful to the reality tha… ▽ More Adaptive data analysis is frequently criticized for its pessimistic generalization guarantees. The source of these pessimistic bounds is a model that permits arbitrary, possibly adversarial analysts that optimally use information to bias results. While being a central issue in the field, still lacking are notions of natural analysts that allow for more optimistic bounds faithful to the reality that typical analysts aren't adversarial. In this work, we propose notions of natural analysts that smoothly interpolate between the optimal non-adaptive bounds and the best-known adaptive generalization bounds. To accomplish this, we model the analyst's knowledge as evolving according to the rules of an unknown dynamical system that takes in revealed information and outputs new statistical queries to the data. This allows us to restrict the analyst through different natural control-theoretic notions. One such notion corresponds to a recency bias, formalizing an inability to arbitrarily use distant information. Another complementary notion formalizes an anchoring bias, a tendency to weight initial information more strongly. Both notions come with quantitative parameters that smoothly interpolate between the non-adaptive case and the fully adaptive case, allowing for a rich spectrum of intermediate analysts that are neither non-adaptive nor adversarial. Natural not only from a cognitive perspective, we show that our notions also capture standard optimization methods, like gradient descent in various settings. This gives a new interpretation to the fact that gradient descent tends to overfit much less than its adaptive nature might suggest. △ Less

Submitted 11 May, 2019; v1 submitted 30 January, 2019; originally announced January 2019.

Comments: 22 pages

Showing 1–50 of 104 results for author: Hardt, M