Philip Stark

    Many widely used models amount to an elaborate means of making up numbers—but once a number has been produced, it tends to be taken seriously and its source (the model) is rarely examined carefully. Many widely used models have little connection to the real-world phenomena they purport to explain. Common steps in modeling to support policy decisions, such as putting disparate things on the same scale, may conflict with reality. Not all costs and benefits can be put on the same scale, not all uncertainties can be expressed as probabilities, and not all model parameters measure what they purport to measure. These ideas are illustrated with examples from seismology, wind-turbine bird deaths, soccer penalty cards, gender bias in academia, and climate policy.
    We explain why the Australian Electoral Commission should perform an audit of the paper Senate ballots against the published preference data files. We suggest four different post-election audit methods appropriate for Australian Senate elections. We have developed prototype code for all of them and tested it on preference data from the 2016 election.
    The pseudo-random number generators (PRNGs), sampling algorithms, and algorithms for generating random integers in some common statistical packages and programming languages are unnecessarily inaccurate, by an amount that may matter for statistical inference. Most use PRNGs with state spaces that are too small for contemporary sampling problems and methods such as the bootstrap and permutation tests. The random sampling algorithms in many packages rely on the false assumption that PRNGs produce IID U[0, 1) outputs. The discreteness of PRNG outputs and the limited state space of common PRNGs cause those algorithms to perform poorly in practice. Statistics packages and scientific programming languages should use cryptographically secure PRNGs by default (not for their security properties, but for their statistical ones), and offer weaker PRNGs only as an option. Software should not use methods that assume PRNG outputs are IID U[0,1) random variables, such as generating a random sample...
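    The recommendation above is easy to follow in practice. Below is a minimal sketch (not code from the paper) that draws a simple random sample with Python's random.SystemRandom, a generator backed by the operating system's entropy source, and illustrates the state-space argument: a 32-bit seed can select far fewer output streams than there are possible samples.

```python
# Minimal sketch, assuming only Python's standard library (not the paper's code).
import math
import random

secure = random.SystemRandom()             # cryptographically secure source of randomness

population = list(range(1_000_000))        # hypothetical sampling frame
sample = secure.sample(population, k=100)  # simple random sample without replacement

# State-space argument: a PRNG seeded with 32 bits can produce at most 2**32
# distinct output streams, far fewer than the number of possible samples.
n_samples = math.comb(1_000_000, 100)
print(f"possible samples: a {len(str(n_samples))}-digit number; 2**32 seeds is about 4.3e9")
```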
    Statistical tests of earthquake predictions require a null hypothesis to model occasional chance successes. To define and quantify `chance success' is knotty. Some null hypotheses ascribe chance to the Earth: Seismicity is modeled as random. The null distribution of the number of successful predictions -- or any other test statistic -- is taken to be its distribution when the fixed set of predictions is applied to random seismicity. Such tests tacitly assume that the predictions do not depend on the observed seismicity. Conditioning on the predictions in this way sets a low hurdle for statistical significance. Consider this scheme: When an earthquake of magnitude 5.5 or greater occurs anywhere in the world, predict that an earthquake at least as large will occur within 21 days and within an epicentral distance of 50 km. We apply this rule to the Harvard centroid-moment-tensor (CMT) catalog for 2000--2004 to generate a set of predictions. The null hypothesis is that earthquake ti...
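    As a concrete illustration of the prediction scheme described above, the sketch below (synthetic catalog, not the Harvard CMT data) generates predictions from the rule—after each event of magnitude 5.5 or greater, predict another at least as large within 21 days and 50 km—and counts how many succeed.

```python
# Toy illustration of the automatic prediction rule; the catalog is synthetic.
from dataclasses import dataclass
from math import radians, sin, cos, asin, sqrt
import random

@dataclass
class Quake:
    t_days: float   # event time in days from the start of the catalog
    lat: float
    lon: float
    mag: float

def km_between(a: Quake, b: Quake) -> float:
    """Great-circle distance in km (haversine)."""
    la1, lo1, la2, lo2 = map(radians, (a.lat, a.lon, b.lat, b.lon))
    h = sin((la2 - la1) / 2) ** 2 + cos(la1) * cos(la2) * sin((lo2 - lo1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))

def count_successes(catalog: list[Quake], window_days=21.0, radius_km=50.0, m_min=5.5):
    """For each triggering event, check whether any later, at-least-as-large event
    falls inside the space-time window."""
    events = sorted(catalog, key=lambda q: q.t_days)
    triggers = hits = 0
    for i, trig in enumerate(events):
        if trig.mag < m_min:
            continue
        triggers += 1
        for later in events[i + 1:]:
            if later.t_days - trig.t_days > window_days:
                break
            if later.mag >= trig.mag and km_between(trig, later) <= radius_km:
                hits += 1
                break
    return triggers, hits

rng = random.Random(0)   # toy catalog for illustration only
toy = [Quake(rng.uniform(0, 1826), rng.uniform(-60, 60),
             rng.uniform(-180, 180), rng.uniform(5.0, 7.5)) for _ in range(500)]
print(count_successes(toy))
```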
    Consider approximating a "black box" function f by an emulator f̂ based on n noiseless observations of f. Let w be a point in the domain of f. How big might the error |f̂(w) - f(w)| be? If f could be arbitrarily rough, this error could be arbitrarily large: we need some constraint on f besides the data. Suppose f is Lipschitz with known constant. We find a lower bound on the number of observations required to ensure that for the best emulator f̂ based on the n data, |f̂(w) - f(w)| <ϵ. But in general, we will not know whether f is Lipschitz, much less know its Lipschitz constant. Assume optimistically that f is Lipschitz-continuous with the smallest constant consistent with the n data. We find the maximum (over such regular f) of |f̂(w) - f(w)| for the best possible emulator f̂; we call this the "mini-minimax uncertainty" at w. In reality, f might not be Lipschitz or---if it is---it might not attain its Lipschitz constant on the data. Hence, the mini-minimax un...
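    One way to read the construction above is purely computational: estimate the smallest Lipschitz constant consistent with the data, then propagate it into bounds on f(w). The sketch below (one-dimensional, Euclidean distance, not the authors' code) does exactly that and returns the midpoint emulator together with its worst-case error.

```python
# Minimal 1-D sketch of the "mini-minimax" idea under the optimistic Lipschitz assumption.
import itertools
import numpy as np

def mini_minimax(x: np.ndarray, y: np.ndarray, w: float):
    """x, y: noiseless observations y[i] = f(x[i]); w: query point.
    Returns (best_prediction, half_width), assuming f is Lipschitz with the
    smallest constant consistent with the data."""
    # smallest Lipschitz constant consistent with the observations
    L = max(abs(y[i] - y[j]) / abs(x[i] - x[j])
            for i, j in itertools.combinations(range(len(x)), 2))
    upper = np.min(y + L * np.abs(w - x))   # f(w) can be no larger than this
    lower = np.max(y - L * np.abs(w - x))   # ... and no smaller than this
    return (upper + lower) / 2, (upper - lower) / 2  # midpoint emulator, its uncertainty

x = np.array([0.0, 0.3, 0.7, 1.0])
y = np.sin(2 * np.pi * x)                   # toy "black box"
print(mini_minimax(x, y, w=0.5))
```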
    The uncertainties associated with mathematical models that assess the costs and benefits of climate change policy options are unknowable. Such models can be valuable guides to scientific inquiry, but they should not be used to guide climate policy decisions. In the polarized climate change debate, cost-benefit analyses of policy options are taking on an increasingly influential role. These analyses have been presented by authoritative scholars as a useful contribution to the debate. But models of climate—and especially models of the impact of climate policy—are theorists' tools, not policy justification tools. The methods used to appraise model uncertainties give optimistic lower bounds on the true uncertainty, at best. Even in the finest modeling exercises, uncertainty in model structure is presented as known and manageable, when it is likely neither. Numbers arising from these modeling exercises should therefore not be presented as "facts" providing support to policy decisions. Building more complex models of climate will not necessarily reduce the uncertainties. Indeed, if previous experience is a guide, such models will reveal that current uncertainty estimates are unrealistically small. The fate of the evidence: Climate change is the quintessential "wicked problem": a knot in the uncomfortable area where uncertainty and disagreement about values affect the very framing of what the problem is. The issue of climate change has become so resonant and fraught that it speaks directly to our individual political and cultural identities. Scientists and other scholars often use non-scientific and value-laden rhetoric to emphasize to non-expert audiences what they believe to be the implications of their knowledge. For example, in Modelling the Climate System: An Overview, Gabriele Gramelsberger and Johann Feichter—after a sober discussion of statistical methods applicable to climate models—observe that "if mankind is unable to decide how to frame an appropriate response to climate change, nature will decide for both—environmental and economic calamities—as the economy is inextricably interconnected with the climate." Historians Naomi Oreskes and Erik M. Conway, in their recent book The Collapse of Western Civilization (2014), paint an apocalyptic picture of the next 80 years, beginning with the "year of perpetual summer" in 2023, and mass-imprisonment of "alarmist" scientists in 2025. Estimates of the impact of climate change turn out to be far too cautious: global temperatures increase dramatically and the sea level rises by eight meters, resulting in plagues of devastating diseases and insects, mass-extinction, the overthrow of governments, and the annihilation of the human populations of Africa and Australia. In the aftermath, survivors take the names of climate scientists as their middle names in recognition of their heroic attempts to warn the world. That the Earth's climate is changing, partly or largely because of anthropogenic emissions of CO2 and other
    Author(s): Benaloh, Josh; Rivest, Ronald; Ryan, Peter YA; Stark, Philip; Teague, Vanessa; Vora, Poorvi | Abstract: This pamphlet describes end-to-end election verifiability (E2E-V) for a nontechnical audience: election officials, public policymakers, and anyone else interested in secure, transparent, evidence-based electronic elections. This work is part of the Overseas Vote Foundation's End-to-End Verifiable Internet Voting: Specification and Feasibility Assessment Study (E2E VIV Project), funded by the Democracy Fund.
    Significance: Foraged leafy greens are consumed around the globe, including in urban areas, and may play a larger role when food is scarce or expensive. It is thus important to assess the safety and nutritional value of wild greens foraged in urban environments. Methods: Field observations, soil tests, and nutritional and toxicology tests on plant tissue were conducted for three sites, each roughly 9 square blocks, in disadvantaged neighborhoods in the East San Francisco Bay Area in 2014--2015. The sites included mixed-use areas and areas with high vehicle traffic. Results: Edible wild greens were abundant, even during record droughts. Soil at some survey sites had elevated concentrations of lead and cadmium, but tissue tests suggest that rinsed greens of the tested species are safe to eat. Daily consumption of standard servings comprises less than the EPA reference doses of lead, cadmium, and other heavy metals. Pesticides, glyphosate, and PCBs were below detection limits. The nutri...
    Many voter-verifiable, coercion-resistant schemes have been proposed, but even the most carefully designed systems necessarily leak information via the announced result. In corner cases, this may be problematic. For example, if all the votes go to one candidate then all vote privacy evaporates. The mere possibility of candidates getting no or few votes could have implications for security in practice: if a coercer demands that a voter cast a vote for such an unpopular candidate, then the voter may feel obliged to obey, even if she is confident that the voting system satisfies the standard coercion resistance definitions. With complex ballots, there may also be a danger of "Italian" style (aka "signature") attacks: the coercer demands the voter cast a ballot with a specific, identifying pattern. Here we propose an approach to tallying end-to-end verifiable schemes that avoids revealing all the votes but still achieves whatever confidence level in the announced res...
    There are many sources of error in counting votes: the apparent winner might not be the rightful winner. Hand tallies of the votes in a random sample of precincts can be used to test the hypothesis that a full manual recount would find a different outcome. This paper develops a conservative sequential test based on the vote-counting errors found in a hand tally of a simple or stratified random sample of precincts. The procedure includes a natural escalation: If the hypothesis that the apparent outcome is incorrect is not rejected at stage s, more precincts are audited. Eventually, either the hypothesis is rejected--and the apparent outcome is confirmed--or all precincts have been audited and the true outcome is known. The test uses a priori bounds on the overstatement of the margin that could result from error in each precinct. Such bounds can be derived from the reported counts in each precinct and upper bounds on the number of votes cast in each precinct. The test allows errors in...
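    The a priori bounds mentioned above have a simple form in a two-candidate race: a hand count of a precinct can at worst move every ballot from the reported winner to the runner-up, so the margin overstatement in precinct p is at most the reported margin there plus an upper bound on the ballots cast there. The sketch below (hypothetical numbers, a simplified reading rather than the paper's sequential test) computes those bounds; the audit escalates until the errors found, plus the bounds for unaudited precincts, cannot account for the reported margin.

```python
# Hypothetical two-candidate example; illustrates only the a priori error bounds,
# not the paper's conservative sequential test itself.
def precinct_overstatement_bounds(reported, ballot_caps):
    """reported: {precinct: (winner_votes, loser_votes)};
    ballot_caps: {precinct: upper bound on ballots cast in that precinct}."""
    return {p: (w - l) + ballot_caps[p] for p, (w, l) in reported.items()}

reported = {"P1": (400, 350), "P2": (510, 300), "P3": (120, 200)}
caps = {"P1": 800, "P2": 900, "P3": 350}

bounds = precinct_overstatement_bounds(reported, caps)
margin = sum(w - l for w, l in reported.values())
print("per-precinct bounds:", bounds, "reported margin:", margin)
```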
    Student ratings of teaching have been used, studied, and debated for almost a century. This article examines student ratings of teaching from a statistical perspective. The common practice of relying on averages of student teaching evaluation scores as the primary measure of teaching effectiveness for promotion and tenure decisions should be abandoned for substantive and statistical reasons: There is strong evidence that student responses to questions of "effectiveness" do not measure teaching effectiveness. Response rates and response variability matter. And comparing averages of categorical responses, even if the categories are represented by numbers, makes little sense. Student ratings of teaching are valuable when they ask the right questions, report response rates and score distributions, and are balanced by a variety of other sources and methods to evaluate teaching. Since 1975, course evaluations at University of California, Berkeley have asked: Considering both t...
    The City and County of San Francisco, CA, has used Instant Runoff Voting (IRV) for some elections since 2004. This report describes the first ever process pilot of Risk Limiting Audits for IRV, for the San Francisco District Attorney's race in November, 2019. We found that the vote-by-mail outcome could be efficiently audited to well under the 0.05 risk limit given a sample of only 200 ballots. All the software we developed for the pilot is open source.
    An evaluation of course evaluations: Student ratings of teaching have been used, studied, and debated for almost a century. This article examines student ratings of teaching from a statistical perspective. The common practice of relying on averages of student teaching evaluation scores as the primary measure of teaching effectiveness for promotion and tenure decisions should be abandoned for substantive and statistical reasons: There is strong evidence that student responses to questions of “effectiveness” do not measure teaching effectiveness. Response rates and response variability matter. And comparing averages of categorical responses, even if the categories are represented by numbers, makes little sense. Student ratings of teaching are valuable when they ask the right questions, report response rates and score distributions, and are balanced by a variety of other sources and methods to evaluate teaching.
    Abstract. What mathematicians, scientists, engineers, and statisticians mean by "inverse problem" differs. For a statistician, an inverse problem is an inference or estimation problem. The data are finite in number and contain errors, as they do in classical estimation or inference problems, and the unknown typically is infinite-dimensional, as it is in nonparametric regression. The additional complication in an inverse problem is that the data are only indirectly related to the unknown. Canonical abstract formulations of statistical estimation problems subsume this complication by allowing probability distributions to be indexed in more-or-less arbitrary ways by parameters, which can be infinite-dimensional. Standard statistical concepts, questions, and considerations such as bias, variance, mean-squared error, identifiability, consistency, efficiency, and various forms of optimality apply to inverse problems. This article discusses inverse problems as statistical estimation and i...
    The U.S. Census tries to enumerate all residents of the U.S., block by block, every ten years. (A block is the smallest unit of census geography; the area of blocks varies with population density: There are about 7 million blocks in the U.S.) State and sub-state counts matter for apportioning the House of Representatives, allocating Federal funds, congressional redistricting, urban planning, and so forth. Counting the population is difficult, and two kinds of error occur: gross omissions (GOs) and erroneous enumerations (EEs). A GO results from failing to count a person; an EE results from counting a person in error. Counting a person in the wrong block creates both a GO and an EE. Generally, GOs slightly exceed EEs, producing an undercount that is uneven demographically and geographically. In 1980, 1990, and 2000, the U.S. Census Bureau tried unsuccessfully to adjust census counts to reduce differential undercount using Dual Systems Estimation (DSE), a method based on CAPTURE-RECAP...
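    Dual Systems Estimation is, at its core, the textbook capture-recapture calculation: combine the census count for a post-stratum with an independent post-enumeration survey and estimate the true total from the overlap. The sketch below uses made-up numbers purely to show the arithmetic.

```python
# Illustrative numbers only; shows the Lincoln-Petersen arithmetic behind DSE.
def dual_systems_estimate(census_count: int, pes_count: int, matched: int) -> float:
    """Nhat = (n1 * n2) / m, where n1 is the census count for the group,
    n2 the PES count, and m the number of people found by both."""
    return census_count * pes_count / matched

n_hat = dual_systems_estimate(census_count=950, pes_count=100, matched=90)
print(f"estimated true population: {n_hat:.0f}")   # ~1056 for these made-up numbers
```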
    U.S. elections rely heavily on computers which introduce digital threats to election outcomes. Risk-limiting audits (RLAs) mitigate threats to some of these systems by manually inspecting random samples of ballot cards. RLAs have a large chance of correcting wrong outcomes (by conducting a full manual tabulation of a trustworthy record of the votes), but can save labor when reported outcomes are correct. This efficiency is eroded when sampling cannot be targeted to ballot cards that contain the contest(s) under audit. States that conduct RLAs of contests on multi-card ballots or of small contests can dramatically reduce sample sizes by using information about which ballot cards contain which contests---by keeping track of card-style data (CSD). For instance, CSD reduces the expected number of draws needed to audit a single countywide contest on a 4-card ballot by 75%. Similarly, CSD reduces the expected number of draws by 95% or more for an audit of two contests with the same margin...
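    The 75% figure quoted above is simple arithmetic: if a contest appears on only one card of a four-card ballot and draws cannot be targeted, roughly three of every four draws land on cards that do not contain the contest. A back-of-the-envelope sketch (hypothetical numbers, not the paper's calculation):

```python
# Hypothetical counts; illustrates the savings from card-style data (CSD).
cards_per_ballot = 4
draws_needed_on_contest = 100   # cards containing the contest the audit must examine

without_csd = draws_needed_on_contest * cards_per_ballot   # expected draws, untargeted
with_csd = draws_needed_on_contest                          # every draw hits the contest
print(f"expected draws without CSD: {without_csd}, with CSD: {with_csd}, "
      f"reduction: {1 - with_csd / without_csd:.0%}")       # 75%
```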
    A collection of races in a single election can be audited as a group by auditing a random sample of batches of ballots and combining observed discrepancies in the races represented in those batches in a particular way: the maximum across-race relative overstatement of pairwise margins (MARROP). A risk-limiting audit for the entire collection of races can be built on this ballot-based auditing using a variety of probability sampling schemes. The audit controls the familywise error rate (the chance that one or more incorrect outcomes fails to be corrected by a full hand count) at a cost that can be lower than that of controlling the per-comparison error rate with independent audits. The approach is particularly efficient if batches are drawn with probability proportional to a bound on the MARROP (PPEB sampling).
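    The batch-level statistic named above can be written down directly: for each race, compare reported and hand-counted margins in the batch for every (winner, loser) pair, scale by the race's overall reported margin, and take the maximum. The sketch below (hypothetical counts and a guessed data layout, not the paper's code) computes that quantity for one audited batch.

```python
# Hypothetical batch; a simplified reading of the MARROP statistic.
def marrop(reported_batch, audited_batch, overall_margins):
    """reported_batch / audited_batch: {race: {candidate: votes in this batch}};
    overall_margins: {race: {(winner, loser): overall reported margin in votes}}."""
    worst = float("-inf")
    for race, pairs in overall_margins.items():
        rep, aud = reported_batch[race], audited_batch[race]
        for (w, l), margin in pairs.items():
            overstatement = (rep[w] - rep[l]) - (aud[w] - aud[l])
            worst = max(worst, overstatement / margin)
    return worst

reported = {"Mayor": {"A": 120, "B": 80}, "Measure": {"Yes": 150, "No": 50}}
audited  = {"Mayor": {"A": 117, "B": 82}, "Measure": {"Yes": 150, "No": 50}}
margins  = {"Mayor": {("A", "B"): 5000}, "Measure": {("Yes", "No"): 12000}}
print(marrop(reported, audited, margins))   # (40 - 35) / 5000 = 0.001 for this batch
```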
    A revised plan for the 2000 Decennial Census was announced in a 24 February 1999 Bureau of the Census publication [99] and a press statement by K. Prewitt, Director of the Bureau of the Census [39]. Census 2000 will include counts and "adjusted" counts. The adjustments involve complicated procedures and calculations on data from a sample of blocks, extrapolated throughout the country to demographic groups called "post-strata." The 2000 adjustment plan is called Accuracy and Coverage Evaluation (ACE). ACE is quite similar to the 1990 adjustment plan, called the Post-Enumeration Survey (PES). The 1990 PES fails some plausibility checks [4, 12, 44] and probably would have reduced the accuracy of counts and state shares [3, 4]. ACE and PES differ in sample size, data capture, timing, record matching, post-stratification, methods to compensate for missing data, the treatment of movers, and details of the data analysis. ACE improves on PES in a number of ways, including using a larger sample, using a simpler model t...
    A revised plan for the 2000 Decennial Census was announced in a 24 February 1999 Bureau of the Census publication and a press statement by K. Prewitt, Director of the Bureau of the Census. Census 2000 will include counts and "adjusted" counts. The adjustments involve complicated procedures and calculations on data from a sample of blocks, extrapolated throughout the country to demographic groups called "post-strata." The 2000 adjustment plan is called Accuracy and Coverage Evaluation (ACE). ACE is quite similar to the 1990 adjustment plan, called the Post-Enumeration Survey (PES). The 1990 PES fails some plausibility checks and might well have reduced the accuracy of counts and state shares. ACE and PES differ in sample size, data capture, timing, record matching, post-stratification, methods to compensate for missing data, the treatment of movers, and details of the data analysis. ACE improves on PES in a number of ways, including using a larger sample, using a s...
    Direct recording electronic (DRE) voting systems have been shown time and time again to be vulnerable to hacking and malfunctioning. Despite mounting evidence that DREs are unfit for use, some states in the U.S. continue to use them for local, state, and federal elections. Georgia uses DREs exclusively, among many practices that have made its elections unfair and insecure. We give a brief history of election security and integrity in Georgia from the early 2000s to the 2018 election. Nonparametric permutation tests give strong evidence that something caused DREs not to record a substantial number of votes in this election. The undervote rate in the Lieutenant Governor’s race was far higher for voters who used DREs than for voters who used paper ballots. Undervote rates were strongly associated with ethnicity, with higher undervote rates in precincts where the percentage of Black voters was higher. There is specific evidence of DRE malfunction, too: one of the seven DREs in the Winte...
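    The kind of nonparametric permutation test referred to above can be sketched in a few lines: pool the per-precinct undervote rates, repeatedly relabel precincts as "DRE" or "paper" at random, and ask how often the relabeled difference in mean rates is at least as large as the observed one. The data below are synthetic, not the Georgia data, and the paper's exact test statistic may differ.

```python
# Synthetic per-precinct undervote rates; illustrates a one-sided permutation test.
import numpy as np

rng = np.random.default_rng(0)

def permutation_p_value(group_a, group_b, n_perm=10_000):
    """group_a, group_b: arrays of per-precinct undervote rates."""
    observed = group_a.mean() - group_b.mean()
    pooled = np.concatenate([group_a, group_b])
    n_a = len(group_a)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)                     # relabel precincts at random
        if perm[:n_a].mean() - perm[n_a:].mean() >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)

dre_rates = rng.normal(0.04, 0.01, size=60)     # toy data: higher undervote on DREs
paper_rates = rng.normal(0.01, 0.005, size=40)
print("p-value:", permutation_p_value(dre_rates, paper_rates))
```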
    For references, see J. Shaffer (1995) Multiple Hypothesis Testing, Ann. Rev. Psychol., 46, 561-584; J. Hsu (1996) Multiple Comparisons: Theory and Methods, Chapman and Hall, London. It is often the case that one wishes to test not just one, but several or many hypotheses. For example, one might be evaluating a collection of drugs, and want to test the family of null hypotheses that each is not effective. Suppose one tests each of these null hypotheses at level α. This level is called the “per-comparison error rate” (PCER). Clearly, the chance of making at least one Type I error is at least α, and is typically larger. Let {H_j}, j = 1, …, m (m for multiplicity), be the family of null hypotheses to be tested, and let H = ∩_j H_j be the “grand null hypothesis.” If H is true, the expected number of rejections is αm. The “familywise error rate” (FWER) is the probability of one or more incorrect rejections:
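    A short simulation makes the PCER/FWER distinction concrete: testing m true null hypotheses independently at level α yields at least one false rejection with probability about 1 − (1 − α)^m, while testing each at α/m (Bonferroni) keeps the familywise rate below α. The numbers below are illustrative only.

```python
# Illustrative simulation of FWER for m independent true nulls, with and without Bonferroni.
import numpy as np

rng = np.random.default_rng(0)
alpha, m, reps = 0.05, 20, 10_000

p_values = rng.uniform(size=(reps, m))          # p-values are uniform under true nulls
fwer_uncorrected = np.mean((p_values < alpha).any(axis=1))
fwer_bonferroni = np.mean((p_values < alpha / m).any(axis=1))

print(f"uncorrected FWER ~ {fwer_uncorrected:.3f} "
      f"(theory: {1 - (1 - alpha) ** m:.3f}); "
      f"Bonferroni FWER ~ {fwer_bonferroni:.3f} (<= {alpha})")
```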
    We propose a family of novel social choice functions. Our goal is to explore social choice functions for which ease of auditing is a primary design goal, instead of being ignored or left as a puzzle to solve later.
    The pseudo-random number generators (PRNGs), sampling algorithms, and algorithms for generating random integers in some common statistical packages and programming languages are unnecessarily inaccurate, by an amount that may matter for statistical inference. Most use PRNGs with state spaces that are too small for contemporary sampling problems and methods such as the bootstrap and permutation tests. The random sampling algorithms in many packages rely on the false assumption that PRNGs produce IID $U[0, 1)$ outputs. The discreteness of PRNG outputs and the limited state space of common PRNGs cause those algorithms to perform poorly in practice. Statistics packages and scientific programming languages should use cryptographically secure PRNGs by default (not for their security properties, but for their statistical ones), and offer weaker PRNGs only as an option. Software should not use methods that assume PRNG outputs are IID $U[0,1)$ random variables, such as generating a random sa...
    We provide Risk Limiting Audits for proportional representation election systems such as D’Hondt and Sainte-Lague. These techniques could be used to produce evidence of correct (electronic) election outcomes in Denmark, Luxembourg, Estonia, Norway, and many other countries.
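    For readers unfamiliar with the allocation rules being audited, the D'Hondt method itself is short: repeatedly award the next seat to the party with the largest quotient votes/(seats won + 1). The sketch below (made-up vote totals) shows the allocation only; it is not the audit procedure from the paper.

```python
# D'Hondt seat allocation with made-up vote totals; ties broken arbitrarily.
import heapq

def dhondt(votes: dict[str, int], seats: int) -> dict[str, int]:
    """Repeatedly award the next seat to the party with the largest quotient
    votes / (seats_won + 1)."""
    won = {p: 0 for p in votes}
    heap = [(-v, p) for p, v in votes.items()]   # max-heap of (-quotient, party)
    heapq.heapify(heap)
    for _ in range(seats):
        _, p = heapq.heappop(heap)
        won[p] += 1
        heapq.heappush(heap, (-votes[p] / (won[p] + 1), p))
    return won

print(dhondt({"A": 340_000, "B": 280_000, "C": 160_000, "D": 60_000}, seats=7))
```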
    Statistical tests of earthquake predictions require a null hypothesis to model occasional chance successes. To define and quantify ‘chance success’ is knotty. Some null hypotheses ascribe chance to the Earth: Seismicity is modeled as random. The null distribution of the number of successful predictions – or any other test statistic – is taken to be its distribution when the fixed set of predictions is applied to random seismicity. Such tests tacitly assume that the predictions do not depend on the observed seismicity. Conditioning on the predictions in this way sets a low hurdle for statistical significance. Consider this scheme: When an earthquake of magnitude 5.5 or greater occurs anywhere in the world, predict that an earthquake at least as large will occur within 21 days and within an epicentral distance of 50 km. We apply this rule to the Harvard centroid-moment-tensor (CMT) catalog for 2000–2004 to generate a set of predictions. The null hypothesis is that earthquake times are...
    Post-election audits can provide convincing evidence that election outcomes are correct—that the reported winner(s) really won—by manually inspecting ballots selected at random from a trustworthy paper trail of votes. Risk-limiting audits (RLAs) control the probability that, if the reported outcome is wrong, it is not corrected before the outcome becomes official. RLAs keep this probability below the specified “risk limit.” Bayesian audits (BAs) control the probability that the reported outcome is wrong, the “upset probability.” The upset probability does not exist unless one posits a prior probability distribution for cast votes. RLAs ensure that if this election’s reported outcome is wrong, the procedure has a large chance of correcting it. BAs control a weighted average probability of correcting wrong outcomes over a hypothetical collection of elections; the weights come from the prior. There are priors for which the upset probability is equal to the risk, but in general, BAs do ...
    References: Daubechies, I., 1992. Ten Lectures on Wavelets, SIAM, Philadelphia, PA. Donoho, D.L., I.M. Johnstone, G. Kerkyacharian, and D. Picard, 1993. Density estimation by wavelet thresholding. http://www-stat.stanford.edu/ donoho/Reports/1993/dens.pdf Evans, S.N. and P.B. Stark, 2002. Inverse problems as statistics, Inverse Problems, 18, R55–R97. Hengartner, N.W. and P.B. Stark, 1995. Finite-sample confidence envelopes for shape-restricted densities. Ann. Stat., 23, pp. 525–550. Silverman, B.W., 1990. Density Estimation for Statistics and Data Analysis, Chapman and Hall, London.
    This note presents three ways of constructing simultaneous confidence intervals for linear estimates of linear functionals in inverse problems, including "Backus-Gilbert" estimates. Simultaneous confidence intervals are needed to compare estimates, for example, to find spatial variations in a distributed parameter. The notion of simultaneous confidence intervals is introduced using coin tossing as an example before moving to linear inverse problems. The first method for constructing simultaneous confidence intervals is based on the Bonferroni inequality, and applies generally to confidence intervals for any set of parameters, from dependent or independent observations. The second method for constructing simultaneous confidence intervals in inverse problems is based on a "global" measure of fit to the data, which allows one to compute simultaneous confidence intervals for any number of linear functionals of the model that are linear combinations of the data mappings. This leads to confide...
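    The first construction mentioned above is easy to state concretely: to cover k functionals simultaneously at confidence level 1 − α, build each marginal interval at level 1 − α/k. The sketch below assumes Gaussian errors purely for illustration; the note itself is more general.

```python
# Bonferroni simultaneous confidence intervals, assuming Gaussian errors for illustration.
import numpy as np
from scipy.stats import norm

def bonferroni_intervals(estimates, std_errs, alpha=0.05):
    """Each marginal interval gets level 1 - alpha/k, so joint coverage is >= 1 - alpha."""
    estimates, std_errs = np.asarray(estimates), np.asarray(std_errs)
    k = len(estimates)
    z = norm.ppf(1 - alpha / (2 * k))     # two-sided critical value, alpha split across k
    return np.column_stack([estimates - z * std_errs, estimates + z * std_errs])

print(bonferroni_intervals([1.2, -0.4, 2.7], [0.3, 0.5, 0.4]))
```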
    Student evaluations of teaching (SET) are widely used in academic personnel decisions as a measure of teaching effectiveness. We show: SET are biased against female instructors by an amount that is large and statistically significant; the bias affects how students rate even putatively objective aspects of teaching, such as how promptly assignments are graded; the bias varies by discipline and by student gender, among other things; it is not possible to adjust for the bias, because it depends on so many factors; SET are more sensitive to students' gender bias and grade expectations than they are to teaching effectiveness; and gender biases can be large enough to cause more effective instructors to get lower SET than less effective instructors. These findings are based on nonparametric statistical tests applied to two datasets: 23,001 SET of 379 instructors by 4,423 students in six mandatory first-year courses in a five-year natural experiment at a French university, and 43 SET for four sec...
    Student ratings of teaching have been used, studied, and debated for almost a century. This article examines student ratings of teaching from a statistical perspective. The common practice of relying on averages of student teaching evaluation scores as the primary measure of teaching effectiveness for promotion and tenure decisions should be abandoned for substantive and statistical reasons: There is strong evidence that student responses to questions of “effectiveness” do not measure teaching effectiveness. Response rates and response variability matter. And comparing averages of categorical responses, even if the categories are represented by numbers, makes little sense. Student ratings of teaching are valuable when they ask the right questions, report response rates and score distributions, and are balanced by a variety of other sources and methods to evaluate teaching.
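    The point about averages of categorical responses is easy to see with a toy example: two instructors can have identical mean ratings while receiving very different ratings. The numbers below are made up, not data from the article.

```python
# Made-up ratings: same mean, very different distributions.
from collections import Counter
from statistics import mean

instructor_a = [4, 4, 4, 4, 4, 4, 4, 4]    # uniformly middling
instructor_b = [7, 7, 7, 7, 1, 1, 1, 1]    # polarizing: half top scores, half bottom

for name, scores in [("A", instructor_a), ("B", instructor_b)]:
    print(name, "mean:", mean(scores), "distribution:", dict(Counter(scores)))
```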

    And 153 more