Delving into ChatGPT usage in academic writing through excess vocabulary

Authors: Dmitry Kobak, Rita González-Márquez, Emőke-Ágnes Horvát, Jan Lause

Abstract: Recent large language models (LLMs) can generate and revise text with human-level performance, and have been widely commercialized in systems like ChatGPT. These models come with clear limitations: they can produce inaccurate information, reinforce existing biases, and be easily misused. Yet, many scientists have been using them to assist their scholarly writing. How wide-spread is LLM usage in the academic literature currently? To answer this question, we use an unbiased, large-scale approach, free from any assumptions on academic LLM usage. We study vocabulary changes in 14 million PubMed abstracts from 2010-2024, and show how the appearance of LLMs led to an abrupt increase in the frequency of certain style words. Our analysis based on excess words usage suggests that at least 10% of 2024 abstracts were processed with LLMs. This lower bound differed across disciplines, countries, and journals, and was as high as 30% for some PubMed sub-corpora. We show that the appearance of LLM-based writing assistants has had an unprecedented impact in the scientific literature, surpassing the effect of major world events such as the Covid pandemic.

Submitted 3 July, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

Comments: v2: Updating dataset, figures and numbers to include all PubMed abstracts until end of June 2024

arXiv:2404.08403 [pdf, other]

Learning representations of learning representations

Authors: Rita González-Márquez, Dmitry Kobak

Abstract: The ICLR conference is unique among the top machine learning conferences in that all submitted papers are openly available. Here we present the ICLR dataset consisting of abstracts of all 24 thousand ICLR submissions from 2017-2024 with meta-data, decision scores, and custom keyword-based labels. We find that on this dataset, bag-of-words representation outperforms most dedicated sentence transformer models in terms of $k$NN classification accuracy, and the top performing language models barely outperform TF-IDF. We see this as a challenge for the NLP community. Furthermore, we use the ICLR dataset to study how the field of machine learning has changed over the last seven years, finding some improvement in gender balance. Using a 2D embedding of the abstracts' texts, we describe a shift in research topics from 2017 to 2024 and identify hedgehogs and foxes among the authors with the highest number of ICLR submissions.

Submitted 12 April, 2024; originally announced April 2024.

Journal ref: DMLR workshop at ICLR 2024

arXiv:2402.14566 [pdf, other]

Self-supervised Visualisation of Medical Image Datasets

Authors: Ifeoma Veronica Nwabufo, Jan Niklas Böhm, Philipp Berens, Dmitry Kobak

Abstract: Self-supervised learning methods based on data augmentations, such as SimCLR, BYOL, or DINO, allow obtaining semantically meaningful representations of image datasets and are widely used prior to supervised fine-tuning. A recent self-supervised learning method, $t$-SimCNE, uses contrastive learning to directly train a 2D representation suitable for visualisation. When applied to natural image datasets, $t$-SimCNE yields 2D visualisations with semantically meaningful clusters. In this work, we used $t$-SimCNE to visualise medical image datasets, including examples from dermatology, histology, and blood microscopy. We found that increasing the set of data augmentations to include arbitrary rotations improved the results in terms of class separability, compared to data augmentations used for natural images. Our 2D representations show medically relevant structures and can be used to aid data exploration and annotation, improving on common approaches for data visualisation.

Submitted 24 July, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

arXiv:2311.03087 [pdf, other]

Persistent Homology for High-dimensional Data Based on Spectral Methods

Authors: Sebastian Damrich, Philipp Berens, Dmitry Kobak

Abstract: Persistent homology is a popular computational tool for analyzing the topology of point clouds, such as the presence of loops or voids. However, many real-world datasets with low intrinsic dimensionality reside in an ambient space of much higher dimensionality. We show that in this case traditional persistent homology becomes very sensitive to noise and fails to detect the correct topology. The same holds true for existing refinements of persistent homology. As a remedy, we find that spectral distances on the $k$-nearest-neighbor graph of the data, such as diffusion distance and effective resistance, allow to detect the correct topology even in the presence of high-dimensional noise. Moreover, we derive a novel closed-form formula for effective resistance, and describe its relation to diffusion distances. Finally, we apply these methods to high-dimensional single-cell RNA-sequencing data and show that spectral distances allow robust detection of cell cycle loops.

Submitted 8 May, 2024; v1 submitted 6 November, 2023; originally announced November 2023.

Comments: 48 pages, 39 figures

arXiv:2210.09879 [pdf, other]

Unsupervised visualization of image datasets using contrastive learning

Authors: Jan Niklas Böhm, Philipp Berens, Dmitry Kobak

Abstract: Visualization methods based on the nearest neighbor graph, such as t-SNE or UMAP, are widely used for visualizing high-dimensional data. Yet, these approaches only produce meaningful results if the nearest neighbors themselves are meaningful. For images represented in pixel space this is not the case, as distances in pixel space are often not capturing our sense of similarity and therefore neighbors are not semantically close. This problem can be circumvented by self-supervised approaches based on contrastive learning, such as SimCLR, relying on data augmentation to generate implicit neighbors, but these methods do not produce two-dimensional embeddings suitable for visualization. Here, we present a new method, called t-SimCNE, for unsupervised visualization of image data. T-SimCNE combines ideas from contrastive learning and neighbor embeddings, and trains a parametric mapping from the high-dimensional pixel space into two dimensions. We show that the resulting 2D embeddings achieve classification accuracy comparable to the state-of-the-art high-dimensional SimCLR representations, thus faithfully capturing semantic relationships. Using t-SimCNE, we obtain informative visualizations of the CIFAR-10 and CIFAR-100 datasets, showing rich cluster structure and highlighting artifacts and outliers.

Submitted 28 February, 2023; v1 submitted 18 October, 2022; originally announced October 2022.

Comments: ICLR 2023

Journal ref: ICLR 2023

arXiv:2206.01816 [pdf, other]

From $t$-SNE to UMAP with contrastive learning

Authors: Sebastian Damrich, Jan Niklas Böhm, Fred A. Hamprecht, Dmitry Kobak

Abstract: Neighbor embedding methods $t$-SNE and UMAP are the de facto standard for visualizing high-dimensional datasets. Motivated from entirely different viewpoints, their loss functions appear to be unrelated. In practice, they yield strongly differing embeddings and can suggest conflicting interpretations of the same data. The fundamental reasons for this and, more generally, the exact relationship between $t$-SNE and UMAP have remained unclear. In this work, we uncover their conceptual connection via a new insight into contrastive learning methods. Noise-contrastive estimation can be used to optimize $t$-SNE, while UMAP relies on negative sampling, another contrastive method. We find the precise relationship between these two contrastive methods and provide a mathematical characterization of the distortion introduced by negative sampling. Visually, this distortion results in UMAP generating more compact embeddings with tighter clusters compared to $t$-SNE. We exploit this new conceptual connection to propose and implement a generalization of negative sampling, allowing us to interpolate between (and even extrapolate beyond) $t$-SNE and UMAP and their respective embeddings. Moving along this spectrum of embeddings leads to a trade-off between discrete / local and continuous / global structures, mitigating the risk of over-interpreting ostensible features of any single embedding. We provide a PyTorch implementation.

Submitted 28 February, 2023; v1 submitted 3 June, 2022; originally announced June 2022.

Comments: ICLR 2023. 44 pages, 19 figures. Code at https://github.com/hci-unihd/cl-tsne-umap and https://github.com/berenslab/contrastive-ne

Journal ref: ICLR 2023

arXiv:2205.07531 [pdf, other]

doi 10.1007/978-3-031-26387-3_7

Wasserstein t-SNE

Authors: Fynn Bachmann, Philipp Hennig, Dmitry Kobak

Abstract: Scientific datasets often have hierarchical structure: for example, in surveys, individual participants (samples) might be grouped at a higher level (units) such as their geographical region. In these settings, the interest is often in exploring the structure on the unit level rather than on the sample level. Units can be compared based on the distance between their means, however this ignores the within-unit distribution of samples. Here we develop an approach for exploratory analysis of hierarchical datasets using the Wasserstein distance metric that takes into account the shapes of within-unit distributions. We use t-SNE to construct 2D embeddings of the units, based on the matrix of pairwise Wasserstein distances between them. The distance matrix can be efficiently computed by approximating each unit with a Gaussian distribution, but we also provide a scalable method to compute exact Wasserstein distances. We use synthetic data to demonstrate the effectiveness of our Wasserstein t-SNE, and apply it to data from the 2017 German parliamentary election, considering polling stations as samples and voting districts as units. The resulting embedding uncovers meaningful structure in the data.

Submitted 23 June, 2022; v1 submitted 16 May, 2022; originally announced May 2022.

Comments: 16 pages, 10 figures, to be published at ECML/PKDD 2022

Journal ref: ECML PKDD 2022

arXiv:2011.14439 [pdf, other]

Scaling Down Deep Learning with MNIST-1D

Authors: Sam Greydanus, Dmitry Kobak

Abstract: Although deep learning models have taken on commercial and political relevance, key aspects of their training and operation remain poorly understood. This has sparked interest in science of deep learning projects, many of which require large amounts of time, money, and electricity. But how much of this research really needs to occur at scale? In this paper, we introduce MNIST-1D: a minimalist, procedurally generated, low-memory, and low-compute alternative to classic deep learning benchmarks. Although the dimensionality of MNIST-1D is only 40 and its default training set size only 4000, MNIST-1D can be used to study inductive biases of different deep architectures, find lottery tickets, observe deep double descent, metalearn an activation function, and demonstrate guillotine regularization in self-supervised learning. All these experiments can be conducted on a GPU or often even on a CPU within minutes, allowing for fast prototyping, educational use cases, and cutting-edge research on a low budget.

Submitted 3 June, 2024; v1 submitted 29 November, 2020; originally announced November 2020.

Comments: 12 pages, 11 figures

Journal ref: ICML 2024

arXiv:2007.08902 [pdf, other]

Attraction-Repulsion Spectrum in Neighbor Embeddings

Authors: Jan Niklas Böhm, Philipp Berens, Dmitry Kobak

Abstract: Neighbor embeddings are a family of methods for visualizing complex high-dimensional datasets using $k$NN graphs. To find the low-dimensional embedding, these algorithms combine an attractive force between neighboring pairs of points with a repulsive force between all points. One of the most popular examples of such algorithms is t-SNE. Here we empirically show that changing the balance between the attractive and the repulsive forces in t-SNE using the exaggeration parameter yields a spectrum of embeddings, which is characterized by a simple trade-off: stronger attraction can better represent continuous manifold structures, while stronger repulsion can better represent discrete cluster structures and yields higher $k$NN recall. We find that UMAP embeddings correspond to t-SNE with increased attraction; mathematical analysis shows that this is because the negative sampling optimisation strategy employed by UMAP strongly lowers the effective repulsion. Likewise, ForceAtlas2, commonly used for visualizing developmental single-cell transcriptomic data, yields embeddings corresponding to t-SNE with the attraction increased even more. At the extreme of this spectrum lie Laplacian Eigenmaps. Our results demonstrate that many prominent neighbor embedding algorithms can be placed onto the attraction-repulsion spectrum, and highlight the inherent trade-offs between them.

Submitted 18 October, 2022; v1 submitted 17 July, 2020; originally announced July 2020.

Journal ref: JMLR 23(95):1-32, 2022

arXiv:2006.10411 [pdf, other]

Sparse bottleneck neural networks for exploratory non-linear visualization of Patch-seq data

Authors: Yves Bernaerts, Philipp Berens, Dmitry Kobak

Abstract: Patch-seq, a recently developed experimental technique, allows neuroscientists to obtain transcriptomic and electrophysiological information from the same neurons. Efficiently analyzing and visualizing such paired multivariate data in order to extract biologically meaningful interpretations has, however, remained a challenge. Here, we use sparse deep neural networks with and without a two-dimensional bottleneck to predict electrophysiological features from the transcriptomic ones using a group lasso penalty, yielding concise and biologically interpretable two-dimensional visualizations. In two large example data sets, this visualization reveals known neural classes and their marker genes without biological prior knowledge. We also demonstrate that our method is applicable to other kinds of multimodal data, such as paired transcriptomic and proteomic measurements provided by CITE-seq.

Submitted 25 January, 2022; v1 submitted 18 June, 2020; originally announced June 2020.

Comments: 17 pages, 16 figures

arXiv:1902.05804 [pdf, other]

doi 10.1007/978-3-030-46150-8_8

Heavy-tailed kernels reveal a finer cluster structure in t-SNE visualisations

Authors: Dmitry Kobak, George Linderman, Stefan Steinerberger, Yuval Kluger, Philipp Berens

Abstract: T-distributed stochastic neighbour embedding (t-SNE) is a widely used data visualisation technique. It differs from its predecessor SNE by the low-dimensional similarity kernel: the Gaussian kernel was replaced by the heavy-tailed Cauchy kernel, solving the "crowding problem" of SNE. Here, we develop an efficient implementation of t-SNE for a $t$-distribution kernel with an arbitrary degree of freedom $ν$, with $ν\to\infty$ corresponding to SNE and $ν=1$ corresponding to the standard t-SNE. Using theoretical analysis and toy examples, we show that $ν<1$ can further reduce the crowding problem and reveal finer cluster structure that is invisible in standard t-SNE. We further demonstrate the striking effect of heavier-tailed kernels on large real-life data sets such as MNIST, single-cell RNA-sequencing data, and the HathiTrust library. We use domain knowledge to confirm that the revealed clusters are meaningful. Overall, we argue that modifying the tail heaviness of the t-SNE kernel can yield additional insight into the cluster structure of the data.

Submitted 4 April, 2019; v1 submitted 15 February, 2019; originally announced February 2019.

Journal ref: ECML PKDD 2019

arXiv:1805.10939 [pdf, other]

Optimal ridge penalty for real-world high-dimensional data can be zero or negative due to the implicit ridge regularization

Authors: Dmitry Kobak, Jonathan Lomond, Benoit Sanchez

Abstract: A conventional wisdom in statistical learning is that large models require strong regularization to prevent overfitting. Here we show that this rule can be violated by linear regression in the underdetermined $n\ll p$ situation under realistic conditions. Using simulations and real-life high-dimensional data sets, we demonstrate that an explicit positive ridge penalty can fail to provide any improvement over the minimum-norm least squares estimator. Moreover, the optimal value of ridge penalty in this situation can be negative. This happens when the high-variance directions in the predictor space can predict the response variable, which is often the case in the real-world high-dimensional data. In this regime, low-variance directions provide an implicit ridge regularization and can make any further positive ridge penalty detrimental. We prove that augmenting any linear model with random covariates and using minimum-norm estimator is asymptotically equivalent to adding the ridge penalty. We use a spiked covariance model as an analytically tractable example and prove that the optimal ridge penalty in this case is negative when $n\ll p$.

Submitted 9 April, 2020; v1 submitted 28 May, 2018; originally announced May 2018.

Journal ref: JMLR 21(169):1-16, 2020

arXiv:1804.09495 [pdf, other]

doi 10.1111/j.1740-9713.2018.01141.x

Putin's peaks: Russian election data revisited

Authors: Dmitry Kobak, Sergey Shpilkin, Maxim S. Pshenichnikov

Abstract: We study the anomalous prevalence of integer percentages in the last parliamentary (2016) and presidential (2018) Russian elections. We show how this anomaly in Russian federal elections has evolved since 2000.

Submitted 25 April, 2018; originally announced April 2018.

Comments: To appear in Significance magazine

Journal ref: Significance 15 (3), 2018, 8-9

arXiv:1410.6059 [pdf, other]

doi 10.1214/16-AOAS904

Integer percentages as electoral falsification fingerprints

Authors: Dmitry Kobak, Sergey Shpilkin, Maxim S. Pshenichnikov

Abstract: We hypothesize that if election results are manipulated or forged, then, due to the well-known human attraction to round numbers, the frequency of reported round percentages can be increased. To test this hypothesis, we analyzed raw data from seven federal elections held in the Russian Federation during the period from 2000 to 2012 and found that in all elections since 2004 the number of polling stations reporting turnout and/or leader's result expressed by an integer percentage (as opposed to a fractional value) was much higher than expected by pure chance. Monte Carlo simulations confirmed high statistical significance of the observed phenomenon, thereby suggesting its man-made nature. Geographical analysis showed that these anomalies were concentrated in a specific subset of Russian regions which strongly suggests its orchestrated origin. Unlike previously proposed statistical indicators of alleged electoral falsifications, our observations can hardly be explained differently but by a widespread election fraud.

Submitted 29 June, 2016; v1 submitted 22 October, 2014; originally announced October 2014.

Comments: Published at http://dx.doi.org/10.1214/16-AOAS904 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOAS-AOAS904

Journal ref: Annals of Applied Statistics 2016, Vol. 10, No. 1, 54-73

arXiv:1410.6049 [pdf]

doi 10.1152/jn.00631.2020

Motor skill learning by increasing the movement planning horizon

Authors: Luke Bashford, Dmitry Kobak, Carsten Mehring

Abstract: We investigated motor skill learning using a path tracking task, where human subjects had to track various curved paths as fast as possible, in the absence of any external perturbations. Subjects became better with practice, producing faster and smoother movements even when tracking novel untrained paths. Using a "searchlight" paradigm, where only a short segment of the path ahead of the cursor was shown, we found that subjects with a higher tracking skill took a longer chunk of the future path into account when computing the control policy for the upcoming movement segment. We observed the same effects in a second experiment where tracking speed was fixed and subjects were practicing to increase their accuracy. These findings demonstrate that human subjects increase their planning horizon when acquiring a motor skill.

Submitted 17 October, 2015; v1 submitted 22 October, 2014; originally announced October 2014.

Comments: 45 pages, 7 figures

Journal ref: Journal of Neurophysiology 127 (4) 2022, 995-1006

arXiv:1410.6031 [pdf, other]

doi 10.7554/eLife.10989

Demixed principal component analysis of population activity in higher cortical areas reveals independent representation of task parameters

Authors: Dmitry Kobak, Wieland Brendel, Christos Constantinidis, Claudia E. Feierstein, Adam Kepecs, Zachary F. Mainen, Ranulfo Romo, Xue-Lian Qi, Naoshige Uchida, Christian K. Machens

Abstract: Neurons in higher cortical areas, such as the prefrontal cortex, are known to be tuned to a variety of sensory and motor variables. The resulting diversity of neural tuning often obscures the represented information. Here we introduce a novel dimensionality reduction technique, demixed principal component analysis (dPCA), which automatically discovers and highlights the essential features in complex population activities. We reanalyze population data from the prefrontal areas of rats and monkeys performing a variety of working memory and decision-making tasks. In each case, dPCA summarizes the relevant features of the population response in a single figure. The population activity is decomposed into a few demixed components that capture most of the variance in the data and that highlight dynamic tuning of the population to various task parameters, such as stimuli, decisions, rewards, etc. Moreover, dPCA reveals strong, condition-independent components of the population activity that remain unnoticed with conventional approaches.

Submitted 22 October, 2014; originally announced October 2014.

Comments: 23 pages, 6 figures + supplementary information (21 pages, 15 figures)

Journal ref: Elife 5, 2016

arXiv:1205.0741 [pdf, other]

Statistical anomalies in 2011-2012 Russian elections revealed by 2D correlation analysis

Authors: Dmitry Kobak, Sergey Shpilkin, Maxim S. Pshenichnikov

Abstract: Here we perform a statistical analysis of the official data from recent Russian parliamentary and presidential elections (held on December 4th, 2011 and March 4th, 2012, respectively). A number of anomalies are identified that persistently skew the results in favour of the pro-government party, United Russia (UR), and its leader Vladimir Putin. The main irregularities are: (i) remarkably high correlation between turnout and voting results; (ii) a large number of polling stations where the UR/Putin results are given by a round number of percent; (iii) constituencies showing improbably low or (iv) anomalously high dispersion of results across polling stations; (v) substantial difference between results at paper-based and electronic polling stations. These anomalies, albeit less prominent in the presidential elections, hardly conform to the assumptions of fair and free voting. The approaches proposed here can be readily extended to quantify fingerprints of electoral fraud in any other problematic elections.

Submitted 17 May, 2012; v1 submitted 3 May, 2012; originally announced May 2012.

Comments: 12 pages, 5 figures; Methods slightly expanded

Showing 1–17 of 17 results for author: Kobak, D