
Showing 1–17 of 17 results for author: Kobak, D

  1. arXiv:2406.07016  [pdf, other]

    cs.CL cs.AI cs.CY cs.DL cs.SI

    Delving into ChatGPT usage in academic writing through excess vocabulary

    Authors: Dmitry Kobak, Rita González-Márquez, Emőke-Ágnes Horvát, Jan Lause

    Abstract: Recent large language models (LLMs) can generate and revise text with human-level performance, and have been widely commercialized in systems like ChatGPT. These models come with clear limitations: they can produce inaccurate information, reinforce existing biases, and be easily misused. Yet, many scientists have been using them to assist their scholarly writing. How wide-spread is LLM usage in th…

    Submitted 3 July, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

    Comments: v2: Updating dataset, figures and numbers to include all PubMed abstracts until end of June 2024

  2. arXiv:2404.08403  [pdf, other]

    cs.CL cs.DL cs.LG

    Learning representations of learning representations

    Authors: Rita González-Márquez, Dmitry Kobak

    Abstract: The ICLR conference is unique among the top machine learning conferences in that all submitted papers are openly available. Here we present the ICLR dataset consisting of abstracts of all 24 thousand ICLR submissions from 2017-2024 with meta-data, decision scores, and custom keyword-based labels. We find that on this dataset, bag-of-words representation outperforms most dedicated sentence transfor…

    Submitted 12 April, 2024; originally announced April 2024.

    Journal ref: DMLR workshop at ICLR 2024

  3. arXiv:2402.14566  [pdf, other]

    cs.CV

    Self-supervised Visualisation of Medical Image Datasets

    Authors: Ifeoma Veronica Nwabufo, Jan Niklas Böhm, Philipp Berens, Dmitry Kobak

    Abstract: Self-supervised learning methods based on data augmentations, such as SimCLR, BYOL, or DINO, allow obtaining semantically meaningful representations of image datasets and are widely used prior to supervised fine-tuning. A recent self-supervised learning method, $t$-SimCNE, uses contrastive learning to directly train a 2D representation suitable for visualisation. When applied to natural image data…

    Submitted 24 July, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

  4. arXiv:2311.03087  [pdf, other]

    cs.LG math.AT

    Persistent Homology for High-dimensional Data Based on Spectral Methods

    Authors: Sebastian Damrich, Philipp Berens, Dmitry Kobak

    Abstract: Persistent homology is a popular computational tool for analyzing the topology of point clouds, such as the presence of loops or voids. However, many real-world datasets with low intrinsic dimensionality reside in an ambient space of much higher dimensionality. We show that in this case traditional persistent homology becomes very sensitive to noise and fails to detect the correct topology. The sa…

    Submitted 8 May, 2024; v1 submitted 6 November, 2023; originally announced November 2023.

    Comments: 48 pages, 39 figures

  5. arXiv:2210.09879  [pdf, other]

    cs.LG cs.CV cs.HC

    Unsupervised visualization of image datasets using contrastive learning

    Authors: Jan Niklas Böhm, Philipp Berens, Dmitry Kobak

    Abstract: Visualization methods based on the nearest neighbor graph, such as t-SNE or UMAP, are widely used for visualizing high-dimensional data. Yet, these approaches only produce meaningful results if the nearest neighbors themselves are meaningful. For images represented in pixel space this is not the case, as distances in pixel space are often not capturing our sense of similarity and therefore neighbo…

    Submitted 28 February, 2023; v1 submitted 18 October, 2022; originally announced October 2022.

    Comments: ICLR 2023

    Journal ref: ICLR 2023

  6. arXiv:2206.01816  [pdf, other]

    cs.LG cs.HC

    From $t$-SNE to UMAP with contrastive learning

    Authors: Sebastian Damrich, Jan Niklas Böhm, Fred A. Hamprecht, Dmitry Kobak

    Abstract: Neighbor embedding methods $t$-SNE and UMAP are the de facto standard for visualizing high-dimensional datasets. Motivated from entirely different viewpoints, their loss functions appear to be unrelated. In practice, they yield strongly differing embeddings and can suggest conflicting interpretations of the same data. The fundamental reasons for this and, more generally, the exact relationship bet…

    Submitted 28 February, 2023; v1 submitted 3 June, 2022; originally announced June 2022.

    Comments: ICLR 2023. 44 pages, 19 figures. Code at https://github.com/hci-unihd/cl-tsne-umap and https://github.com/berenslab/contrastive-ne

    Journal ref: ICLR 2023

  7. arXiv:2205.07531  [pdf, other]

    cs.LG cs.HC stat.ML

    Wasserstein t-SNE

    Authors: Fynn Bachmann, Philipp Hennig, Dmitry Kobak

    Abstract: Scientific datasets often have hierarchical structure: for example, in surveys, individual participants (samples) might be grouped at a higher level (units) such as their geographical region. In these settings, the interest is often in exploring the structure on the unit level rather than on the sample level. Units can be compared based on the distance between their means, however this ignores the…

    Submitted 23 June, 2022; v1 submitted 16 May, 2022; originally announced May 2022.

    Comments: 16 pages, 10 figures, to be published at ECML/PKDD 2022

    Journal ref: ECML PKDD 2022

  8. arXiv:2011.14439  [pdf, other]

    cs.LG cs.NE stat.ML

    Scaling Down Deep Learning with MNIST-1D

    Authors: Sam Greydanus, Dmitry Kobak

    Abstract: Although deep learning models have taken on commercial and political relevance, key aspects of their training and operation remain poorly understood. This has sparked interest in science of deep learning projects, many of which require large amounts of time, money, and electricity. But how much of this research really needs to occur at scale? In this paper, we introduce MNIST-1D: a minimalist, pro…

    Submitted 3 June, 2024; v1 submitted 29 November, 2020; originally announced November 2020.

    Comments: 12 pages, 11 figures

    Journal ref: ICML 2024

  9. arXiv:2007.08902  [pdf, other]

    cs.LG stat.ML

    Attraction-Repulsion Spectrum in Neighbor Embeddings

    Authors: Jan Niklas Böhm, Philipp Berens, Dmitry Kobak

    Abstract: Neighbor embeddings are a family of methods for visualizing complex high-dimensional datasets using $k$NN graphs. To find the low-dimensional embedding, these algorithms combine an attractive force between neighboring pairs of points with a repulsive force between all points. One of the most popular examples of such algorithms is t-SNE. Here we empirically show that changing the balance between th…

    Submitted 18 October, 2022; v1 submitted 17 July, 2020; originally announced July 2020.

    Journal ref: JMLR 23(95):1-32, 2022

  10. arXiv:2006.10411  [pdf, other]

    cs.LG stat.ML

    Sparse bottleneck neural networks for exploratory non-linear visualization of Patch-seq data

    Authors: Yves Bernaerts, Philipp Berens, Dmitry Kobak

    Abstract: Patch-seq, a recently developed experimental technique, allows neuroscientists to obtain transcriptomic and electrophysiological information from the same neurons. Efficiently analyzing and visualizing such paired multivariate data in order to extract biologically meaningful interpretations has, however, remained a challenge. Here, we use sparse deep neural networks with and without a two-dimensio…

    Submitted 25 January, 2022; v1 submitted 18 June, 2020; originally announced June 2020.

    Comments: 17 pages, 16 figures

  11. Heavy-tailed kernels reveal a finer cluster structure in t-SNE visualisations

    Authors: Dmitry Kobak, George Linderman, Stefan Steinerberger, Yuval Kluger, Philipp Berens

    Abstract: T-distributed stochastic neighbour embedding (t-SNE) is a widely used data visualisation technique. It differs from its predecessor SNE by the low-dimensional similarity kernel: the Gaussian kernel was replaced by the heavy-tailed Cauchy kernel, solving the "crowding problem" of SNE. Here, we develop an efficient implementation of t-SNE for a $t$-distribution kernel with an arbitrary degree of fre…

    Submitted 4 April, 2019; v1 submitted 15 February, 2019; originally announced February 2019.

    Journal ref: ECML PKDD 2019

  12. arXiv:1805.10939  [pdf, other]

    math.ST stat.ML

    Optimal ridge penalty for real-world high-dimensional data can be zero or negative due to the implicit ridge regularization

    Authors: Dmitry Kobak, Jonathan Lomond, Benoit Sanchez

    Abstract: A conventional wisdom in statistical learning is that large models require strong regularization to prevent overfitting. Here we show that this rule can be violated by linear regression in the underdetermined $n\ll p$ situation under realistic conditions. Using simulations and real-life high-dimensional data sets, we demonstrate that an explicit positive ridge penalty can fail to provide any impro…

    Submitted 9 April, 2020; v1 submitted 28 May, 2018; originally announced May 2018.

    Journal ref: JMLR 21(169):1-16, 2020

  13. Putin's peaks: Russian election data revisited

    Authors: Dmitry Kobak, Sergey Shpilkin, Maxim S. Pshenichnikov

    Abstract: We study the anomalous prevalence of integer percentages in the last parliamentary (2016) and presidential (2018) Russian elections. We show how this anomaly in Russian federal elections has evolved since 2000.

    Submitted 25 April, 2018; originally announced April 2018.

    Comments: To appear in Significance magazine

    Journal ref: Significance 15 (3), 2018, 8-9

  14. Integer percentages as electoral falsification fingerprints

    Authors: Dmitry Kobak, Sergey Shpilkin, Maxim S. Pshenichnikov

    Abstract: We hypothesize that if election results are manipulated or forged, then, due to the well-known human attraction to round numbers, the frequency of reported round percentages can be increased. To test this hypothesis, we analyzed raw data from seven federal elections held in the Russian Federation during the period from 2000 to 2012 and found that in all elections since 2004 the number of polling s…

    Submitted 29 June, 2016; v1 submitted 22 October, 2014; originally announced October 2014.

    Comments: Published at http://dx.doi.org/10.1214/16-AOAS904 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

    Report number: IMS-AOAS-AOAS904

    Journal ref: Annals of Applied Statistics 2016, Vol. 10, No. 1, 54-73

  15. Motor skill learning by increasing the movement planning horizon

    Authors: Luke Bashford, Dmitry Kobak, Carsten Mehring

    Abstract: We investigated motor skill learning using a path tracking task, where human subjects had to track various curved paths as fast as possible, in the absence of any external perturbations. Subjects became better with practice, producing faster and smoother movements even when tracking novel untrained paths. Using a "searchlight" paradigm, where only a short segment of the path ahead of the cursor wa…

    Submitted 17 October, 2015; v1 submitted 22 October, 2014; originally announced October 2014.

    Comments: 45 pages, 7 figures

    Journal ref: Journal of Neurophysiology 127 (4) 2022, 995-1006

  16. arXiv:1410.6031  [pdf, other]

    q-bio.NC stat.ML

    Demixed principal component analysis of population activity in higher cortical areas reveals independent representation of task parameters

    Authors: Dmitry Kobak, Wieland Brendel, Christos Constantinidis, Claudia E. Feierstein, Adam Kepecs, Zachary F. Mainen, Ranulfo Romo, Xue-Lian Qi, Naoshige Uchida, Christian K. Machens

    Abstract: Neurons in higher cortical areas, such as the prefrontal cortex, are known to be tuned to a variety of sensory and motor variables. The resulting diversity of neural tuning often obscures the represented information. Here we introduce a novel dimensionality reduction technique, demixed principal component analysis (dPCA), which automatically discovers and highlights the essential features in compl…

    Submitted 22 October, 2014; originally announced October 2014.

    Comments: 23 pages, 6 figures + supplementary information (21 pages, 15 figures)

    Journal ref: Elife 5, 2016

  17. arXiv:1205.0741  [pdf, other]

    physics.soc-ph stat.AP

    Statistical anomalies in 2011-2012 Russian elections revealed by 2D correlation analysis

    Authors: Dmitry Kobak, Sergey Shpilkin, Maxim S. Pshenichnikov

    Abstract: Here we perform a statistical analysis of the official data from recent Russian parliamentary and presidential elections (held on December 4th, 2011 and March 4th, 2012, respectively). A number of anomalies are identified that persistently skew the results in favour of the pro-government party, United Russia (UR), and its leader Vladimir Putin. The main irregularities are: (i) remarkably high corr…

    Submitted 17 May, 2012; v1 submitted 3 May, 2012; originally announced May 2012.

    Comments: 12 pages, 5 figures; Methods slightly expanded