Showing 1–11 of 11 results for author: Kobak, D

Searching in archive cs.
  1. arXiv:2406.07016  [pdf, other]

    cs.CL cs.AI cs.CY cs.DL cs.SI

    Delving into ChatGPT usage in academic writing through excess vocabulary

    Authors: Dmitry Kobak, Rita González-Márquez, Emőke-Ágnes Horvát, Jan Lause

    Abstract: Recent large language models (LLMs) can generate and revise text with human-level performance, and have been widely commercialized in systems like ChatGPT. These models come with clear limitations: they can produce inaccurate information, reinforce existing biases, and be easily misused. Yet, many scientists have been using them to assist their scholarly writing. How wide-spread is LLM usage in th…

    Submitted 3 July, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

    Comments: v2: Updating dataset, figures and numbers to include all PubMed abstracts until end of June 2024

  2. arXiv:2404.08403  [pdf, other]

    cs.CL cs.DL cs.LG

    Learning representations of learning representations

    Authors: Rita González-Márquez, Dmitry Kobak

    Abstract: The ICLR conference is unique among the top machine learning conferences in that all submitted papers are openly available. Here we present the ICLR dataset consisting of abstracts of all 24 thousand ICLR submissions from 2017-2024 with meta-data, decision scores, and custom keyword-based labels. We find that on this dataset, bag-of-words representation outperforms most dedicated sentence transfor…

    Submitted 12 April, 2024; originally announced April 2024.

    Journal ref: DMLR workshop at ICLR 2024

  3. arXiv:2402.14566  [pdf, other]

    cs.CV

    Self-supervised Visualisation of Medical Image Datasets

    Authors: Ifeoma Veronica Nwabufo, Jan Niklas Böhm, Philipp Berens, Dmitry Kobak

    Abstract: Self-supervised learning methods based on data augmentations, such as SimCLR, BYOL, or DINO, allow obtaining semantically meaningful representations of image datasets and are widely used prior to supervised fine-tuning. A recent self-supervised learning method, $t$-SimCNE, uses contrastive learning to directly train a 2D representation suitable for visualisation. When applied to natural image data…

    Submitted 24 July, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

  4. arXiv:2311.03087  [pdf, other]

    cs.LG math.AT

    Persistent Homology for High-dimensional Data Based on Spectral Methods

    Authors: Sebastian Damrich, Philipp Berens, Dmitry Kobak

    Abstract: Persistent homology is a popular computational tool for analyzing the topology of point clouds, such as the presence of loops or voids. However, many real-world datasets with low intrinsic dimensionality reside in an ambient space of much higher dimensionality. We show that in this case traditional persistent homology becomes very sensitive to noise and fails to detect the correct topology. The sa…

    Submitted 8 May, 2024; v1 submitted 6 November, 2023; originally announced November 2023.

    Comments: 48 pages, 39 figures

  5. arXiv:2210.09879  [pdf, other]

    cs.LG cs.CV cs.HC

    Unsupervised visualization of image datasets using contrastive learning

    Authors: Jan Niklas Böhm, Philipp Berens, Dmitry Kobak

    Abstract: Visualization methods based on the nearest neighbor graph, such as t-SNE or UMAP, are widely used for visualizing high-dimensional data. Yet, these approaches only produce meaningful results if the nearest neighbors themselves are meaningful. For images represented in pixel space this is not the case, as distances in pixel space are often not capturing our sense of similarity and therefore neighbo…

    Submitted 28 February, 2023; v1 submitted 18 October, 2022; originally announced October 2022.

    Comments: ICLR 2023

    Journal ref: ICLR 2023

  6. arXiv:2206.01816  [pdf, other]

    cs.LG cs.HC

    From $t$-SNE to UMAP with contrastive learning

    Authors: Sebastian Damrich, Jan Niklas Böhm, Fred A. Hamprecht, Dmitry Kobak

    Abstract: Neighbor embedding methods $t$-SNE and UMAP are the de facto standard for visualizing high-dimensional datasets. Motivated from entirely different viewpoints, their loss functions appear to be unrelated. In practice, they yield strongly differing embeddings and can suggest conflicting interpretations of the same data. The fundamental reasons for this and, more generally, the exact relationship bet…

    Submitted 28 February, 2023; v1 submitted 3 June, 2022; originally announced June 2022.

    Comments: ICLR 2023. 44 pages, 19 figures. Code at https://github.com/hci-unihd/cl-tsne-umap and https://github.com/berenslab/contrastive-ne

    Journal ref: ICLR 2023

  7. arXiv:2205.07531  [pdf, other]

    cs.LG cs.HC stat.ML

    Wasserstein t-SNE

    Authors: Fynn Bachmann, Philipp Hennig, Dmitry Kobak

    Abstract: Scientific datasets often have hierarchical structure: for example, in surveys, individual participants (samples) might be grouped at a higher level (units) such as their geographical region. In these settings, the interest is often in exploring the structure on the unit level rather than on the sample level. Units can be compared based on the distance between their means, however this ignores the…

    Submitted 23 June, 2022; v1 submitted 16 May, 2022; originally announced May 2022.

    Comments: 16 pages, 10 figures, to be published at ECML/PKDD 2022

    Journal ref: ECML PKDD 2022

  8. arXiv:2011.14439  [pdf, other]

    cs.LG cs.NE stat.ML

    Scaling Down Deep Learning with MNIST-1D

    Authors: Sam Greydanus, Dmitry Kobak

    Abstract: Although deep learning models have taken on commercial and political relevance, key aspects of their training and operation remain poorly understood. This has sparked interest in science of deep learning projects, many of which require large amounts of time, money, and electricity. But how much of this research really needs to occur at scale? In this paper, we introduce MNIST-1D: a minimalist, pro…

    Submitted 3 June, 2024; v1 submitted 29 November, 2020; originally announced November 2020.

    Comments: 12 pages, 11 figures

    Journal ref: ICML 2024

  9. arXiv:2007.08902  [pdf, other]

    cs.LG stat.ML

    Attraction-Repulsion Spectrum in Neighbor Embeddings

    Authors: Jan Niklas Böhm, Philipp Berens, Dmitry Kobak

    Abstract: Neighbor embeddings are a family of methods for visualizing complex high-dimensional datasets using $k$NN graphs. To find the low-dimensional embedding, these algorithms combine an attractive force between neighboring pairs of points with a repulsive force between all points. One of the most popular examples of such algorithms is t-SNE. Here we empirically show that changing the balance between th…

    Submitted 18 October, 2022; v1 submitted 17 July, 2020; originally announced July 2020.

    Journal ref: JMLR 23(95):1-32, 2022

  10. arXiv:2006.10411  [pdf, other]

    cs.LG stat.ML

    Sparse bottleneck neural networks for exploratory non-linear visualization of Patch-seq data

    Authors: Yves Bernaerts, Philipp Berens, Dmitry Kobak

    Abstract: Patch-seq, a recently developed experimental technique, allows neuroscientists to obtain transcriptomic and electrophysiological information from the same neurons. Efficiently analyzing and visualizing such paired multivariate data in order to extract biologically meaningful interpretations has, however, remained a challenge. Here, we use sparse deep neural networks with and without a two-dimensio…

    Submitted 25 January, 2022; v1 submitted 18 June, 2020; originally announced June 2020.

    Comments: 17 pages, 16 figures

  11. Heavy-tailed kernels reveal a finer cluster structure in t-SNE visualisations

    Authors: Dmitry Kobak, George Linderman, Stefan Steinerberger, Yuval Kluger, Philipp Berens

    Abstract: T-distributed stochastic neighbour embedding (t-SNE) is a widely used data visualisation technique. It differs from its predecessor SNE by the low-dimensional similarity kernel: the Gaussian kernel was replaced by the heavy-tailed Cauchy kernel, solving the "crowding problem" of SNE. Here, we develop an efficient implementation of t-SNE for a $t$-distribution kernel with an arbitrary degree of fre…

    Submitted 4 April, 2019; v1 submitted 15 February, 2019; originally announced February 2019.

    Journal ref: ECML PKDD 2019