Search | arXiv e-print repository

arXiv:2407.19739 [pdf]

Nomadic Non-Public Networks for 6G: Use Cases and Key Performance Indicators

Authors: Daniel Lindenschmitt, Benedikt Veith, Khurshid Alam, Ainur Aurembekova, Michael Gundall, Mohammad Asif Habibi, Bin Han, Dennis Krummacker, Philipp Rosemann, Hans D. Schotten

Abstract: The landscape of wireless communication systems is evolving rapidly, with a pivotal role envisioned for dynamic network structures and self-organizing networks in upcoming technologies like the 6G mobile communications standard. This evolution is fueled by the growing demand from diverse sectors, including industry, manufacturing, agriculture, and the public sector, each with increasingly specific… ▽ More The landscape of wireless communication systems is evolving rapidly, with a pivotal role envisioned for dynamic network structures and self-organizing networks in upcoming technologies like the 6G mobile communications standard. This evolution is fueled by the growing demand from diverse sectors, including industry, manufacturing, agriculture, and the public sector, each with increasingly specific requirements. The establishment of non-public networks in the current 5G standard has laid a foundation, enabling independent operation within certain frequencies and local limitations, notably for Internet of Things applications. This paper explores the progression from non-public networks to nomadic non-public networks and their significance in the context of the forthcoming 6G era. Building on existing work in dynamic network structures, non-public networks regulations, and alternative technological solutions, this paper introduces specific use cases enhanced by nomadic networks. In addition, relevant Key Performance Indicators are discussed on the basis of the presented use cases. These serve as a starting point for the definition of requirement clusters and thus for a evaluation metric of nomadic non-public networks. This work lays the groundwork for understanding the potential of nomadic non-public networks in the dynamic landscape of 6G wireless communication systems. △ Less

Submitted 29 July, 2024; originally announced July 2024.

Comments: 8 pages, 1 figure

arXiv:2407.17710 [pdf, other]

Revisiting Machine Unlearning with Dimensional Alignment

Authors: Seonguk Seo, Dongwan Kim, Bohyung Han

Abstract: Machine unlearning, an emerging research topic focusing on compliance with data privacy regulations, enables trained models to remove the information learned from specific data. While many existing methods indirectly address this issue by intentionally injecting incorrect supervisions, they can drastically and unpredictably alter the decision boundaries and feature spaces, leading to training inst… ▽ More Machine unlearning, an emerging research topic focusing on compliance with data privacy regulations, enables trained models to remove the information learned from specific data. While many existing methods indirectly address this issue by intentionally injecting incorrect supervisions, they can drastically and unpredictably alter the decision boundaries and feature spaces, leading to training instability and undesired side effects. To fundamentally approach this task, we first analyze the changes in latent feature spaces between original and retrained models, and observe that the feature representations of samples not involved in training are closely aligned with the feature manifolds of previously seen samples in training. Based on these findings, we introduce a novel evaluation metric for machine unlearning, coined dimensional alignment, which measures the alignment between the eigenspaces of the forget and retain set samples. We employ this metric as a regularizer loss to build a robust and stable unlearning framework, which is further enhanced by integrating a self-distillation loss and an alternating training scheme. Our framework effectively eliminates information from the forget set and preserves knowledge from the retain set. Lastly, we identify critical flaws in established evaluation metrics for machine unlearning, and introduce new evaluation tools that more accurately reflect the fundamental goals of machine unlearning. △ Less

Submitted 24 July, 2024; originally announced July 2024.

arXiv:2407.14902 [pdf, other]

Infrared magneto-polaritons in MoTe$_2$ mono- and bilayers

Authors: Bo Han, Jamie M. Fitzgerald, Lukas Lackner, Roberto Rosati, Martin Esmann, Falk Eilenberger, Takashi Taniguchi, Kenji Watanabe, Marcin Syperek, Ermin Malic, Christian Schneider

Abstract: MoTe$_2$ monolayers and bilayers are unique within the family of van-der-Waals materials since they pave the way towards atomically thin infrared light-matter quantum interfaces, potentially reaching the important telecommunication windows. Here, we report emergent exciton-polaritons based on MoTe$_2$ monolayer and bilayer in a low-temperature open micro-cavity in a joint experiment-theory study.… ▽ More MoTe$_2$ monolayers and bilayers are unique within the family of van-der-Waals materials since they pave the way towards atomically thin infrared light-matter quantum interfaces, potentially reaching the important telecommunication windows. Here, we report emergent exciton-polaritons based on MoTe$_2$ monolayer and bilayer in a low-temperature open micro-cavity in a joint experiment-theory study. Our experiments clearly evidence both the enhanced oscillator strength and enhanced luminescence of MoTe$_2$ bilayers, signified by a 38 \% increase of the Rabi-splitting and a strongly enhanced relaxation of polaritons to low-energy states. The latter is distinct from polaritons in MoTe$_2$ monolayers, which feature a bottleneck-like relaxation inhibition. Both the polaritonic spin-valley locking in monolayers and the spin-layer locking in bilayers are revealed via the Zeeman effect, which we map and control via the light-matter composition of our polaritonic resonances. △ Less

Submitted 20 July, 2024; originally announced July 2024.

Comments: 6 pages, 3 figures

arXiv:2407.14841 [pdf, other]

Text-based Talking Video Editing with Cascaded Conditional Diffusion

Authors: Bo Han, Heqing Zou, Haoyang Li, Guangcong Wang, Chng Eng Siong

Abstract: Text-based talking-head video editing aims to efficiently insert, delete, and substitute segments of talking videos through a user-friendly text editing approach. It is challenging because of \textbf{1)} generalizable talking-face representation, \textbf{2)} seamless audio-visual transitions, and \textbf{3)} identity-preserved talking faces. Previous works either require minutes of talking-face vi… ▽ More Text-based talking-head video editing aims to efficiently insert, delete, and substitute segments of talking videos through a user-friendly text editing approach. It is challenging because of \textbf{1)} generalizable talking-face representation, \textbf{2)} seamless audio-visual transitions, and \textbf{3)} identity-preserved talking faces. Previous works either require minutes of talking-face video training data and expensive test-time optimization for customized talking video editing or directly generate a video sequence without considering in-context information, leading to a poor generalizable representation, or incoherent transitions, or even inconsistent identity. In this paper, we propose an efficient cascaded conditional diffusion-based framework, which consists of two stages: audio to dense-landmark motion and motion to video. \textit{\textbf{In the first stage}}, we first propose a dynamic weighted in-context diffusion module to synthesize dense-landmark motions given an edited audio. \textit{\textbf{In the second stage}}, we introduce a warping-guided conditional diffusion module. The module first interpolates between the start and end frames of the editing interval to generate smooth intermediate frames. Then, with the help of the audio-to-dense motion images, these intermediate frames are warped to obtain coarse intermediate frames. Conditioned on the warped intermedia frames, a diffusion model is adopted to generate detailed and high-resolution target frames, which guarantees coherent and identity-preserved transitions. The cascaded conditional diffusion model decomposes the complex talking editing task into two flexible generation tasks, which provides a generalizable talking-face representation, seamless audio-visual transitions, and identity-preserved faces on a small dataset. Experiments show the effectiveness and superiority of the proposed method. △ Less

Submitted 20 July, 2024; originally announced July 2024.

arXiv:2407.14129 [pdf, other]

Comparing and Contrasting Deep Learning Weather Prediction Backbones on Navier-Stokes and Atmospheric Dynamics

Authors: Matthias Karlbauer, Danielle C. Maddix, Abdul Fatir Ansari, Boran Han, Gaurav Gupta, Yuyang Wang, Andrew Stuart, Michael W. Mahoney

Abstract: Remarkable progress in the development of Deep Learning Weather Prediction (DLWP) models positions them to become competitive with traditional numerical weather prediction (NWP) models. Indeed, a wide number of DLWP architectures -- based on various backbones, including U-Net, Transformer, Graph Neural Network (GNN), and Fourier Neural Operator (FNO) -- have demonstrated their potential at forecas… ▽ More Remarkable progress in the development of Deep Learning Weather Prediction (DLWP) models positions them to become competitive with traditional numerical weather prediction (NWP) models. Indeed, a wide number of DLWP architectures -- based on various backbones, including U-Net, Transformer, Graph Neural Network (GNN), and Fourier Neural Operator (FNO) -- have demonstrated their potential at forecasting atmospheric states. However, due to differences in training protocols, forecast horizons, and data choices, it remains unclear which (if any) of these methods and architectures are most suitable for weather forecasting and for future model development. Here, we step back and provide a detailed empirical analysis, under controlled conditions, comparing and contrasting the most prominent DLWP models, along with their backbones. We accomplish this by predicting synthetic two-dimensional incompressible Navier-Stokes and real-world global weather dynamics. In terms of accuracy, memory consumption, and runtime, our results illustrate various tradeoffs. For example, on synthetic data, we observe favorable performance of FNO; and on the real-world WeatherBench dataset, our results demonstrate the suitability of ConvLSTM and SwinTransformer for short-to-mid-ranged forecasts. For long-ranged weather rollouts of up to 365 days, we observe superior stability and physical soundness in architectures that formulate a spherical data representation, i.e., GraphCast and Spherical FNO. In addition, we observe that all of these model backbones ``saturate,'' i.e., none of them exhibit so-called neural scaling, which highlights an important direction for future work on these and related models. △ Less

Submitted 19 July, 2024; originally announced July 2024.

arXiv:2407.12076 [pdf, ps, other]

Colored Multiset Eulerian Polynomials

Authors: Danai Deligeorgaki, Bin Han, Liam Solus

Abstract: Colored multiset Eulerian polynomials are a common generalization of MacMahon's multiset Eulerian polynomials and the colored Eulerian polynomials, both of which are known to satisfy well-studied distributional properties including real-rootedness, log-concavity and unimodality. The symmetric colored multiset Eulerian polynomials are characterized and used to prove sufficient conditions for a colo… ▽ More Colored multiset Eulerian polynomials are a common generalization of MacMahon's multiset Eulerian polynomials and the colored Eulerian polynomials, both of which are known to satisfy well-studied distributional properties including real-rootedness, log-concavity and unimodality. The symmetric colored multiset Eulerian polynomials are characterized and used to prove sufficient conditions for a colored multiset Eulerian polynomial to be self-interlacing. The latter property implies the aforementioned distributional properties as well as others, including the alternatingly increasing property and bi-$γ$-positivity. To derive these results, multivariate generalizations of an identity due to MacMahon are deduced. The results are applied to a pair of questions, both previously studied in several special cases, that are seen to admit more general answers when framed in the context of colored multiset Eulerian polynomials. The first question pertains to $s$-Eulerian polynomials, and the second to interpretations of $γ$-coefficients. △ Less

Submitted 16 July, 2024; originally announced July 2024.

Comments: 23 pages

arXiv:2407.08551 [pdf, other]

Autoregressive Speech Synthesis without Vector Quantization

Authors: Lingwei Meng, Long Zhou, Shujie Liu, Sanyuan Chen, Bing Han, Shujie Hu, Yanqing Liu, Jinyu Li, Sheng Zhao, Xixin Wu, Helen Meng, Furu Wei

Abstract: We present MELLE, a novel continuous-valued tokens based language modeling approach for text to speech synthesis (TTS). MELLE autoregressively generates continuous mel-spectrogram frames directly from text condition, bypassing the need for vector quantization, which are originally designed for audio compression and sacrifice fidelity compared to mel-spectrograms. Specifically, (i) instead of cross… ▽ More We present MELLE, a novel continuous-valued tokens based language modeling approach for text to speech synthesis (TTS). MELLE autoregressively generates continuous mel-spectrogram frames directly from text condition, bypassing the need for vector quantization, which are originally designed for audio compression and sacrifice fidelity compared to mel-spectrograms. Specifically, (i) instead of cross-entropy loss, we apply regression loss with a proposed spectrogram flux loss function to model the probability distribution of the continuous-valued tokens. (ii) we have incorporated variational inference into MELLE to facilitate sampling mechanisms, thereby enhancing the output diversity and model robustness. Experiments demonstrate that, compared to the two-stage codec language models VALL-E and its variants, the single-stage MELLE mitigates robustness issues by avoiding the inherent flaws of sampling discrete codes, achieves superior performance across multiple metrics, and, most importantly, offers a more streamlined paradigm. See https://aka.ms/melle for demos of our work. △ Less

Submitted 11 July, 2024; originally announced July 2024.

arXiv:2407.06827 [pdf, ps, other]

On the support of solutions to nonlinear stochastic heat equations

Authors: Beom-Seok Han, Kunwoo Kim, Jaeyun Yi

Abstract: We investigate the strict positivity and the compact support property of solutions to the one-dimensional nonlinear stochastic heat equation: $$\partial_t u(t,x) = \frac{1}{2}\partial^2_x u(t,x) + σ(u(t,x))\dot{W}(t,x), \quad (t,x)\in \mathbf{R}_+\times\mathbf{R},$$ with nonnegative and compactly supported initial data $u_0$, where $\dot{W}$ is the space-time white noise and… ▽ More We investigate the strict positivity and the compact support property of solutions to the one-dimensional nonlinear stochastic heat equation: $$\partial_t u(t,x) = \frac{1}{2}\partial^2_x u(t,x) + σ(u(t,x))\dot{W}(t,x), \quad (t,x)\in \mathbf{R}_+\times\mathbf{R},$$ with nonnegative and compactly supported initial data $u_0$, where $\dot{W}$ is the space-time white noise and $σ:\mathbf{R} \to \mathbf{R} $ is a continuous function with $σ(0)=0$. We prove that (i) if $v/ σ(v)$ is sufficiently large near $v=0$, then the solution $u(t,\cdot)$ is strictly positive for all $t>0$, and (ii) if $v/σ(v)$ is sufficiently small near $v= 0$, then the solution $u(t,\cdot)$ has compact support for all $t>0$. These findings extend previous results concerning the strict positivity and the compact support property, which were analyzed only for the case $σ(u)\approx u^γ$ for $γ>0$. Additionally, we establish the uniqueness of a solution and the weak comparison principle in case (i). △ Less

Submitted 9 July, 2024; originally announced July 2024.

Comments: 31 pages

MSC Class: 60H15; 35R60

arXiv:2407.04557 [pdf, other]

Structural Constraint Integration in Generative Model for Discovery of Quantum Material Candidates

Authors: Ryotaro Okabe, Mouyang Cheng, Abhijatmedhi Chotrattanapituk, Nguyen Tuan Hung, Xiang Fu, Bowen Han, Yao Wang, Weiwei Xie, Robert J. Cava, Tommi S. Jaakkola, Yongqiang Cheng, Mingda Li

Abstract: Billions of organic molecules are known, but only a tiny fraction of the functional inorganic materials have been discovered, a particularly relevant problem to the community searching for new quantum materials. Recent advancements in machine-learning-based generative models, particularly diffusion models, show great promise for generating new, stable materials. However, integrating geometric patt… ▽ More Billions of organic molecules are known, but only a tiny fraction of the functional inorganic materials have been discovered, a particularly relevant problem to the community searching for new quantum materials. Recent advancements in machine-learning-based generative models, particularly diffusion models, show great promise for generating new, stable materials. However, integrating geometric patterns into materials generation remains a challenge. Here, we introduce Structural Constraint Integration in the GENerative model (SCIGEN). Our approach can modify any trained generative diffusion model by strategic masking of the denoised structure with a diffused constrained structure prior to each diffusion step to steer the generation toward constrained outputs. Furthermore, we mathematically prove that SCIGEN effectively performs conditional sampling from the original distribution, which is crucial for generating stable constrained materials. We generate eight million compounds using Archimedean lattices as prototype constraints, with over 10% surviving a multi-staged stability pre-screening. High-throughput density functional theory (DFT) on 26,000 survived compounds shows that over 50% passed structural optimization at the DFT level. Since the properties of quantum materials are closely related to geometric patterns, our results indicate that SCIGEN provides a general framework for generating quantum materials candidates. △ Less

Submitted 5 July, 2024; originally announced July 2024.

Comments: 512 pages total, 4 main figures + 218 supplementary figures

arXiv:2407.04029 [pdf, other]

Robust Learning under Hybrid Noise

Authors: Yang Wei, Shuo Chen, Shanshan Ye, Bo Han, Chen Gong

Abstract: Feature noise and label noise are ubiquitous in practical scenarios, which pose great challenges for training a robust machine learning model. Most previous approaches usually deal with only a single problem of either feature noise or label noise. However, in real-world applications, hybrid noise, which contains both feature noise and label noise, is very common due to the unreliable data collecti… ▽ More Feature noise and label noise are ubiquitous in practical scenarios, which pose great challenges for training a robust machine learning model. Most previous approaches usually deal with only a single problem of either feature noise or label noise. However, in real-world applications, hybrid noise, which contains both feature noise and label noise, is very common due to the unreliable data collection and annotation processes. Although some results have been achieved by a few representation learning based attempts, this issue is still far from being addressed with promising performance and guaranteed theoretical analyses. To address the challenge, we propose a novel unified learning framework called "Feature and Label Recovery" (FLR) to combat the hybrid noise from the perspective of data recovery, where we concurrently reconstruct both the feature matrix and the label matrix of input data. Specifically, the clean feature matrix is discovered by the low-rank approximation, and the ground-truth label matrix is embedded based on the recovered features with a nuclear norm regularization. Meanwhile, the feature noise and label noise are characterized by their respective adaptive matrix norms to satisfy the corresponding maximum likelihood. As this framework leads to a non-convex optimization problem, we develop the non-convex Alternating Direction Method of Multipliers (ADMM) with the convergence guarantee to solve our learning objective. We also provide the theoretical analysis to show that the generalization error of FLR can be upper-bounded in the presence of hybrid noise. Experimental results on several typical benchmark datasets clearly demonstrate the superiority of our proposed method over the state-of-the-art robust learning approaches for various noises. △ Less

Submitted 4 July, 2024; originally announced July 2024.

arXiv:2407.00750 [pdf, other]

Physical Layer Deception with Non-Orthogonal Multiplexing

Authors: Wenwen Chen, Bin Han, Yao Zhu, Anke Schmeink, Giuseppe Caire, Hans D. Schotten

Abstract: Physical layer security (PLS) is a promising technology to secure wireless communications by exploiting the physical properties of the wireless channel. However, the passive nature of PLS creates a significant imbalance between the effort required by eavesdroppers and legitimate users to secure data. To address this imbalance, in this article, we propose a novel framework of physical layer decepti… ▽ More Physical layer security (PLS) is a promising technology to secure wireless communications by exploiting the physical properties of the wireless channel. However, the passive nature of PLS creates a significant imbalance between the effort required by eavesdroppers and legitimate users to secure data. To address this imbalance, in this article, we propose a novel framework of physical layer deception (PLD), which combines PLS with deception technologies to actively counteract wiretapping attempts. Combining a two-stage encoder with randomized ciphering and non-orthogonal multiplexing, the PLD approach enables the wireless communication system to proactively counter eavesdroppers with deceptive messages. Relying solely on the superiority of the legitimate channel over the eavesdropping channel, the PLD framework can effectively protect the confidentiality of the transmitted messages, even against eavesdroppers who possess knowledge equivalent to that of the legitimate receiver. We prove the validity of the PLD framework with in-depth analyses and demonstrate its superiority over conventional PLS approaches with comprehensive numerical benchmarks. △ Less

Submitted 30 June, 2024; originally announced July 2024.

Comments: Submitted to IEEE Transactions on Wireless Communications

arXiv:2406.19930 [pdf]

Exploring 6G Potential for Industrial Digital Twinning and Swarm Intelligence in Obstacle-Rich Environments

Authors: Siyu Yuan, Khurshid Alam, Bin Han, Dennis Krummacker, Hans D. Schotten

Abstract: With the advent of 6G technology, the demand for efficient and intelligent systems in industrial applications has surged, driving the need for advanced solutions in target localization. Utilizing swarm robots to locate unknown targets involves navigating increasingly complex environments. Digital Twinning (DT) offers a robust solution by creating a virtual replica of the physical world, which enha… ▽ More With the advent of 6G technology, the demand for efficient and intelligent systems in industrial applications has surged, driving the need for advanced solutions in target localization. Utilizing swarm robots to locate unknown targets involves navigating increasingly complex environments. Digital Twinning (DT) offers a robust solution by creating a virtual replica of the physical world, which enhances the swarm's navigation capabilities. Our framework leverages DT and integrates Swarm Intelligence to store physical map information in the cloud, enabling robots to efficiently locate unknown targets. The simulation results demonstrate that the DT framework, augmented by Swarm Intelligence, significantly improves target location efficiency in obstacle-rich environments compared to traditional methods. This research underscores the potential of combining DT and Swarm Intelligence to advance the field of robotic navigation and target localization in complex industrial settings. △ Less

Submitted 2 July, 2024; v1 submitted 28 June, 2024; originally announced June 2024.

Comments: Submitted to IEEE VTM

arXiv:2406.18269 [pdf, other]

Refining Potential Energy Surface through Dynamical Properties via Differentiable Molecular Simulation

Authors: Bin Han, Kuang Yu

Abstract: Recently, machine learning potentials (MLP) largely enhances the reliability of molecular dynamics, but its accuracy is limited by the underlying $\textit{ab initio}$ methods. A viable approach to overcome this limitation is to refine the potential by learning from experimental data, which now can be done efficiently using modern automatic differentiation technique. However, potential refinement i… ▽ More Recently, machine learning potentials (MLP) largely enhances the reliability of molecular dynamics, but its accuracy is limited by the underlying $\textit{ab initio}$ methods. A viable approach to overcome this limitation is to refine the potential by learning from experimental data, which now can be done efficiently using modern automatic differentiation technique. However, potential refinement is mostly performed using thermodynamic properties, leaving the most accessible and informative dynamical data (like spectroscopy) unexploited. In this work, through a comprehensive application of adjoint and gradient truncation methods, we show that both memory and gradient explosion issues can be circumvented in many situations, so the dynamical property differentiation is well-behaved. Consequently, both transport coefficients and spectroscopic data can be used to improve the density functional theory based MLP towards higher accuracy. Essentially, this work contributes to the solution of the inverse problem of spectroscopy by extracting microscopic interactions from vibrational spectroscopic data. △ Less

Submitted 27 June, 2024; v1 submitted 26 June, 2024; originally announced June 2024.

arXiv:2406.17196 [pdf, other]

Coded Kalman Filtering over MIMO Gaussian Channels with Feedback

Authors: Barron Han, Oron Sabag, Victoria Kostina, Babak Hassibi

Abstract: We consider the problem of remotely stabilizing a linear dynamical system. In this setting, a sensor co-located with the system communicates the system's state to a controller over a noisy communication channel with feedback. The objective of the controller (decoder) is to use the channel outputs to estimate the vector state with finite zero-delay mean squared error (MSE) at the infinite horizon.… ▽ More We consider the problem of remotely stabilizing a linear dynamical system. In this setting, a sensor co-located with the system communicates the system's state to a controller over a noisy communication channel with feedback. The objective of the controller (decoder) is to use the channel outputs to estimate the vector state with finite zero-delay mean squared error (MSE) at the infinite horizon. It has been shown in [1] that for a vector Gauss-Markov source and either a single-input multiple-output (SIMO) or a multiple-input single-output (MISO) channel, linear codes require the minimum capacity to achieve finite MSE. This paper considers the more general problem of linear zero-delay joint-source channel coding (JSCC) of a vector-valued source over a multiple-input multiple-output (MIMO) Gaussian channel with feedback. We study sufficient and necessary conditions for linear codes to achieve finite MSE. For sufficiency, we introduce a coding scheme where each unstable source mode is allocated to a single channel for estimation. Our proof for the necessity of this scheme relies on a matrix-algebraic conjecture that we prove to be true if either the source or channel is scalar. We show that linear codes achieve finite MSE for a scalar source over a MIMO channel if and only if the best scalar sub-channel can achieve finite MSE. Finally, we provide a new counter-example demonstrating that linear codes are generally sub-optimal for coding over MIMO channels. △ Less

Submitted 24 June, 2024; originally announced June 2024.

Comments: Accepted for presentation at the 2024 IEEE International Symposium on Information Theory

arXiv:2406.13918 [pdf, other]

Are We There Yet? Unravelling Usability Challenges and Opportunities in Collaborative Immersive Analytics for Domain Experts

Authors: Fahim Arsad Nafis, Alexander Rose, Simon Su, Songqing Chen, Bo Han

Abstract: In the ever-evolving discipline of high-dimensional scientific data, collaborative immersive analytics (CIA) offers a promising frontier for domain experts in complex data visualization and interpretation. This research presents a comprehensive framework for conducting usability studies on the extended reality (XR) interface of ParaView, an open-source CIA system. By employing established human-co… ▽ More In the ever-evolving discipline of high-dimensional scientific data, collaborative immersive analytics (CIA) offers a promising frontier for domain experts in complex data visualization and interpretation. This research presents a comprehensive framework for conducting usability studies on the extended reality (XR) interface of ParaView, an open-source CIA system. By employing established human-computer interaction (HCI) principles, including Jakob Nielsen's Usability Heuristics, Cognitive Load Theory, NASA Task Load Index, System Usability Scale, Affordance Theory, and Gulf of Execution and Evaluation, this study aims to identify underlying usability issues and provide guidelines for enhancing user experience in scientific domains. Our findings reveal significant usability challenges in the ParaView XR interface that impede effective teamwork and collaboration. For instance, the lack of synchronous collaboration, limited communication methods, and the absence of role-based data access are critical areas that need attention. Additionally, inadequate error handling, insufficient feedback mechanisms, and limited support resources during application use require extensive improvement to fully utilize the system's potential. Our study suggests potential improvements to overcome the existing usability barriers of the collaborative immersive system. △ Less

Submitted 19 June, 2024; originally announced June 2024.

Comments: Accepted in 26th International Conference on Human-Computer Interaction, HCII 2024, Washington, DC, USA

arXiv:2406.11824 [pdf, other]

Infinigen Indoors: Photorealistic Indoor Scenes using Procedural Generation

Authors: Alexander Raistrick, Lingjie Mei, Karhan Kayan, David Yan, Yiming Zuo, Beining Han, Hongyu Wen, Meenal Parakh, Stamatis Alexandropoulos, Lahav Lipson, Zeyu Ma, Jia Deng

Abstract: We introduce Infinigen Indoors, a Blender-based procedural generator of photorealistic indoor scenes. It builds upon the existing Infinigen system, which focuses on natural scenes, but expands its coverage to indoor scenes by introducing a diverse library of procedural indoor assets, including furniture, architecture elements, appliances, and other day-to-day objects. It also introduces a constrai… ▽ More We introduce Infinigen Indoors, a Blender-based procedural generator of photorealistic indoor scenes. It builds upon the existing Infinigen system, which focuses on natural scenes, but expands its coverage to indoor scenes by introducing a diverse library of procedural indoor assets, including furniture, architecture elements, appliances, and other day-to-day objects. It also introduces a constraint-based arrangement system, which consists of a domain-specific language for expressing diverse constraints on scene composition, and a solver that generates scene compositions that maximally satisfy the constraints. We provide an export tool that allows the generated 3D objects and scenes to be directly used for training embodied agents in real-time simulators such as Omniverse and Unreal. Infinigen Indoors is open-sourced under the BSD license. Please visit https://infinigen.org for code and videos. △ Less

Submitted 17 June, 2024; originally announced June 2024.

Comments: Accepted to CVPR 2024

arXiv:2406.11793 [pdf, other]

FetchBench: A Simulation Benchmark for Robot Fetching

Authors: Beining Han, Meenal Parakh, Derek Geng, Jack A Defay, Luyang Gan, Jia Deng

Abstract: Fetching, which includes approaching, grasping, and retrieving, is a critical challenge for robot manipulation tasks. Existing methods primarily focus on table-top scenarios, which do not adequately capture the complexities of environments where both grasping and planning are essential. To address this gap, we propose a new benchmark FetchBench, featuring diverse procedural scenes that integrate b… ▽ More Fetching, which includes approaching, grasping, and retrieving, is a critical challenge for robot manipulation tasks. Existing methods primarily focus on table-top scenarios, which do not adequately capture the complexities of environments where both grasping and planning are essential. To address this gap, we propose a new benchmark FetchBench, featuring diverse procedural scenes that integrate both grasping and motion planning challenges. Additionally, FetchBench includes a data generation pipeline that collects successful fetch trajectories for use in imitation learning methods. We implement multiple baselines from the traditional sense-plan-act pipeline to end-to-end behavior models. Our empirical analysis reveals that these methods achieve a maximum success rate of only 20%, indicating substantial room for improvement. Additionally, we identify key bottlenecks within the sense-plan-act pipeline and make recommendations based on the systematic analysis. △ Less

Submitted 17 June, 2024; originally announced June 2024.

arXiv:2406.11364 [pdf, other]

AnoPatch: Towards Better Consistency in Machine Anomalous Sound Detection

Authors: Anbai Jiang, Bing Han, Zhiqiang Lv, Yufeng Deng, Wei-Qiang Zhang, Xie Chen, Yanmin Qian, Jia Liu, Pingyi Fan

Abstract: Large pre-trained models have demonstrated dominant performances in multiple areas, where the consistency between pre-training and fine-tuning is the key to success. However, few works reported satisfactory results of pre-trained models for the machine anomalous sound detection (ASD) task. This may be caused by the inconsistency of the pre-trained model and the inductive bias of machine audio, res… ▽ More Large pre-trained models have demonstrated dominant performances in multiple areas, where the consistency between pre-training and fine-tuning is the key to success. However, few works reported satisfactory results of pre-trained models for the machine anomalous sound detection (ASD) task. This may be caused by the inconsistency of the pre-trained model and the inductive bias of machine audio, resulting in inconsistency in data and architecture. Thus, we propose AnoPatch which utilizes a ViT backbone pre-trained on AudioSet and fine-tunes it on machine audio. It is believed that machine audio is more related to audio datasets than speech datasets, and modeling it from patch level suits the sparsity of machine audio. As a result, AnoPatch showcases state-of-the-art (SOTA) performances on the DCASE 2020 ASD dataset and the DCASE 2023 ASD dataset. We also compare multiple pre-trained models and empirically demonstrate that better consistency yields considerable improvement. △ Less

Submitted 17 June, 2024; originally announced June 2024.

Comments: Accepted by INTERSPEECH 2024

arXiv:2406.10881 [pdf, other]

Teaching Large Language Models to Express Knowledge Boundary from Their Own Signals

Authors: Lida Chen, Zujie Liang, Xintao Wang, Jiaqing Liang, Yanghua Xiao, Feng Wei, Jinglei Chen, Zhenghong Hao, Bing Han, Wei Wang

Abstract: Large language models (LLMs) have achieved great success, but their occasional content fabrication, or hallucination, limits their practical application. Hallucination arises because LLMs struggle to admit ignorance due to inadequate training on knowledge boundaries. We call it a limitation of LLMs that they can not accurately express their knowledge boundary, answering questions they know while a… ▽ More Large language models (LLMs) have achieved great success, but their occasional content fabrication, or hallucination, limits their practical application. Hallucination arises because LLMs struggle to admit ignorance due to inadequate training on knowledge boundaries. We call it a limitation of LLMs that they can not accurately express their knowledge boundary, answering questions they know while admitting ignorance to questions they do not know. In this paper, we aim to teach LLMs to recognize and express their knowledge boundary, so they can reduce hallucinations caused by fabricating when they do not know. We propose CoKE, which first probes LLMs' knowledge boundary via internal confidence given a set of questions, and then leverages the probing results to elicit the expression of the knowledge boundary. Extensive experiments show CoKE helps LLMs express knowledge boundaries, answering known questions while declining unknown ones, significantly improving in-domain and out-of-domain performance. △ Less

Submitted 16 June, 2024; originally announced June 2024.

arXiv:2406.09179 [pdf, other]

Unlearning with Control: Assessing Real-world Utility for Large Language Model Unlearning

Authors: Qizhou Wang, Bo Han, Puning Yang, Jianing Zhu, Tongliang Liu, Masashi Sugiyama

Abstract: The compelling goal of eradicating undesirable data behaviors, while preserving usual model functioning, underscores the significance of machine unlearning within the domain of large language models (LLMs). Recent research has begun to approach LLM unlearning via gradient ascent (GA) -- increasing the prediction risk for those training strings targeted to be unlearned, thereby erasing their parame… ▽ More The compelling goal of eradicating undesirable data behaviors, while preserving usual model functioning, underscores the significance of machine unlearning within the domain of large language models (LLMs). Recent research has begun to approach LLM unlearning via gradient ascent (GA) -- increasing the prediction risk for those training strings targeted to be unlearned, thereby erasing their parameterized responses. Despite their simplicity and efficiency, we suggest that GA-based methods face the propensity towards excessive unlearning, resulting in various undesirable model behaviors, such as catastrophic forgetting, that diminish their practical utility. In this paper, we suggest a set of metrics that can capture multiple facets of real-world utility and propose several controlling methods that can regulate the extent of excessive unlearning. Accordingly, we suggest a general framework to better reflect the practical efficacy of various unlearning methods -- we begin by controlling the unlearning procedures/unlearned models such that no excessive unlearning occurs and follow by the evaluation for unlearning efficacy. Our experimental analysis on established benchmarks revealed that GA-based methods are far from perfect in practice, as strong unlearning is at the high cost of hindering the model utility. We conclude that there is still a long way towards practical and effective LLM unlearning, and more efforts are required in this field. △ Less

Submitted 13 June, 2024; originally announced June 2024.

arXiv:2406.09021 [pdf, other]

doi 10.1145/3637528.3671514

Contextual Distillation Model for Diversified Recommendation

Authors: Fan Li, Xu Si, Shisong Tang, Dingmin Wang, Kunyan Han, Bing Han, Guorui Zhou, Yang Song, Hechang Chen

Abstract: The diversity of recommendation is equally crucial as accuracy in improving user experience. Existing studies, e.g., Determinantal Point Process (DPP) and Maximal Marginal Relevance (MMR), employ a greedy paradigm to iteratively select items that optimize both accuracy and diversity. However, prior methods typically exhibit quadratic complexity, limiting their applications to the re-ranking stage… ▽ More The diversity of recommendation is equally crucial as accuracy in improving user experience. Existing studies, e.g., Determinantal Point Process (DPP) and Maximal Marginal Relevance (MMR), employ a greedy paradigm to iteratively select items that optimize both accuracy and diversity. However, prior methods typically exhibit quadratic complexity, limiting their applications to the re-ranking stage and are not applicable to other recommendation stages with a larger pool of candidate items, such as the pre-ranking and ranking stages. In this paper, we propose Contextual Distillation Model (CDM), an efficient recommendation model that addresses diversification, suitable for the deployment in all stages of industrial recommendation pipelines. Specifically, CDM utilizes the candidate items in the same user request as context to enhance the diversification of the results. We propose a contrastive context encoder that employs attention mechanisms to model both positive and negative contexts. For the training of CDM, we compare each target item with its context embedding and utilize the knowledge distillation framework to learn the win probability of each target item under the MMR algorithm, where the teacher is derived from MMR outputs. During inference, ranking is performed through a linear combination of the recommendation and student model scores, ensuring both diversity and efficiency. We perform offline evaluations on two industrial datasets and conduct online A/B test of CDM on the short-video platform KuaiShou. The considerable enhancements observed in both recommendation quality and diversity, as shown by metrics, provide strong superiority for the effectiveness of CDM. △ Less

Submitted 13 June, 2024; originally announced June 2024.

Comments: accepted by KDD 2024

arXiv:2406.08288 [pdf, other]

Decoupling the Class Label and the Target Concept in Machine Unlearning

Authors: Jianing Zhu, Bo Han, Jiangchao Yao, Jianliang Xu, Gang Niu, Masashi Sugiyama

Abstract: Machine unlearning as an emerging research topic for data regulations, aims to adjust a trained model to approximate a retrained one that excludes a portion of training data. Previous studies showed that class-wise unlearning is successful in forgetting the knowledge of a target class, through gradient ascent on the forgetting data or fine-tuning with the remaining data. However, while these metho… ▽ More Machine unlearning as an emerging research topic for data regulations, aims to adjust a trained model to approximate a retrained one that excludes a portion of training data. Previous studies showed that class-wise unlearning is successful in forgetting the knowledge of a target class, through gradient ascent on the forgetting data or fine-tuning with the remaining data. However, while these methods are useful, they are insufficient as the class label and the target concept are often considered to coincide. In this work, we decouple them by considering the label domain mismatch and investigate three problems beyond the conventional all matched forgetting, e.g., target mismatch, model mismatch, and data mismatch forgetting. We systematically analyze the new challenges in restrictively forgetting the target concept and also reveal crucial forgetting dynamics in the representation level to realize these tasks. Based on that, we propose a general framework, namely, TARget-aware Forgetting (TARF). It enables the additional tasks to actively forget the target concept while maintaining the rest part, by simultaneously conducting annealed gradient ascent on the forgetting data and selected gradient descent on the hard-to-affect remaining data. Empirically, various experiments under the newly introduced settings are conducted to demonstrate the effectiveness of our TARF. △ Less

Submitted 16 June, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

arXiv:2406.07955 [pdf, other]

How Interpretable Are Interpretable Graph Neural Networks?

Authors: Yongqiang Chen, Yatao Bian, Bo Han, James Cheng

Abstract: Interpretable graph neural networks (XGNNs ) are widely adopted in various scientific applications involving graph-structured data. Existing XGNNs predominantly adopt the attention-based mechanism to learn edge or node importance for extracting and making predictions with the interpretable subgraph. However, the representational properties and limitations of these methods remain inadequately explo… ▽ More Interpretable graph neural networks (XGNNs ) are widely adopted in various scientific applications involving graph-structured data. Existing XGNNs predominantly adopt the attention-based mechanism to learn edge or node importance for extracting and making predictions with the interpretable subgraph. However, the representational properties and limitations of these methods remain inadequately explored. In this work, we present a theoretical framework that formulates interpretable subgraph learning with the multilinear extension of the subgraph distribution, coined as subgraph multilinear extension (SubMT). Extracting the desired interpretable subgraph requires an accurate approximation of SubMT, yet we find that the existing XGNNs can have a huge gap in fitting SubMT. Consequently, the SubMT approximation failure will lead to the degenerated interpretability of the extracted subgraphs. To mitigate the issue, we design a new XGNN architecture called Graph Multilinear neT (GMT), which is provably more powerful in approximating SubMT. We empirically validate our theoretical findings on a number of graph classification benchmarks. The results demonstrate that GMT outperforms the state-of-the-art up to 10% in terms of both interpretability and generalizability across 12 regular and geometric graph benchmarks. △ Less

Submitted 12 June, 2024; originally announced June 2024.

Comments: ICML2024, 44 pages, 21 figures, 12 tables

arXiv:2406.07855 [pdf, other]

VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment

Authors: Bing Han, Long Zhou, Shujie Liu, Sanyuan Chen, Lingwei Meng, Yanming Qian, Yanqing Liu, Sheng Zhao, Jinyu Li, Furu Wei

Abstract: With the help of discrete neural audio codecs, large language models (LLM) have increasingly been recognized as a promising methodology for zero-shot Text-to-Speech (TTS) synthesis. However, sampling based decoding strategies bring astonishing diversity to generation, but also pose robustness issues such as typos, omissions and repetition. In addition, the high sampling rate of audio also brings h… ▽ More With the help of discrete neural audio codecs, large language models (LLM) have increasingly been recognized as a promising methodology for zero-shot Text-to-Speech (TTS) synthesis. However, sampling based decoding strategies bring astonishing diversity to generation, but also pose robustness issues such as typos, omissions and repetition. In addition, the high sampling rate of audio also brings huge computational overhead to the inference process of autoregression. To address these issues, we propose VALL-E R, a robust and efficient zero-shot TTS system, building upon the foundation of VALL-E. Specifically, we introduce a phoneme monotonic alignment strategy to strengthen the connection between phonemes and acoustic sequence, ensuring a more precise alignment by constraining the acoustic tokens to match their associated phonemes. Furthermore, we employ a codec-merging approach to downsample the discrete codes in shallow quantization layer, thereby accelerating the decoding speed while preserving the high quality of speech output. Benefiting from these strategies, VALL-E R obtains controllablity over phonemes and demonstrates its strong robustness by approaching the WER of ground truth. In addition, it requires fewer autoregressive steps, with over 60% time reduction during inference. This research has the potential to be applied to meaningful projects, including the creation of speech for those affected by aphasia. Audio samples will be available at: https://aka.ms/valler. △ Less

Submitted 12 June, 2024; originally announced June 2024.

Comments: 15 pages, 5 figures

arXiv:2406.07337 [pdf, other]

Transferring Knowledge from Large Foundation Models to Small Downstream Models

Authors: Shikai Qiu, Boran Han, Danielle C. Maddix, Shuai Zhang, Yuyang Wang, Andrew Gordon Wilson

Abstract: How do we transfer the relevant knowledge from ever larger foundation models into small, task-specific downstream models that can run at much lower costs? Standard transfer learning using pre-trained weights as the initialization transfers limited information and commits us to often massive pre-trained architectures. This procedure also precludes combining multiple pre-trained models that learn co… ▽ More How do we transfer the relevant knowledge from ever larger foundation models into small, task-specific downstream models that can run at much lower costs? Standard transfer learning using pre-trained weights as the initialization transfers limited information and commits us to often massive pre-trained architectures. This procedure also precludes combining multiple pre-trained models that learn complementary information. To address these shortcomings, we introduce Adaptive Feature Transfer (AFT). Instead of transferring weights, AFT operates purely on features, thereby decoupling the choice of the pre-trained model from the smaller downstream model. Rather than indiscriminately compressing all pre-trained features, AFT adaptively transfers pre-trained features that are most useful for performing the downstream task, using a simple regularization that adds minimal overhead. Across multiple vision, language, and multi-modal datasets, AFT achieves significantly better downstream performance compared to alternatives with a similar computational cost. Furthermore, AFT reliably translates improvement in pre-trained models into improvement in downstream performance, even if the downstream model is over $50\times$ smaller, and can effectively transfer complementary information learned by multiple pre-trained models. △ Less

Submitted 11 June, 2024; originally announced June 2024.

Comments: ICML 2024. Code available at https://github.com/amazon-science/adaptive-feature-transfer

arXiv:2406.07006 [pdf, other]

MIPI 2024 Challenge on Few-shot RAW Image Denoising: Methods and Results

Authors: Xin Jin, Chunle Guo, Xiaoming Li, Zongsheng Yue, Chongyi Li, Shangchen Zhou, Ruicheng Feng, Yuekun Dai, Peiqing Yang, Chen Change Loy, Ruoqi Li, Chang Liu, Ziyi Wang, Yao Du, Jingjing Yang, Long Bao, Heng Sun, Xiangyu Kong, Xiaoxia Xing, Jinlong Wu, Yuanyang Xue, Hyunhee Park, Sejun Song, Changho Kim, Jingfan Tan , et al. (17 additional authors not shown)

Abstract: The increasing demand for computational photography and imaging on mobile platforms has led to the widespread development and integration of advanced image sensors with novel algorithms in camera systems. However, the scarcity of high-quality data for research and the rare opportunity for in-depth exchange of views from industry and academia constrain the development of mobile intelligent photogra… ▽ More The increasing demand for computational photography and imaging on mobile platforms has led to the widespread development and integration of advanced image sensors with novel algorithms in camera systems. However, the scarcity of high-quality data for research and the rare opportunity for in-depth exchange of views from industry and academia constrain the development of mobile intelligent photography and imaging (MIPI). Building on the achievements of the previous MIPI Workshops held at ECCV 2022 and CVPR 2023, we introduce our third MIPI challenge including three tracks focusing on novel image sensors and imaging algorithms. In this paper, we summarize and review the Few-shot RAW Image Denoising track on MIPI 2024. In total, 165 participants were successfully registered, and 7 teams submitted results in the final testing phase. The developed solutions in this challenge achieved state-of-the-art erformance on Few-shot RAW Image Denoising. More details of this challenge and the link to the dataset can be found at https://mipichallenge.org/MIPI2024. △ Less

Submitted 11 June, 2024; originally announced June 2024.

Comments: CVPR 2024 Mobile Intelligent Photography and Imaging (MIPI) Workshop--Few-shot RAWImage Denoising Challenge Report. Website: https://mipi-challenge.org/MIPI2024/

arXiv:2406.04496 [pdf, other]

Time Sensitive Knowledge Editing through Efficient Finetuning

Authors: Xiou Ge, Ali Mousavi, Edouard Grave, Armand Joulin, Kun Qian, Benjamin Han, Mostafa Arefiyan, Yunyao Li

Abstract: Large Language Models (LLMs) have demonstrated impressive capability in different tasks and are bringing transformative changes to many domains. However, keeping the knowledge in LLMs up-to-date remains a challenge once pretraining is complete. It is thus essential to design effective methods to both update obsolete knowledge and induce new knowledge into LLMs. Existing locate-and-edit knowledge e… ▽ More Large Language Models (LLMs) have demonstrated impressive capability in different tasks and are bringing transformative changes to many domains. However, keeping the knowledge in LLMs up-to-date remains a challenge once pretraining is complete. It is thus essential to design effective methods to both update obsolete knowledge and induce new knowledge into LLMs. Existing locate-and-edit knowledge editing (KE) method suffers from two limitations. First, the post-edit LLMs by such methods generally have poor capability in answering complex queries that require multi-hop reasoning. Second, the long run-time of such locate-and-edit methods to perform knowledge edits make it infeasible for large scale KE in practice. In this paper, we explore Parameter-Efficient Fine-Tuning (PEFT) techniques as an alternative for KE. We curate a more comprehensive temporal KE dataset with both knowledge update and knowledge injection examples for KE performance benchmarking. We further probe the effect of fine-tuning on a range of layers in an LLM for the multi-hop QA task. We find that PEFT performs better than locate-and-edit techniques for time-sensitive knowledge edits. △ Less

Submitted 22 July, 2024; v1 submitted 6 June, 2024; originally announced June 2024.

Comments: ACL 2024 main

arXiv:2406.03631 [pdf, other]

Discovering Bias in Latent Space: An Unsupervised Debiasing Approach

Authors: Dyah Adila, Shuai Zhang, Boran Han, Yuyang Wang

Abstract: The question-answering (QA) capabilities of foundation models are highly sensitive to prompt variations, rendering their performance susceptible to superficial, non-meaning-altering changes. This vulnerability often stems from the model's preference or bias towards specific input characteristics, such as option position or superficial image features in multi-modal settings. We propose to rectify t… ▽ More The question-answering (QA) capabilities of foundation models are highly sensitive to prompt variations, rendering their performance susceptible to superficial, non-meaning-altering changes. This vulnerability often stems from the model's preference or bias towards specific input characteristics, such as option position or superficial image features in multi-modal settings. We propose to rectify this bias directly in the model's internal representation. Our approach, SteerFair, finds the bias direction in the model's representation space and steers activation values away from it during inference. Specifically, we exploit the observation that bias often adheres to simple association rules, such as the spurious association between the first option and correctness likelihood. Next, we construct demonstrations of these rules from unlabeled samples and use them to identify the bias directions. We empirically show that SteerFair significantly reduces instruction-tuned model performance variance across prompt modifications on three benchmark tasks. Remarkably, our approach surpasses a supervised baseline with 100 labels by an average of 10.86% accuracy points and 12.95 score points and matches the performance with 500 labels. △ Less

Submitted 5 June, 2024; originally announced June 2024.

Journal ref: ICML 2024

arXiv:2406.00806 [pdf, other]

Envisioning Outlier Exposure by Large Language Models for Out-of-Distribution Detection

Authors: Chentao Cao, Zhun Zhong, Zhanke Zhou, Yang Liu, Tongliang Liu, Bo Han

Abstract: Detecting out-of-distribution (OOD) samples is essential when deploying machine learning models in open-world scenarios. Zero-shot OOD detection, requiring no training on in-distribution (ID) data, has been possible with the advent of vision-language models like CLIP. Existing methods build a text-based classifier with only closed-set labels. However, this largely restricts the inherent capability… ▽ More Detecting out-of-distribution (OOD) samples is essential when deploying machine learning models in open-world scenarios. Zero-shot OOD detection, requiring no training on in-distribution (ID) data, has been possible with the advent of vision-language models like CLIP. Existing methods build a text-based classifier with only closed-set labels. However, this largely restricts the inherent capability of CLIP to recognize samples from large and open label space. In this paper, we propose to tackle this constraint by leveraging the expert knowledge and reasoning capability of large language models (LLM) to Envision potential Outlier Exposure, termed EOE, without access to any actual OOD data. Owing to better adaptation to open-world scenarios, EOE can be generalized to different tasks, including far, near, and fine-grained OOD detection. Technically, we design (1) LLM prompts based on visual similarity to generate potential outlier class labels specialized for OOD detection, as well as (2) a new score function based on potential outlier penalty to distinguish hard OOD samples effectively. Empirically, EOE achieves state-of-the-art performance across different OOD tasks and can be effectively scaled to the ImageNet-1K dataset. The code is publicly available at: https://github.com/tmlr-group/EOE. △ Less

Submitted 2 June, 2024; originally announced June 2024.

Comments: ICML 2024

arXiv:2405.19919 [pdf, other]

Unraveling the Impact of Heterophilic Structures on Graph Positive-Unlabeled Learning

Authors: Yuhao Wu, Jiangchao Yao, Bo Han, Lina Yao, Tongliang Liu

Abstract: While Positive-Unlabeled (PU) learning is vital in many real-world scenarios, its application to graph data still remains under-explored. We unveil that a critical challenge for PU learning on graph lies on the edge heterophily, which directly violates the irreducibility assumption for Class-Prior Estimation (class prior is essential for building PU learning algorithms) and degenerates the latent… ▽ More While Positive-Unlabeled (PU) learning is vital in many real-world scenarios, its application to graph data still remains under-explored. We unveil that a critical challenge for PU learning on graph lies on the edge heterophily, which directly violates the irreducibility assumption for Class-Prior Estimation (class prior is essential for building PU learning algorithms) and degenerates the latent label inference on unlabeled nodes during classifier training. In response to this challenge, we introduce a new method, named Graph PU Learning with Label Propagation Loss (GPL). Specifically, GPL considers learning from PU nodes along with an intermediate heterophily reduction, which helps mitigate the negative impact of the heterophilic structure. We formulate this procedure as a bilevel optimization that reduces heterophily in the inner loop and efficiently learns a classifier in the outer loop. Extensive experiments across a variety of datasets have shown that GPL significantly outperforms baseline methods, confirming its effectiveness and superiority. △ Less

Submitted 1 June, 2024; v1 submitted 30 May, 2024; originally announced May 2024.

Comments: ICML 2024

arXiv:2405.18972 [pdf, other]

Federated Learning with Bilateral Curation for Partially Class-Disjoint Data

Authors: Ziqing Fan, Ruipeng Zhang, Jiangchao Yao, Bo Han, Ya Zhang, Yanfeng Wang

Abstract: Partially class-disjoint data (PCDD), a common yet under-explored data formation where each client contributes a part of classes (instead of all classes) of samples, severely challenges the performance of federated algorithms. Without full classes, the local objective will contradict the global objective, yielding the angle collapse problem for locally missing classes and the space waste problem f… ▽ More Partially class-disjoint data (PCDD), a common yet under-explored data formation where each client contributes a part of classes (instead of all classes) of samples, severely challenges the performance of federated algorithms. Without full classes, the local objective will contradict the global objective, yielding the angle collapse problem for locally missing classes and the space waste problem for locally existing classes. As far as we know, none of the existing methods can intrinsically mitigate PCDD challenges to achieve holistic improvement in the bilateral views (both global view and local view) of federated learning. To address this dilemma, we are inspired by the strong generalization of simplex Equiangular Tight Frame~(ETF) on the imbalanced data, and propose a novel approach called FedGELA where the classifier is globally fixed as a simplex ETF while locally adapted to the personal distributions. Globally, FedGELA provides fair and equal discrimination for all classes and avoids inaccurate updates of the classifier, while locally it utilizes the space of locally missing classes for locally existing classes. We conduct extensive experiments on a range of datasets to demonstrate that our FedGELA achieves promising performance~(averaged improvement of 3.9% to FedAvg and 1.5% to best baselines) and provide both local and global convergence guarantees. Source code is available at:https://github.com/MediaBrain-SJTU/FedGELA.git. △ Less

Submitted 29 May, 2024; originally announced May 2024.

arXiv:2405.18786 [pdf, other]

MOKD: Cross-domain Finetuning for Few-shot Classification via Maximizing Optimized Kernel Dependence

Authors: Hongduan Tian, Feng Liu, Tongliang Liu, Bo Du, Yiu-ming Cheung, Bo Han

Abstract: In cross-domain few-shot classification, \emph{nearest centroid classifier} (NCC) aims to learn representations to construct a metric space where few-shot classification can be performed by measuring the similarities between samples and the prototype of each class. An intuition behind NCC is that each sample is pulled closer to the class centroid it belongs to while pushed away from those of other… ▽ More In cross-domain few-shot classification, \emph{nearest centroid classifier} (NCC) aims to learn representations to construct a metric space where few-shot classification can be performed by measuring the similarities between samples and the prototype of each class. An intuition behind NCC is that each sample is pulled closer to the class centroid it belongs to while pushed away from those of other classes. However, in this paper, we find that there exist high similarities between NCC-learned representations of two samples from different classes. In order to address this problem, we propose a bi-level optimization framework, \emph{maximizing optimized kernel dependence} (MOKD) to learn a set of class-specific representations that match the cluster structures indicated by labeled data of the given task. Specifically, MOKD first optimizes the kernel adopted in \emph{Hilbert-Schmidt independence criterion} (HSIC) to obtain the optimized kernel HSIC (opt-HSIC) that can capture the dependence more precisely. Then, an optimization problem regarding the opt-HSIC is addressed to simultaneously maximize the dependence between representations and labels and minimize the dependence among all samples. Extensive experiments on Meta-Dataset demonstrate that MOKD can not only achieve better generalization performance on unseen domains in most cases but also learn better data representation clusters. The project repository of MOKD is available at: \href{https://github.com/tmlr-group/MOKD}{https://github.com/tmlr-group/MOKD}. △ Less

Submitted 29 May, 2024; originally announced May 2024.

arXiv:2405.16996 [pdf, other]

Mitigating Noisy Correspondence by Geometrical Structure Consistency Learning

Authors: Zihua Zhao, Mengxi Chen, Tianjie Dai, Jiangchao Yao, Bo han, Ya Zhang, Yanfeng Wang

Abstract: Noisy correspondence that refers to mismatches in cross-modal data pairs, is prevalent on human-annotated or web-crawled datasets. Prior approaches to leverage such data mainly consider the application of uni-modal noisy label learning without amending the impact on both cross-modal and intra-modal geometrical structures in multimodal learning. Actually, we find that both structures are effective… ▽ More Noisy correspondence that refers to mismatches in cross-modal data pairs, is prevalent on human-annotated or web-crawled datasets. Prior approaches to leverage such data mainly consider the application of uni-modal noisy label learning without amending the impact on both cross-modal and intra-modal geometrical structures in multimodal learning. Actually, we find that both structures are effective to discriminate noisy correspondence through structural differences when being well-established. Inspired by this observation, we introduce a Geometrical Structure Consistency (GSC) method to infer the true correspondence. Specifically, GSC ensures the preservation of geometrical structures within and between modalities, allowing for the accurate discrimination of noisy samples based on structural differences. Utilizing these inferred true correspondence labels, GSC refines the learning of geometrical structures by filtering out the noisy samples. Experiments across four cross-modal datasets confirm that GSC effectively identifies noisy samples and significantly outperforms the current leading methods. △ Less

Submitted 27 May, 2024; originally announced May 2024.

Comments: 10 pages, 5 figures, received by IEEE/CVF Computer Science and Pattern Recognition

arXiv:2405.16820 [pdf, other]

doi 10.1145/3630106.3658966

Laboratory-Scale AI: Open-Weight Models are Competitive with ChatGPT Even in Low-Resource Settings

Authors: Robert Wolfe, Isaac Slaughter, Bin Han, Bingbing Wen, Yiwei Yang, Lucas Rosenblatt, Bernease Herman, Eva Brown, Zening Qu, Nic Weber, Bill Howe

Abstract: The rapid proliferation of generative AI has raised questions about the competitiveness of lower-parameter, locally tunable, open-weight models relative to high-parameter, API-guarded, closed-weight models in terms of performance, domain adaptation, cost, and generalization. Centering under-resourced yet risk-intolerant settings in government, research, and healthcare, we see for-profit closed-wei… ▽ More The rapid proliferation of generative AI has raised questions about the competitiveness of lower-parameter, locally tunable, open-weight models relative to high-parameter, API-guarded, closed-weight models in terms of performance, domain adaptation, cost, and generalization. Centering under-resourced yet risk-intolerant settings in government, research, and healthcare, we see for-profit closed-weight models as incompatible with requirements for transparency, privacy, adaptability, and standards of evidence. Yet the performance penalty in using open-weight models, especially in low-data and low-resource settings, is unclear. We assess the feasibility of using smaller, open-weight models to replace GPT-4-Turbo in zero-shot, few-shot, and fine-tuned regimes, assuming access to only a single, low-cost GPU. We assess value-sensitive issues around bias, privacy, and abstention on three additional tasks relevant to those topics. We find that with relatively low effort, very low absolute monetary cost, and relatively little data for fine-tuning, small open-weight models can achieve competitive performance in domain-adapted tasks without sacrificing generality. We then run experiments considering practical issues in bias, privacy, and hallucination risk, finding that open models offer several benefits over closed models. We intend this work as a case study in understanding the opportunity cost of reproducibility and transparency over for-profit state-of-the-art zero shot performance, finding this cost to be marginal under realistic settings. △ Less

Submitted 27 May, 2024; originally announced May 2024.

Comments: Accepted at the ACM Conference on Fairness, Accountability, and Transparency (FAccT) 2024

arXiv:2405.16262 [pdf, other]

Layer-Aware Analysis of Catastrophic Overfitting: Revealing the Pseudo-Robust Shortcut Dependency

Authors: Runqi Lin, Chaojian Yu, Bo Han, Hang Su, Tongliang Liu

Abstract: Catastrophic overfitting (CO) presents a significant challenge in single-step adversarial training (AT), manifesting as highly distorted deep neural networks (DNNs) that are vulnerable to multi-step adversarial attacks. However, the underlying factors that lead to the distortion of decision boundaries remain unclear. In this work, we delve into the specific changes within different DNN layers and… ▽ More Catastrophic overfitting (CO) presents a significant challenge in single-step adversarial training (AT), manifesting as highly distorted deep neural networks (DNNs) that are vulnerable to multi-step adversarial attacks. However, the underlying factors that lead to the distortion of decision boundaries remain unclear. In this work, we delve into the specific changes within different DNN layers and discover that during CO, the former layers are more susceptible, experiencing earlier and greater distortion, while the latter layers show relative insensitivity. Our analysis further reveals that this increased sensitivity in former layers stems from the formation of pseudo-robust shortcuts, which alone can impeccably defend against single-step adversarial attacks but bypass genuine-robust learning, resulting in distorted decision boundaries. Eliminating these shortcuts can partially restore robustness in DNNs from the CO state, thereby verifying that dependence on them triggers the occurrence of CO. This understanding motivates us to implement adaptive weight perturbations across different layers to hinder the generation of pseudo-robust shortcuts, consequently mitigating CO. Extensive experiments demonstrate that our proposed method, Layer-Aware Adversarial Weight Perturbation (LAP), can effectively prevent CO and further enhance robustness. △ Less

Submitted 25 May, 2024; originally announced May 2024.

arXiv:2405.12698 [pdf, other]

Correlated magnetism of moiré exciton-polaritons on a triangular electron-spin lattice

Authors: Johannes Scherzer, Lukas Lackner, Bo Han, Borislav Polovnikov, Lukas Husel, Jonas Göser, Zhijie Li, Jens-Christian Drawer, Martin Esmann, Christoph Bennenhei, Falk Eilenberger, Kenji Watanabe, Takashi Taniguchi, Anvar S. Baimuratov, Christian Schneider, Alexander Högele

Abstract: We demonstrate evidence of correlated magnetism for exciton-polaritons in a MoSe$_{2}$/WS$_{2}$ moiré heterostructure with near-parallel alignment subject to electron doping. In our experiments, interactions between electrons and moiré excitons are controlled electrostatically by field-effect doping, and the polaritonic regime of strong light-matter coupling is established in an open cryogenic mic… ▽ More We demonstrate evidence of correlated magnetism for exciton-polaritons in a MoSe$_{2}$/WS$_{2}$ moiré heterostructure with near-parallel alignment subject to electron doping. In our experiments, interactions between electrons and moiré excitons are controlled electrostatically by field-effect doping, and the polaritonic regime of strong light-matter coupling is established in an open cryogenic microcavity. Remarkably, at filling fractions around one electron per moiré cell, we observe drastic and nonlinear enhancement of the effective polariton Landé factor as a hallmark of correlated magnetism, which is cavity-controlled via resonance tuning of light and matter polariton constituents. Our work establishes moiré van der Waals heterostructures as an outstanding platform for studies of correlated phenomena in the presence of strong light-matter coupling and many-body phases of lattice-ordered excitons, charges and spins. △ Less

Submitted 21 May, 2024; originally announced May 2024.

arXiv:2405.11969 [pdf, ps, other]

Sobolev regularity theory for stochastic reaction-diffusion-advection equations with spatially homogeneous colored noises and variable-order nonlocal operators

Authors: Jae-Hwan Choi, Beom-Seok Han, Daehan Park

Abstract: This article investigates the existence, uniqueness, and regularity of solutions to nonlinear stochastic reaction-diffusion-advection equations (SRDAEs) with spatially homogeneous colored noises and variable-order nonlocal operators in mixed norm $L_q(L_p)$-spaces. We introduce a new condition (strongly reinforced Dalang's condition) on colored noise, which facilitates a deeper understanding of th… ▽ More This article investigates the existence, uniqueness, and regularity of solutions to nonlinear stochastic reaction-diffusion-advection equations (SRDAEs) with spatially homogeneous colored noises and variable-order nonlocal operators in mixed norm $L_q(L_p)$-spaces. We introduce a new condition (strongly reinforced Dalang's condition) on colored noise, which facilitates a deeper understanding of the complicated relation between nonlinearities and stochastic forces. Additionally, we establish the space-time Hölder type regularity of solutions. △ Less

Submitted 20 May, 2024; originally announced May 2024.

Comments: 47 pages

MSC Class: 60H15; 35R60

arXiv:2405.11473 [pdf, other]

FIFO-Diffusion: Generating Infinite Videos from Text without Training

Authors: Jihwan Kim, Junoh Kang, Jinyoung Choi, Bohyung Han

Abstract: We propose a novel inference technique based on a pretrained diffusion model for text-conditional video generation. Our approach, called FIFO-Diffusion, is conceptually capable of generating infinitely long videos without additional training. This is achieved by iteratively performing diagonal denoising, which concurrently processes a series of consecutive frames with increasing noise levels in a… ▽ More We propose a novel inference technique based on a pretrained diffusion model for text-conditional video generation. Our approach, called FIFO-Diffusion, is conceptually capable of generating infinitely long videos without additional training. This is achieved by iteratively performing diagonal denoising, which concurrently processes a series of consecutive frames with increasing noise levels in a queue; our method dequeues a fully denoised frame at the head while enqueuing a new random noise frame at the tail. However, diagonal denoising is a double-edged sword as the frames near the tail can take advantage of cleaner ones by forward reference but such a strategy induces the discrepancy between training and inference. Hence, we introduce latent partitioning to reduce the training-inference gap and lookahead denoising to leverage the benefit of forward referencing. Practically, FIFO-Diffusion consumes a constant amount of memory regardless of the target video length given a baseline model, while well-suited for parallel inference on multiple GPUs. We have demonstrated the promising results and effectiveness of the proposed methods on existing text-to-video generation baselines. Generated video samples and source codes are available at our project page. △ Less

Submitted 12 June, 2024; v1 submitted 19 May, 2024; originally announced May 2024.

Comments: Project Page: https://jjihwan.github.io/projects/FIFO-Diffusion

arXiv:2405.10422 [pdf, other]

A First Look at Immersive Telepresence on Apple Vision Pro

Authors: Ruizhi Cheng, Nan Wu, Matteo Varvello, Eugene Chai, Songqing Chen, Bo Han

Abstract: Due to the widespread adoption of "work-from-home" policies, videoconferencing applications (e.g., Zoom) have become indispensable for remote communication. However, these systems lack immersiveness, leading to the so-called "Zoom fatigue" and degrading communication efficiency. The recent debut of Apple Vision Pro, a mixed reality headset that supports "spatial persona", aims to offer an immersiv… ▽ More Due to the widespread adoption of "work-from-home" policies, videoconferencing applications (e.g., Zoom) have become indispensable for remote communication. However, these systems lack immersiveness, leading to the so-called "Zoom fatigue" and degrading communication efficiency. The recent debut of Apple Vision Pro, a mixed reality headset that supports "spatial persona", aims to offer an immersive telepresence experience with these applications. In this paper, we conduct a first-of-its-kind in-depth and empirical study to analyze the performance of immersive telepresence with four applications, Apple FaceTime, Cisco Webex, Microsoft Teams, and Zoom, on Vision Pro. We find that only FaceTime provides a truly immersive experience with spatial personas, whereas other applications still operate 2D personas. Our measurement results reveal that (1) FaceTime delivers semantic information to optimize bandwidth consumption, which is even lower than that of 2D persona for other applications, and (2) it employs visibility-aware optimizations to reduce rendering overhead. However, the scalability of FaceTime remains limited, with a simple server allocation strategy that potentially leads to high network delay among users. △ Less

Submitted 16 May, 2024; originally announced May 2024.

arXiv:2405.09892 [pdf, other]

Balancing Similarity and Complementarity for Federated Learning

Authors: Kunda Yan, Sen Cui, Abudukelimu Wuerkaixi, Jingfeng Zhang, Bo Han, Gang Niu, Masashi Sugiyama, Changshui Zhang

Abstract: In mobile and IoT systems, Federated Learning (FL) is increasingly important for effectively using data while maintaining user privacy. One key challenge in FL is managing statistical heterogeneity, such as non-i.i.d. data, arising from numerous clients and diverse data sources. This requires strategic cooperation, often with clients having similar characteristics. However, we are interested in a… ▽ More In mobile and IoT systems, Federated Learning (FL) is increasingly important for effectively using data while maintaining user privacy. One key challenge in FL is managing statistical heterogeneity, such as non-i.i.d. data, arising from numerous clients and diverse data sources. This requires strategic cooperation, often with clients having similar characteristics. However, we are interested in a fundamental question: does achieving optimal cooperation necessarily entail cooperating with the most similar clients? Typically, significant model performance improvements are often realized not by partnering with the most similar models, but through leveraging complementary data. Our theoretical and empirical analyses suggest that optimal cooperation is achieved by enhancing complementarity in feature distribution while restricting the disparity in the correlation between features and targets. Accordingly, we introduce a novel framework, \texttt{FedSaC}, which balances similarity and complementarity in FL cooperation. Our framework aims to approximate an optimal cooperation network for each client by optimizing a weighted sum of model similarity and feature complementarity. The strength of \texttt{FedSaC} lies in its adaptability to various levels of data heterogeneity and multimodal scenarios. Our comprehensive unimodal and multimodal experiments demonstrate that \texttt{FedSaC} markedly surpasses other state-of-the-art FL methods. △ Less

Submitted 16 May, 2024; originally announced May 2024.

arXiv:2405.09245 [pdf, other]

A Robust UAV-Based Approach for Power-Modulated Jammer Localization Using DoA

Authors: Zexin Fang, Bin Han, Hans D. Schotten

Abstract: Unmanned aerial vehicles (UAVs) are well-suited to localize jammers, particularly when jammers are at non-terrestrial locations, where conventional detection methods face challenges. In this work we propose a novel localization method, sample pruning gradient descend (SPGD), which offers robust performance against multiple power-modulated jammers with low computational complexity. Unmanned aerial vehicles (UAVs) are well-suited to localize jammers, particularly when jammers are at non-terrestrial locations, where conventional detection methods face challenges. In this work we propose a novel localization method, sample pruning gradient descend (SPGD), which offers robust performance against multiple power-modulated jammers with low computational complexity. △ Less

Submitted 21 May, 2024; v1 submitted 15 May, 2024; originally announced May 2024.

Comments: Submitted to the 2024 IEEE 100th Vehicular Technology Conference (VTC2024-Fall)

arXiv:2405.07780 [pdf, other]

Harnessing Hierarchical Label Distribution Variations in Test Agnostic Long-tail Recognition

Authors: Zhiyong Yang, Qianqian Xu, Zitai Wang, Sicong Li, Boyu Han, Shilong Bao, Xiaochun Cao, Qingming Huang

Abstract: This paper explores test-agnostic long-tail recognition, a challenging long-tail task where the test label distributions are unknown and arbitrarily imbalanced. We argue that the variation in these distributions can be broken down hierarchically into global and local levels. The global ones reflect a broad range of diversity, while the local ones typically arise from milder changes, often focused… ▽ More This paper explores test-agnostic long-tail recognition, a challenging long-tail task where the test label distributions are unknown and arbitrarily imbalanced. We argue that the variation in these distributions can be broken down hierarchically into global and local levels. The global ones reflect a broad range of diversity, while the local ones typically arise from milder changes, often focused on a particular neighbor. Traditional methods predominantly use a Mixture-of-Expert (MoE) approach, targeting a few fixed test label distributions that exhibit substantial global variations. However, the local variations are left unconsidered. To address this issue, we propose a new MoE strategy, $\mathsf{DirMixE}$, which assigns experts to different Dirichlet meta-distributions of the label distribution, each targeting a specific aspect of local variations. Additionally, the diversity among these Dirichlet meta-distributions inherently captures global variations. This dual-level approach also leads to a more stable objective function, allowing us to sample different test distributions better to quantify the mean and variance of performance outcomes. Theoretically, we show that our proposed objective benefits from enhanced generalization by virtue of the variance-based regularization. Comprehensive experiments across multiple benchmarks confirm the effectiveness of $\mathsf{DirMixE}$. The code is available at \url{https://github.com/scongl/DirMixE}. △ Less

Submitted 13 May, 2024; originally announced May 2024.

arXiv:2405.04867 [pdf, other]

MIPI 2024 Challenge on Demosaic for HybridEVS Camera: Methods and Results

Authors: Yaqi Wu, Zhihao Fan, Xiaofeng Chu, Jimmy S. Ren, Xiaoming Li, Zongsheng Yue, Chongyi Li, Shangcheng Zhou, Ruicheng Feng, Yuekun Dai, Peiqing Yang, Chen Change Loy, Senyan Xu, Zhijing Sun, Jiaying Zhu, Yurui Zhu, Xueyang Fu, Zheng-Jun Zha, Jun Cao, Cheng Li, Shu Chen, Liang Ma, Shiyang Zhou, Haijin Zeng, Kai Feng , et al. (24 additional authors not shown)

Abstract: The increasing demand for computational photography and imaging on mobile platforms has led to the widespread development and integration of advanced image sensors with novel algorithms in camera systems. However, the scarcity of high-quality data for research and the rare opportunity for in-depth exchange of views from industry and academia constrain the development of mobile intelligent photogra… ▽ More The increasing demand for computational photography and imaging on mobile platforms has led to the widespread development and integration of advanced image sensors with novel algorithms in camera systems. However, the scarcity of high-quality data for research and the rare opportunity for in-depth exchange of views from industry and academia constrain the development of mobile intelligent photography and imaging (MIPI). Building on the achievements of the previous MIPI Workshops held at ECCV 2022 and CVPR 2023, we introduce our third MIPI challenge including three tracks focusing on novel image sensors and imaging algorithms. In this paper, we summarize and review the Nighttime Flare Removal track on MIPI 2024. In total, 170 participants were successfully registered, and 14 teams submitted results in the final testing phase. The developed solutions in this challenge achieved state-of-the-art performance on Nighttime Flare Removal. More details of this challenge and the link to the dataset can be found at https://mipi-challenge.org/MIPI2024/. △ Less

Submitted 8 May, 2024; originally announced May 2024.

Comments: MIPI@CVPR2024. Website: https://mipi-challenge.org/MIPI2024/

arXiv:2404.17729 [pdf, other]

CoMM: Collaborative Multi-Agent, Multi-Reasoning-Path Prompting for Complex Problem Solving

Authors: Pei Chen, Boran Han, Shuai Zhang

Abstract: Large Language Models (LLMs) have shown great ability in solving traditional natural language tasks and elementary reasoning tasks with appropriate prompting techniques. However, their ability is still limited in solving complicated science problems. In this work, we aim to push the upper bound of the reasoning capability of LLMs by proposing a collaborative multi-agent, multi-reasoning-path (CoMM… ▽ More Large Language Models (LLMs) have shown great ability in solving traditional natural language tasks and elementary reasoning tasks with appropriate prompting techniques. However, their ability is still limited in solving complicated science problems. In this work, we aim to push the upper bound of the reasoning capability of LLMs by proposing a collaborative multi-agent, multi-reasoning-path (CoMM) prompting framework. Specifically, we prompt LLMs to play different roles in a problem-solving team, and encourage different role-play agents to collaboratively solve the target task. In particular, we discover that applying different reasoning paths for different roles is an effective strategy to implement few-shot prompting approaches in the multi-agent scenarios. Empirical results demonstrate the effectiveness of the proposed methods on two college-level science problems over competitive baselines. Our further analysis shows the necessity of prompting LLMs to play different roles or experts independently. We release the code at: https://github.com/amazon-science/comm-prompt △ Less

Submitted 26 April, 2024; originally announced April 2024.

Comments: Accepted to NAACL 2024

arXiv:2404.16880 [pdf, other]

Atomas: Hierarchical Alignment on Molecule-Text for Unified Molecule Understanding and Generation

Authors: Yikun Zhang, Geyan Ye, Chaohao Yuan, Bo Han, Long-Kai Huang, Jianhua Yao, Wei Liu, Yu Rong

Abstract: Molecule-and-text cross-modal representation learning has emerged as a promising direction for enhancing the quality of molecular representation, thereby improving performance in various scientific fields, including drug discovery and materials science. Existing studies adopt a global alignment approach to learn the knowledge from different modalities. These global alignment approaches fail to cap… ▽ More Molecule-and-text cross-modal representation learning has emerged as a promising direction for enhancing the quality of molecular representation, thereby improving performance in various scientific fields, including drug discovery and materials science. Existing studies adopt a global alignment approach to learn the knowledge from different modalities. These global alignment approaches fail to capture fine-grained information, such as molecular fragments and their corresponding textual description, which is crucial for downstream tasks. Furthermore, it is incapable to model such information using a similar global alignment strategy due to data scarcity of paired local part annotated data from existing datasets. In this paper, we propose Atomas, a multi-modal molecular representation learning framework to jointly learn representations from SMILES string and text. We design a Hierarchical Adaptive Alignment model to concurrently learn the fine-grained fragment correspondence between two modalities and align these representations of fragments in three levels. Additionally, Atomas's end-to-end training framework incorporates the tasks of understanding and generating molecule, thereby supporting a wider range of downstream tasks. In the retrieval task, Atomas exhibits robust generalization ability and outperforms the baseline by 30.8% of recall@1 on average. In the generation task, Atomas achieves state-of-the-art results in both molecule captioning task and molecule generation task. Moreover, the visualization of the Hierarchical Adaptive Alignment model further confirms the chemical significance of our approach. Our codes can be found at https://anonymous.4open.science/r/Atomas-03C3. △ Less

Submitted 23 April, 2024; originally announced April 2024.

arXiv:2404.16484 [pdf, other]

Real-Time 4K Super-Resolution of Compressed AVIF Images. AIS 2024 Challenge Survey

Authors: Marcos V. Conde, Zhijun Lei, Wen Li, Cosmin Stejerean, Ioannis Katsavounidis, Radu Timofte, Kihwan Yoon, Ganzorig Gankhuyag, Jiangtao Lv, Long Sun, Jinshan Pan, Jiangxin Dong, Jinhui Tang, Zhiyuan Li, Hao Wei, Chenyang Ge, Dongyang Zhang, Tianle Liu, Huaian Chen, Yi Jin, Menghan Zhou, Yiqiang Yan, Si Gao, Biao Wu, Shaoli Liu , et al. (50 additional authors not shown)

Abstract: This paper introduces a novel benchmark as part of the AIS 2024 Real-Time Image Super-Resolution (RTSR) Challenge, which aims to upscale compressed images from 540p to 4K resolution (4x factor) in real-time on commercial GPUs. For this, we use a diverse test set containing a variety of 4K images ranging from digital art to gaming and photography. The images are compressed using the modern AVIF cod… ▽ More This paper introduces a novel benchmark as part of the AIS 2024 Real-Time Image Super-Resolution (RTSR) Challenge, which aims to upscale compressed images from 540p to 4K resolution (4x factor) in real-time on commercial GPUs. For this, we use a diverse test set containing a variety of 4K images ranging from digital art to gaming and photography. The images are compressed using the modern AVIF codec, instead of JPEG. All the proposed methods improve PSNR fidelity over Lanczos interpolation, and process images under 10ms. Out of the 160 participants, 25 teams submitted their code and models. The solutions present novel designs tailored for memory-efficiency and runtime on edge devices. This survey describes the best solutions for real-time SR of compressed high-resolution images. △ Less

Submitted 25 April, 2024; originally announced April 2024.

Comments: CVPR 2024, AI for Streaming (AIS) Workshop

arXiv:2404.12886 [pdf, other]

MCM: Multi-condition Motion Synthesis Framework

Authors: Zeyu Ling, Bo Han, Yongkang Wongkan, Han Lin, Mohan Kankanhalli, Weidong Geng

Abstract: Conditional human motion synthesis (HMS) aims to generate human motion sequences that conform to specific conditions. Text and audio represent the two predominant modalities employed as HMS control conditions. While existing research has primarily focused on single conditions, the multi-condition human motion synthesis remains underexplored. In this study, we propose a multi-condition HMS framewor… ▽ More Conditional human motion synthesis (HMS) aims to generate human motion sequences that conform to specific conditions. Text and audio represent the two predominant modalities employed as HMS control conditions. While existing research has primarily focused on single conditions, the multi-condition human motion synthesis remains underexplored. In this study, we propose a multi-condition HMS framework, termed MCM, based on a dual-branch structure composed of a main branch and a control branch. This framework effectively extends the applicability of the diffusion model, which is initially predicated solely on textual conditions, to auditory conditions. This extension encompasses both music-to-dance and co-speech HMS while preserving the intrinsic quality of motion and the capabilities for semantic association inherent in the original model. Furthermore, we propose the implementation of a Transformer-based diffusion model, designated as MWNet, as the main branch. This model adeptly apprehends the spatial intricacies and inter-joint correlations inherent in motion sequences, facilitated by the integration of multi-wise self-attention modules. Extensive experiments show that our method achieves competitive results in single-condition and multi-condition HMS tasks. △ Less

Submitted 19 April, 2024; originally announced April 2024.

Journal ref: International Joint Conference on Artificial Intelligence 2024

arXiv:2404.10343 [pdf, other]

The Ninth NTIRE 2024 Efficient Super-Resolution Challenge Report

Authors: Bin Ren, Yawei Li, Nancy Mehta, Radu Timofte, Hongyuan Yu, Cheng Wan, Yuxin Hong, Bingnan Han, Zhuoyuan Wu, Yajun Zou, Yuqing Liu, Jizhe Li, Keji He, Chao Fan, Heng Zhang, Xiaolin Zhang, Xuanwu Yin, Kunlong Zuo, Bohao Liao, Peizhe Xia, Long Peng, Zhibo Du, Xin Di, Wangkai Li, Yang Wang , et al. (109 additional authors not shown)

Abstract: This paper provides a comprehensive review of the NTIRE 2024 challenge, focusing on efficient single-image super-resolution (ESR) solutions and their outcomes. The task of this challenge is to super-resolve an input image with a magnification factor of x4 based on pairs of low and corresponding high-resolution images. The primary objective is to develop networks that optimize various aspects such… ▽ More This paper provides a comprehensive review of the NTIRE 2024 challenge, focusing on efficient single-image super-resolution (ESR) solutions and their outcomes. The task of this challenge is to super-resolve an input image with a magnification factor of x4 based on pairs of low and corresponding high-resolution images. The primary objective is to develop networks that optimize various aspects such as runtime, parameters, and FLOPs, while still maintaining a peak signal-to-noise ratio (PSNR) of approximately 26.90 dB on the DIV2K_LSDIR_valid dataset and 26.99 dB on the DIV2K_LSDIR_test dataset. In addition, this challenge has 4 tracks including the main track (overall performance), sub-track 1 (runtime), sub-track 2 (FLOPs), and sub-track 3 (parameters). In the main track, all three metrics (ie runtime, FLOPs, and parameter count) were considered. The ranking of the main track is calculated based on a weighted sum-up of the scores of all other sub-tracks. In sub-track 1, the practical runtime performance of the submissions was evaluated, and the corresponding score was used to determine the ranking. In sub-track 2, the number of FLOPs was considered. The score calculated based on the corresponding FLOPs was used to determine the ranking. In sub-track 3, the number of parameters was considered. The score calculated based on the corresponding parameters was used to determine the ranking. RLFN is set as the baseline for efficiency measurement. The challenge had 262 registered participants, and 34 teams made valid submissions. They gauge the state-of-the-art in efficient single-image super-resolution. To facilitate the reproducibility of the challenge and enable other researchers to build upon these findings, the code and the pre-trained model of validated solutions are made publicly available at https://github.com/Amazingren/NTIRE2024_ESR/. △ Less

Submitted 25 June, 2024; v1 submitted 16 April, 2024; originally announced April 2024.

Comments: The report paper of NTIRE2024 Efficient Super-resolution, accepted by CVPRW2024

arXiv:2404.09960 [pdf, other]

Pseudo P-values for Assessing Covariate Balance in a Finite Study Population with Application to the California Sugar Sweetened Beverage Tax Study

Authors: Bing Han, Margo A. Sidell

Abstract: Assessing covariate balance (CB) is a common practice in various types of evaluation studies. Two-sample descriptive statistics, such as the standardized mean difference, have been widely applied in the scientific literature to assess the goodness of CB. Studies in health policy, health services research, built and social environment research, and many other fields often involve a finite number of… ▽ More Assessing covariate balance (CB) is a common practice in various types of evaluation studies. Two-sample descriptive statistics, such as the standardized mean difference, have been widely applied in the scientific literature to assess the goodness of CB. Studies in health policy, health services research, built and social environment research, and many other fields often involve a finite number of units that may be subject to different treatment levels. Our case study, the California Sugar Sweetened Beverage (SSB) Tax Study, include 332 study cities in the state of California, among which individual cities may elect to levy a city-wide excise tax on SSB sales. Evaluating the balance of covariates between study cities with and without the tax policy is essential for assessing the effects of the policy on health outcomes of interest. In this paper, we introduce the novel concepts of the pseudo p-value and the standardized pseudo p-value, which are descriptive statistics to assess the overall goodness of CB between study arms in a finite study population. While not meant as a hypothesis test, the pseudo p-values bear superficial similarity to the classic p-value, which makes them easy to apply and interpret in applications. We discuss some theoretical properties of the pseudo p-values and present an algorithm to calculate them. We report a numerical simulation study to demonstrate their performance. We apply the pseudo p-values to the California SSB Tax study to assess the balance of city-level characteristics between the two study arms. △ Less

Submitted 15 April, 2024; originally announced April 2024.

Comments: 26 pages in total, 2 figures, 6 tables

arXiv:2404.09790 [pdf, other]

NTIRE 2024 Challenge on Image Super-Resolution ($\times$4): Methods and Results

Authors: Zheng Chen, Zongwei Wu, Eduard Zamfir, Kai Zhang, Yulun Zhang, Radu Timofte, Xiaokang Yang, Hongyuan Yu, Cheng Wan, Yuxin Hong, Zhijuan Huang, Yajun Zou, Yuan Huang, Jiamin Lin, Bingnan Han, Xianyu Guan, Yongsheng Yu, Daoan Zhang, Xuanwu Yin, Kunlong Zuo, Jinhua Hao, Kai Zhao, Kun Yuan, Ming Sun, Chao Zhou , et al. (63 additional authors not shown)

Abstract: This paper reviews the NTIRE 2024 challenge on image super-resolution ($\times$4), highlighting the solutions proposed and the outcomes obtained. The challenge involves generating corresponding high-resolution (HR) images, magnified by a factor of four, from low-resolution (LR) inputs using prior information. The LR images originate from bicubic downsampling degradation. The aim of the challenge i… ▽ More This paper reviews the NTIRE 2024 challenge on image super-resolution ($\times$4), highlighting the solutions proposed and the outcomes obtained. The challenge involves generating corresponding high-resolution (HR) images, magnified by a factor of four, from low-resolution (LR) inputs using prior information. The LR images originate from bicubic downsampling degradation. The aim of the challenge is to obtain designs/solutions with the most advanced SR performance, with no constraints on computational resources (e.g., model size and FLOPs) or training data. The track of this challenge assesses performance with the PSNR metric on the DIV2K testing dataset. The competition attracted 199 registrants, with 20 teams submitting valid entries. This collective endeavour not only pushes the boundaries of performance in single-image SR but also offers a comprehensive overview of current trends in this field. △ Less

Submitted 15 April, 2024; originally announced April 2024.

Comments: NTIRE 2024 webpage: https://cvlai.net/ntire/2024. Code: https://github.com/zhengchen1999/NTIRE2024_ImageSR_x4

Showing 1–50 of 675 results for author: Han, B