-
GridPE: Unifying Positional Encoding in Transformers with a Grid Cell-Inspired Framework
Authors:
Boyang Li,
Yulin Wu,
Nuoxian Huang
Abstract:
Understanding spatial location and relationships is a fundamental capability for modern artificial intelligence systems. Insights from human spatial cognition provide valuable guidance in this domain. Recent neuroscientific discoveries have highlighted the role of grid cells as a fundamental neural component for spatial representation, including distance computation, path integration, and scale discernment. In this paper, we introduce a novel positional encoding scheme inspired by Fourier analysis and the latest findings in computational neuroscience regarding grid cells. Assuming that grid cells encode spatial position through a summation of Fourier basis functions, we demonstrate the translational invariance of the grid representation during inner product calculations. Additionally, we derive an optimal grid scale ratio for multi-dimensional Euclidean spaces based on principles of biological efficiency. Utilizing these computational principles, we have developed a **Grid**-cell inspired **Positional Encoding** technique, termed **GridPE**, for encoding locations within high-dimensional spaces. We integrated GridPE into the Pyramid Vision Transformer architecture. Our theoretical analysis shows that GridPE provides a unifying framework for positional encoding in arbitrary high-dimensional spaces. Experimental results demonstrate that GridPE significantly enhances the performance of transformers, underscoring the importance of incorporating neuroscientific insights into the design of artificial intelligence systems.
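The translational-invariance claim in the abstract can be illustrated with a minimal sketch. It assumes each position is encoded as a vector of complex Fourier features exp(i k·x); the wave vectors below are random placeholders, not the grid scales derived in the paper, and the function names are hypothetical.

```python
import numpy as np

def grid_encode(x, wave_vectors):
    """Encode position x (shape (d,)) as complex Fourier features exp(i k_j . x)."""
    return np.exp(1j * (wave_vectors @ x))

def grid_similarity(x, y, wave_vectors):
    # np.vdot conjugates its first argument, so this equals
    # sum_j exp(i k_j . (y - x)), which depends only on the offset y - x.
    return np.vdot(grid_encode(x, wave_vectors), grid_encode(y, wave_vectors))

rng = np.random.default_rng(0)
wave_vectors = rng.normal(size=(16, 2))  # assumption: 16 random 2-D wave vectors
x, y, shift = rng.normal(size=2), rng.normal(size=2), rng.normal(size=2)

before = grid_similarity(x, y, wave_vectors)
after = grid_similarity(x + shift, y + shift, wave_vectors)
```

Shifting both positions by the same vector leaves the inner product unchanged up to floating-point error, which is the property a relative positional encoding needs.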
Submitted 11 June, 2024;
originally announced June 2024.
-
$\textit{S}^3$Gaussian: Self-Supervised Street Gaussians for Autonomous Driving
Authors:
Nan Huang,
Xiaobao Wei,
Wenzhao Zheng,
Pengju An,
Ming Lu,
Wei Zhan,
Masayoshi Tomizuka,
Kurt Keutzer,
Shanghang Zhang
Abstract:
Photorealistic 3D reconstruction of street scenes is a critical technique for developing real-world simulators for autonomous driving. Despite the efficacy of Neural Radiance Fields (NeRF) for driving scenes, 3D Gaussian Splatting (3DGS) emerges as a promising direction due to its faster speed and more explicit representation. However, most existing street 3DGS methods require tracked 3D vehicle bounding boxes to decompose the static and dynamic elements for effective reconstruction, limiting their applications for in-the-wild scenarios. To facilitate efficient 3D scene reconstruction without costly annotations, we propose a self-supervised street Gaussian ($\textit{S}^3$Gaussian) method to decompose dynamic and static elements from 4D consistency. We represent each scene with 3D Gaussians to preserve the explicitness and further accompany them with a spatial-temporal field network to compactly model the 4D dynamics. We conduct extensive experiments on the challenging Waymo-Open dataset to evaluate the effectiveness of our method. Our $\textit{S}^3$Gaussian demonstrates the ability to decompose static and dynamic scenes and achieves the best performance without using 3D annotations. Code is available at: https://github.com/nnanhuang/S3Gaussian/.
Submitted 30 May, 2024;
originally announced May 2024.
-
Medformer: A Multi-Granularity Patching Transformer for Medical Time-Series Classification
Authors:
Yihe Wang,
Nan Huang,
Taida Li,
Yujun Yan,
Xiang Zhang
Abstract:
Medical time series data, such as Electroencephalography (EEG) and Electrocardiography (ECG), play a crucial role in healthcare, for example in diagnosing brain and heart diseases. Existing methods for medical time series classification primarily rely on handcrafted biomarker extraction and CNN-based models, with limited exploration of transformers tailored for medical time series. In this paper, we introduce Medformer, a multi-granularity patching transformer tailored specifically for medical time series classification. Our method incorporates three novel mechanisms to leverage the unique characteristics of medical time series: cross-channel patching to leverage inter-channel correlations, multi-granularity embedding for capturing features at different scales, and two-stage (intra- and inter-granularity) multi-granularity self-attention for learning features and correlations within and among granularities. We conduct extensive experiments on five public datasets under both subject-dependent and challenging subject-independent setups. Results demonstrate Medformer's superiority over 10 baselines, achieving top averaged ranking across five datasets on all six evaluation metrics. These findings underscore the significant impact of our method on healthcare applications, such as diagnosing Myocardial Infarction, Alzheimer's, and Parkinson's disease. We release the source code at https://github.com/DL4mHealth/Medformer.
Submitted 24 May, 2024;
originally announced May 2024.
-
UnitNorm: Rethinking Normalization for Transformers in Time Series
Authors:
Nan Huang,
Christian Kümmerle,
Xiang Zhang
Abstract:
Normalization techniques are crucial for enhancing Transformer models' performance and stability in time series analysis tasks, yet traditional methods like batch and layer normalization often lead to issues such as token shift, attention shift, and sparse attention. We propose UnitNorm, a novel approach that scales input vectors by their norms and modulates attention patterns, effectively circumventing these challenges. Grounded in existing normalization frameworks, UnitNorm's effectiveness is demonstrated across diverse time series analysis tasks, including forecasting, classification, and anomaly detection, via a rigorous evaluation on 6 state-of-the-art models and 10 datasets. Notably, UnitNorm shows superior performance, especially in scenarios requiring robust attention mechanisms and contextual comprehension, as evidenced by improvements of up to a 1.46 decrease in MSE for forecasting and a 4.89% increase in accuracy for classification. This work not only calls for a reevaluation of normalization strategies in time series Transformers but also sets a new direction for enhancing model performance and stability. The source code is available at https://anonymous.4open.science/r/UnitNorm-5B84.
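The idea of scaling input vectors by their norms can be sketched as follows. This is only an illustration under stated assumptions: the function name, the `eps` guard, and the `scale` factor are hypothetical and are not taken from the UnitNorm source code, and the attention-modulation part of the method is not shown.

```python
import numpy as np

def unit_normalize(tokens, scale=1.0, eps=1e-8):
    """Divide each token vector (last axis) by its L2 norm.

    `scale` and `eps` are illustrative assumptions, not parameters
    confirmed by the abstract.
    """
    norms = np.linalg.norm(tokens, axis=-1, keepdims=True)
    return scale * tokens / (norms + eps)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))      # 4 tokens with 8 features each
normalized = unit_normalize(tokens)
row_norms = np.linalg.norm(normalized, axis=-1)
```

Unlike layer normalization, this sketch does not subtract the mean, so the direction of each token vector is preserved and only its length changes.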
Submitted 24 May, 2024;
originally announced May 2024.
-
Learning functions on symmetric matrices and point clouds via lightweight invariant features
Authors:
Ben Blum-Smith,
Ningyuan Huang,
Marco Cuturi,
Soledad Villar
Abstract:
In this work, we present a mathematical formulation for machine learning of (1) functions on symmetric matrices that are invariant with respect to the action of permutations by conjugation, and (2) functions on point clouds that are invariant with respect to rotations, reflections, and permutations of the points. To achieve this, we construct $O(n^2)$ invariant features derived from generators for the field of rational functions on $n\times n$ symmetric matrices that are invariant under joint permutations of rows and columns. We show that these invariant features can separate all distinct orbits of symmetric matrices except for a measure zero set; such features can be used to universally approximate invariant functions on almost all weighted graphs. For point clouds in a fixed dimension, we prove that the number of invariant features can be reduced, generically without losing expressivity, to $O(n)$, where $n$ is the number of points. We combine these invariant features with DeepSets to learn functions on symmetric matrices and point clouds with varying sizes. We empirically demonstrate the feasibility of our approach on molecule property regression and point cloud distance prediction.
Submitted 15 May, 2024; v1 submitted 13 May, 2024;
originally announced May 2024.
-
Salient Object Detection From Arbitrary Modalities
Authors:
Nianchang Huang,
Yang Yang,
Ruida Xi,
Qiang Zhang,
Jungong Han,
Jin Huang
Abstract:
Toward desirable saliency prediction, the types and numbers of inputs for a salient object detection (SOD) algorithm may change dynamically in many real-life applications. However, existing SOD algorithms are mainly designed or trained for one particular type of input and fail to generalize to other types. Consequently, more types of SOD algorithms need to be prepared in advance to handle different types of inputs, raising huge hardware and research costs. In contrast, in this paper we propose a new type of SOD task, termed Arbitrary Modality SOD (AM SOD). The most prominent characteristics of AM SOD are that the modality types and modality numbers are arbitrary or change dynamically. The former means that the inputs to the AM SOD algorithm may be arbitrary modalities such as RGB, depth, or any combination of them. The latter indicates that the inputs may have arbitrary modality numbers as the input type changes, e.g., single-modality RGB images, dual-modality RGB-Depth (RGB-D) images, or triple-modality RGB-Depth-Thermal (RGB-D-T) images. Accordingly, a preliminary solution to the above challenges, i.e., a modality switch network (MSN), is proposed in this paper. In particular, a modality switch feature extractor (MSFE) is first designed to extract discriminative features from each modality effectively by introducing some modality indicators, which generate weights for modality switching. Subsequently, a dynamic fusion module (DFM) is proposed to adaptively fuse features from a variable number of modalities based on a novel Transformer structure. Finally, a new dataset, named AM-XD, is constructed to facilitate research on AM SOD. Extensive experiments demonstrate that our AM SOD method can effectively cope with changes in the type and number of input modalities for robust salient object detection.
Submitted 9 May, 2024; v1 submitted 6 May, 2024;
originally announced May 2024.
-
Modality Prompts for Arbitrary Modality Salient Object Detection
Authors:
Nianchang Huang,
Yang Yang,
Qiang Zhang,
Jungong Han,
Jin Huang
Abstract:
This paper delves into the task of arbitrary modality salient object detection (AM SOD), aiming to detect salient objects from arbitrary modalities, e.g., RGB images, RGB-D images, and RGB-D-T images. A novel modality-adaptive Transformer (MAT) will be proposed to investigate two fundamental challenges of AM SOD, i.e., more diverse modality discrepancies caused by varying modality types that need to be processed, and dynamic fusion design caused by an uncertain number of modalities present in the inputs of multimodal fusion strategy. Specifically, inspired by prompt learning's ability to align the distributions of pre-trained models to the characteristics of downstream tasks by learning some prompts, MAT will first present a modality-adaptive feature extractor (MAFE) to tackle the diverse modality discrepancies by introducing a modality prompt for each modality. In the training stage, a new modality translation contractive (MTC) loss will be further designed to assist MAFE in learning those modality-distinguishable modality prompts. Accordingly, in the testing stage, MAFE can employ those learned modality prompts to adaptively adjust its feature space according to the characteristics of the input modalities, thus being able to extract discriminative unimodal features. Then, MAFE will present a channel-wise and spatial-wise fusion hybrid (CSFH) strategy to meet the demand for dynamic fusion. For that, CSFH dedicates a channel-wise dynamic fusion module (CDFM) and a novel spatial-wise dynamic fusion module (SDFM) to fuse the unimodal features from varying numbers of modalities and meanwhile effectively capture cross-modal complementary semantic and detail information, respectively. Moreover, CSFH will carefully align CDFM and SDFM to different levels of unimodal features based on their characteristics for more effective complementary information exploitation.
Submitted 6 May, 2024;
originally announced May 2024.
-
An inexact augmented Lagrangian algorithm for unsymmetric saddle-point systems
Authors:
N. Huang,
Y. -H. Dai,
D. Orban,
M. A. Saunders
Abstract:
Augmented Lagrangian (AL) methods are a well known class of algorithms for solving constrained optimization problems. They have been extended to the solution of saddle-point systems of linear equations. We study an AL (SPAL) algorithm for unsymmetric saddle-point systems and derive convergence and semi-convergence properties, even when the system is singular. At each step, our SPAL requires the exact solution of a linear system of the same size but with an SPD (2,2) block. To improve efficiency, we introduce an inexact SPAL algorithm. We establish its convergence properties under reasonable assumptions. Specifically, we use a gradient method, known as the Barzilai-Borwein (BB) method, to solve the linear system at each iteration. We call the result the augmented Lagrangian BB (SPALBB) algorithm and study its convergence. Numerical experiments on test problems from Navier-Stokes equations and coupled Stokes-Darcy flow show that SPALBB is more robust and efficient than BICGSTAB and GMRES. SPALBB often requires the least CPU time, especially on large systems.
Submitted 22 April, 2024;
originally announced April 2024.
-
PlanGPT: Enhancing Urban Planning with Tailored Language Model and Efficient Retrieval
Authors:
He Zhu,
Wenjia Zhang,
Nuoxian Huang,
Boyang Li,
Luyao Niu,
Zipei Fan,
Tianle Lun,
Yicheng Tao,
Junyou Su,
Zhaoya Gong,
Chenyu Fang,
Xing Liu
Abstract:
In the field of urban planning, general-purpose large language models often struggle to meet the specific needs of planners. Tasks like generating urban planning texts, retrieving related information, and evaluating planning documents pose unique challenges. To enhance the efficiency of urban professionals and overcome these obstacles, we introduce PlanGPT, the first specialized Large Language Model tailored for urban and spatial planning. Developed through collaborative efforts with institutions like the Chinese Academy of Urban Planning, PlanGPT leverages a customized local database retrieval framework, domain-specific fine-tuning of base models, and advanced tooling capabilities. Empirical tests demonstrate that PlanGPT has achieved advanced performance, delivering responses of superior quality precisely tailored to the intricacies of urban planning.
Submitted 29 February, 2024;
originally announced February 2024.
-
Bias in Opinion Summarisation from Pre-training to Adaptation: A Case Study in Political Bias
Authors:
Nannan Huang,
Haytham Fayek,
Xiuzhen Zhang
Abstract:
Opinion summarisation aims to summarise the salient information and opinions presented in documents such as product reviews, discussion forums, and social media texts into short summaries that enable users to effectively understand the opinions therein. Generating biased summaries has the risk of potentially swaying public opinion. Previous studies focused on studying bias in opinion summarisation using extractive models, but limited research has paid attention to abstractive summarisation models. In this study, using political bias as a case study, we first establish a methodology to quantify bias in abstractive models, then trace it from the pre-trained models to the task of summarising social media opinions using different models and adaptation methods. We find that most models exhibit intrinsic bias. Using a social media text summarisation dataset and contrasting various adaptation methods, we find that tuning a smaller number of parameters is less biased compared to standard fine-tuning; however, the diversity of topics in training data used for fine-tuning is critical.
Submitted 31 January, 2024;
originally announced February 2024.
-
CreativeSynth: Creative Blending and Synthesis of Visual Arts based on Multimodal Diffusion
Authors:
Nisha Huang,
Weiming Dong,
Yuxin Zhang,
Fan Tang,
Ronghui Li,
Chongyang Ma,
Xiu Li,
Changsheng Xu
Abstract:
Large-scale text-to-image generative models have made impressive strides, showcasing their ability to synthesize a vast array of high-quality images. However, adapting these models for artistic image editing presents two significant challenges. Firstly, users struggle to craft textual prompts that meticulously detail visual elements of the input image. Secondly, prevalent models, when effecting modifications in specific zones, frequently disrupt the overall artistic style, complicating the attainment of cohesive and aesthetically unified artworks. To surmount these obstacles, we build the innovative unified framework CreativeSynth, which is based on a diffusion model with the ability to coordinate multimodal inputs and multitask in the field of artistic image generation. By integrating multimodal features with customized attention mechanisms, CreativeSynth facilitates the importation of real-world semantic content into the domain of art through inversion and real-time style transfer. This allows for the precise manipulation of image style and content while maintaining the integrity of the original model parameters. Rigorous qualitative and quantitative evaluations underscore that CreativeSynth excels in enhancing artistic images' fidelity and preserves their innate aesthetic essence. By bridging the gap between generative models and artistic finesse, CreativeSynth becomes a custom digital palette.
Submitted 30 January, 2024; v1 submitted 25 January, 2024;
originally announced January 2024.
-
PPO-Clip Attains Global Optimality: Towards Deeper Understandings of Clipping
Authors:
Nai-Chieh Huang,
Ping-Chun Hsieh,
Kuo-Hao Ho,
I-Chen Wu
Abstract:
Proximal Policy Optimization algorithm employing a clipped surrogate objective (PPO-Clip) is a prominent exemplar of the policy optimization methods. However, despite its remarkable empirical success, PPO-Clip lacks theoretical substantiation to date. In this paper, we contribute to the field by establishing the first global convergence results of a PPO-Clip variant in both tabular and neural function approximation settings. Our findings highlight the $O(1/\sqrt{T})$ min-iterate convergence rate specifically in the context of neural function approximation. We tackle the inherent challenges in analyzing PPO-Clip through three central concepts: (i) We introduce a generalized version of the PPO-Clip objective, illuminated by its connection with the hinge loss. (ii) Employing entropic mirror descent, we establish asymptotic convergence for tabular PPO-Clip with direct policy parameterization. (iii) Inspired by the tabular analysis, we streamline convergence analysis by introducing a two-step policy improvement approach. This decouples policy search from complex neural policy parameterization using a regression-based update scheme. Furthermore, we gain deeper insights into the efficacy of PPO-Clip by interpreting these generalized objectives. Our theoretical findings also mark the first characterization of the influence of the clipping mechanism on PPO-Clip convergence. Importantly, the clipping range affects only the pre-constant of the convergence rate.
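For reference, the standard clipped surrogate objective that PPO-Clip refers to can be sketched as below. This shows only the widely used per-sample form min(r·A, clip(r, 1-ε, 1+ε)·A); it is not the generalized objective or the analysis introduced in the paper, and the function and argument names are illustrative.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, clip_eps=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return np.minimum(ratio * advantage, clipped_ratio * advantage)

# Probability ratios pi_new / pi_old and advantage estimates for three samples.
ratios = np.array([1.5, 0.5, 1.0])
advantages = np.array([1.0, -1.0, 2.0])
values = ppo_clip_objective(ratios, advantages)
```

With a clipping range of 0.2, the first sample is capped at 1.2 times its advantage and the second at 0.8 times its (negative) advantage, so a policy gains nothing by moving the ratio further outside the clipping range.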
Submitted 19 February, 2024; v1 submitted 19 December, 2023;
originally announced December 2023.
-
Customize-It-3D: High-Quality 3D Creation from A Single Image Using Subject-Specific Knowledge Prior
Authors:
Nan Huang,
Ting Zhang,
Yuhui Yuan,
Dong Chen,
Shanghang Zhang
Abstract:
In this paper, we present a novel two-stage approach that fully utilizes the information provided by the reference image to establish a customized knowledge prior for image-to-3D generation. While previous approaches primarily rely on a general diffusion prior, which struggles to yield consistent results with the reference image, we propose a subject-specific and multi-modal diffusion model. This model not only aids NeRF optimization by considering the shading mode for improved geometry but also enhances texture from the coarse results to achieve superior refinement. Both aspects contribute to faithfully aligning the 3D content with the subject. Extensive experiments showcase the superiority of our method, Customize-It-3D, outperforming previous works by a substantial margin. It produces faithful 360-degree reconstructions with impressive visual quality, making it well-suited for various applications, including text-to-3D creation.
Submitted 9 January, 2024; v1 submitted 15 December, 2023;
originally announced December 2023.
-
MotionCrafter: One-Shot Motion Customization of Diffusion Models
Authors:
Yuxin Zhang,
Fan Tang,
Nisha Huang,
Haibin Huang,
Chongyang Ma,
Weiming Dong,
Changsheng Xu
Abstract:
The essence of a video lies in its dynamic motions, including character actions, object movements, and camera movements. While text-to-video generative diffusion models have recently advanced in creating diverse contents, controlling specific motions through text prompts remains a significant challenge. A primary issue is the coupling of appearance and motion, often leading to overfitting on appearance. To tackle this challenge, we introduce MotionCrafter, a novel one-shot instance-guided motion customization method. MotionCrafter employs a parallel spatial-temporal architecture that injects the reference motion into the temporal component of the base model, while the spatial module is independently adjusted for character or style control. To enhance the disentanglement of motion and appearance, we propose an innovative dual-branch motion disentanglement approach, comprising a motion disentanglement loss and an appearance prior enhancement strategy. During training, a frozen base model provides appearance normalization, effectively separating appearance from motion and thereby preserving diversity. Comprehensive quantitative and qualitative experiments, along with user preference tests, demonstrate that MotionCrafter can successfully integrate dynamic motions while preserving the coherence and quality of the base model with a wide range of appearance generation capabilities. Project page: https://zyxelsa.github.io/homepage-motioncrafter. Codes are available at https://github.com/zyxElsa/MotionCrafter.
Submitted 2 January, 2024; v1 submitted 8 December, 2023;
originally announced December 2023.
-
FLORA: Fine-grained Low-Rank Architecture Search for Vision Transformer
Authors:
Chi-Chih Chang,
Yuan-Yao Sung,
Shixing Yu,
Ning-Chi Huang,
Diana Marculescu,
Kai-Chiang Wu
Abstract:
Vision Transformers (ViT) have recently demonstrated success across a myriad of computer vision tasks. However, their elevated computational demands pose significant challenges for real-world deployment. While low-rank approximation stands out as a renowned method to reduce computational loads, efficiently automating the target rank selection in ViT remains a challenge. Drawing from the notable similarity and alignment between the processes of rank selection and One-Shot NAS, we introduce FLORA, an end-to-end automatic framework based on NAS. To overcome the supernet design challenge posed by the vast search space, FLORA employs a low-rank aware candidate filtering strategy. This method adeptly identifies and eliminates underperforming candidates, effectively alleviating potential undertraining and interference among subnetworks. To further enhance the quality of low-rank supernets, we design a low-rank specific training paradigm. First, we propose weight inheritance to construct the supernet and enable gradient sharing among low-rank modules. Second, we adopt low-rank aware sampling to strategically allocate training resources, taking into account inherited information from pre-trained models. Empirical results underscore FLORA's efficacy. With our method, a more fine-grained rank configuration can be generated automatically and yield up to 33% extra FLOPs reduction compared to a simple uniform configuration. More specifically, FLORA-DeiT-B/FLORA-Swin-B can save up to 55%/42% FLOPs almost without performance degradation. Importantly, FLORA boasts both versatility and orthogonality, offering an extra 21%-26% FLOPs reduction when integrated with leading compression techniques or compact hybrid structures. Our code is publicly available at https://github.com/shadowpa0327/FLORA.
Submitted 7 November, 2023;
originally announced November 2023.
-
Tight approximability of MAX 2-SAT and relatives, under UGC
Authors:
Joshua Brakensiek,
Neng Huang,
Uri Zwick
Abstract:
Austrin showed that the approximation ratio $\beta \approx 0.94016567$ obtained by the MAX 2-SAT approximation algorithm of Lewin, Livnat and Zwick (LLZ) is optimal modulo the Unique Games Conjecture (UGC) and modulo a Simplicity Conjecture that states that the worst performance of the algorithm is obtained on so-called simple configurations. We prove Austrin's conjecture, thereby showing the optimality of the LLZ approximation algorithm, relying only on the Unique Games Conjecture. Our proof uses a combination of analytic and computational tools.
We also present new approximation algorithms for two restrictions of the MAX 2-SAT problem. For MAX HORN-$\{1,2\}$-SAT, i.e., MAX CSP$(\{x\lor y,\bar{x}\lor y,x,\bar{x}\})$, in which clauses are not allowed to contain two negated literals, we obtain an approximation ratio of $0.94615981$. For MAX CSP$(\{x\lor y,x,\bar{x}\})$, i.e., when 2-clauses are not allowed to contain negated literals, we obtain an approximation ratio of $0.95397990$. By adapting Austrin's and our arguments for the MAX 2-SAT problem, we show that these two approximation ratios are also tight, modulo only the UGC. This completes a full characterization of the approximability of the MAX 2-SAT problem and its restrictions.
Submitted 1 November, 2023; v1 submitted 19 October, 2023;
originally announced October 2023.
-
Accelerated Policy Gradient: On the Convergence Rates of the Nesterov Momentum for Reinforcement Learning
Authors:
Yen-Ju Chen,
Nai-Chieh Huang,
Ching-Pei Lee,
Ping-Chun Hsieh
Abstract:
Various acceleration approaches for Policy Gradient (PG) have been analyzed within the realm of Reinforcement Learning (RL). However, the theoretical understanding of the widely used momentum-based acceleration method on PG remains largely open. In response to this gap, we adapt the celebrated Nesterov's accelerated gradient (NAG) method to policy optimization in RL, termed Accelerated Policy Gradient (APG). To demonstrate the potential of APG in achieving fast convergence, we formally prove that with the true gradient and under the softmax policy parametrization, APG converges to an optimal policy at rates: (i) $\tilde{O}(1/t^2)$ with constant step sizes; (ii) $O(e^{-ct})$ with exponentially-growing step sizes. To the best of our knowledge, this is the first characterization of the convergence rates of NAG in the context of RL. Notably, our analysis relies on one interesting finding: Regardless of the parameter initialization, APG ends up entering a locally nearly-concave regime, where APG can significantly benefit from the momentum, within finite iterations. Through numerical validation and experiments on the Atari 2600 benchmarks, we confirm that APG exhibits a $\tilde{O}(1/t^2)$ rate with constant step sizes and a linear convergence rate with exponentially-growing step sizes, significantly improving convergence over the standard PG.
Submitted 6 June, 2024; v1 submitted 18 October, 2023;
originally announced October 2023.
-
Local algorithms and the failure of log-depth quantum advantage on sparse random CSPs
Authors:
Antares Chen,
Neng Huang,
Kunal Marwaha
Abstract:
We construct and analyze a message-passing algorithm for random constraint satisfaction problems (CSPs) at large clause density, generalizing work of El Alaoui, Montanari, and Sellke for Maximum Cut [arXiv:2111.06813] through a connection between random CSPs and mean-field Ising spin glasses. For CSPs with even predicates, the algorithm asymptotically solves a stochastic optimal control problem dual to an extended Parisi variational principle. This gives an optimal fraction of satisfied constraints among algorithms obstructed by the branching overlap gap property of Huang and Sellke [arXiv:2110.07847], notably including the Quantum Approximate Optimization Algorithm and all quantum circuits on a bounded-degree architecture of up to $ε\cdot \log n$ depth.
Submitted 2 October, 2023;
originally announced October 2023.
-
Approximately Equivariant Graph Networks
Authors:
Ningyuan Huang,
Ron Levie,
Soledad Villar
Abstract:
Graph neural networks (GNNs) are commonly described as being permutation equivariant with respect to node relabeling in the graph. This symmetry of GNNs is often compared to the translation equivariance of Euclidean convolution neural networks (CNNs). However, these two symmetries are fundamentally different: The translation equivariance of CNNs corresponds to symmetries of the fixed domain acting on the image signals (sometimes known as active symmetries), whereas in GNNs any permutation acts on both the graph signals and the graph domain (sometimes described as passive symmetries). In this work, we focus on the active symmetries of GNNs, by considering a learning setting where signals are supported on a fixed graph. In this case, the natural symmetries of GNNs are the automorphisms of the graph. Since real-world graphs tend to be asymmetric, we relax the notion of symmetries by formalizing approximate symmetries via graph coarsening. We present a bias-variance formula that quantifies the tradeoff between the loss in expressivity and the gain in the regularity of the learned estimator, depending on the chosen symmetry group. To illustrate our approach, we conduct extensive experiments on image inpainting, traffic flow prediction, and human pose estimation with different choices of symmetries. We show theoretically and empirically that the best generalization performance can be achieved by choosing a suitably larger group than the graph automorphism, but smaller than the permutation group.
Submitted 17 November, 2023; v1 submitted 20 August, 2023;
originally announced August 2023.
-
Examining Bias in Opinion Summarisation Through the Perspective of Opinion Diversity
Authors:
Nannan Huang,
Lin Tian,
Haytham Fayek,
Xiuzhen Zhang
Abstract:
Opinion summarisation is a task that aims to condense the information presented in the source documents while retaining the core message and opinions. A summary that only represents the majority opinions will leave the minority opinions unrepresented. In this paper, we use the stance towards a certain target as an opinion. We study bias in opinion summarisation from the perspective of opinion diversity, which measures whether the model-generated summary can cover a diverse set of opinions. In addition, we examine opinion similarity, a measure of how closely related two opinions are in terms of their stance on a given topic, and its relationship with opinion diversity. Through the lens of stances towards a topic, we examine opinion diversity and similarity using three debatable topics under COVID-19. Experimental results on these topics revealed that a higher degree of similarity of opinions did not indicate good diversity or fair coverage of the various opinions originally presented in the source documents. We found that BART and ChatGPT can better capture diverse opinions presented in the source documents.
Submitted 7 June, 2023;
originally announced June 2023.
-
Fine-grained Expressivity of Graph Neural Networks
Authors:
Jan Böker,
Ron Levie,
Ningyuan Huang,
Soledad Villar,
Christopher Morris
Abstract:
Numerous recent works have analyzed the expressive power of message-passing graph neural networks (MPNNs), primarily utilizing combinatorial techniques such as the $1$-dimensional Weisfeiler-Leman test ($1$-WL) for the graph isomorphism problem. However, the graph isomorphism objective is inherently binary, not giving insights into the degree of similarity between two given graphs. This work resolves this issue by considering continuous extensions of both $1$-WL and MPNNs to graphons. Concretely, we show that the continuous variant of $1$-WL delivers an accurate topological characterization of the expressive power of MPNNs on graphons, revealing which graphs these networks can distinguish and the level of difficulty in separating them. We identify the finest topology where MPNNs separate points and prove a universal approximation theorem. Consequently, we provide a theoretical framework for graph and graphon similarity combining various topological variants of classical characterizations of the $1$-WL. In particular, we characterize the expressive power of MPNNs in terms of the tree distance, which is a graph distance based on the concept of fractional isomorphisms, and substructure counts via tree homomorphisms, showing that these concepts have the same expressive power as the $1$-WL and MPNNs on graphons. Empirically, we validate our theoretical findings by showing that randomly initialized MPNNs, without training, exhibit competitive performance compared to their trained counterparts. Moreover, we evaluate different MPNN architectures based on their ability to preserve graph distances, highlighting the significance of our continuous $1$-WL test in understanding MPNNs' expressivity.
Submitted 2 November, 2023; v1 submitted 6 June, 2023;
originally announced June 2023.
-
ProSpect: Prompt Spectrum for Attribute-Aware Personalization of Diffusion Models
Authors:
Yuxin Zhang,
Weiming Dong,
Fan Tang,
Nisha Huang,
Haibin Huang,
Chongyang Ma,
Tong-Yee Lee,
Oliver Deussen,
Changsheng Xu
Abstract:
Personalizing generative models offers a way to guide image generation with user-provided references. Current personalization methods can invert an object or concept into the textual conditioning space and compose new natural sentences for text-to-image diffusion models. However, representing and editing specific visual attributes such as material, style, and layout remains a challenge, leading to a lack of disentanglement and editability. To address this problem, we propose a novel approach that leverages the step-by-step generation process of diffusion models, which generate images from low to high frequency information, providing a new perspective on representing, generating, and editing images. We develop the Prompt Spectrum Space P*, an expanded textual conditioning space, and a new image representation method called ProSpect. ProSpect represents an image as a collection of inverted textual token embeddings encoded from per-stage prompts, where each prompt corresponds to a specific generation stage (i.e., a group of consecutive steps) of the diffusion model. Experimental results demonstrate that P* and ProSpect offer better disentanglement and controllability compared to existing methods. We apply ProSpect in various personalized attribute-aware image generation applications, such as image-guided or text-driven manipulations of materials, style, and layout, achieving previously unattainable results from a single image input without fine-tuning the diffusion models. Our source code is available at https://github.com/zyxElsa/ProSpect.
Submitted 7 December, 2023; v1 submitted 25 May, 2023;
originally announced May 2023.
-
Contrastive State Augmentations for Reinforcement Learning-Based Recommender Systems
Authors:
Zhaochun Ren,
Na Huang,
Yidan Wang,
Pengjie Ren,
Jun Ma,
Jiahuan Lei,
Xinlei Shi,
Hengliang Luo,
Joemon M Jose,
Xin Xin
Abstract:
Learning reinforcement learning (RL)-based recommenders from historical user-item interaction sequences is vital to generate high-reward recommendations and improve long-term cumulative benefits. However, existing RL recommendation methods encounter difficulties (i) to estimate the value functions for states which are not contained in the offline training data, and (ii) to learn effective state representations from user implicit feedback due to the lack of contrastive signals. In this work, we propose contrastive state augmentations (CSA) for the training of RL-based recommender systems. To tackle the first issue, we propose four state augmentation strategies to enlarge the state space of the offline data. The proposed method improves the generalization capability of the recommender by making the RL agent visit the local state regions and ensuring the learned value functions are similar between the original and augmented states. For the second issue, we propose introducing contrastive signals between augmented states and the state randomly sampled from other sessions to improve the state representation learning further. To verify the effectiveness of the proposed CSA, we conduct extensive experiments on two publicly accessible datasets and one dataset collected from a real-life e-commerce platform. We also conduct experiments on a simulated environment as the online evaluation setting. Experimental results demonstrate that CSA can effectively improve recommendation performance.
Submitted 18 May, 2023;
originally announced May 2023.
-
Integrating Multiple Sources Knowledge for Class Asymmetry Domain Adaptation Segmentation of Remote Sensing Images
Authors:
Kuiliang Gao,
Anzhu Yu,
Xiong You,
Wenyue Guo,
Ke Li,
Ningbo Huang
Abstract:
In the existing unsupervised domain adaptation (UDA) methods for remote sensing images (RSIs) semantic segmentation, class symmetry is a widely followed ideal assumption, where the source and target RSIs have exactly the same class space. In practice, however, it is often very difficult to find a source RSI with exactly the same classes as the target RSI. More commonly, there are multiple source RSIs available. To this end, a novel class asymmetry RSIs domain adaptation method with multiple sources is proposed in this paper, which consists of four key components. Firstly, a multi-branch segmentation network is built to learn an expert for each source RSI. Secondly, a novel collaborative learning method with the cross-domain mixing strategy is proposed, to supplement the class information for each source while achieving the domain adaptation of each source-target pair. Thirdly, a pseudo-label generation strategy is proposed to effectively combine strengths of different experts, which can be flexibly applied to two cases where the source class union is equal to or includes the target class set. Fourthly, a multiview-enhanced knowledge integration module is developed for the high-level knowledge routing and transfer from multiple domains to target predictions.
Submitted 16 May, 2023;
originally announced May 2023.
-
Style-A-Video: Agile Diffusion for Arbitrary Text-based Video Style Transfer
Authors:
Nisha Huang,
Yuxin Zhang,
Weiming Dong
Abstract:
Large-scale text-to-video diffusion models have demonstrated an exceptional ability to synthesize diverse videos. However, due to the lack of extensive text-to-video datasets and the necessary computational resources for training, directly applying these models for video stylization remains difficult. Also, given that the noise addition process on the input content is random and destructive, fulfilling the style transfer task's content preservation criteria is challenging. This paper proposes a zero-shot video stylization method named Style-A-Video, which utilizes a generative pre-trained transformer with an image latent diffusion model to achieve a concise text-controlled video stylization. We improve the guidance condition in the denoising process, establishing a balance between artistic expression and structure preservation. Furthermore, to decrease inter-frame flicker and avoid the formation of additional artifacts, we employ a sampling optimization and a temporal consistency module. Extensive experiments show that we can attain superior content preservation and stylistic performance while incurring less consumption than previous solutions. Code will be available at https://github.com/haha-lisa/Style-A-Video.
Submitted 9 May, 2023;
originally announced May 2023.
-
Region-Aware Diffusion for Zero-shot Text-driven Image Editing
Authors:
Nisha Huang,
Fan Tang,
Weiming Dong,
Tong-Yee Lee,
Changsheng Xu
Abstract:
Image manipulation under the guidance of textual descriptions has recently received a broad range of attention. In this study, we focus on the regional editing of images with the guidance of given text prompts. Different from current mask-based image editing methods, we propose a novel region-aware diffusion model (RDM) for entity-level image editing, which could automatically locate the region of interest and replace it following given text prompts. To strike a balance between image fidelity and inference speed, we design the intensive diffusion pipeline by combining latent space diffusion and enhanced directional guidance. In addition, to preserve image content in non-edited regions, we introduce regional-aware entity editing to modify the region of interest and preserve the out-of-interest region. We validate the proposed RDM beyond the baseline methods through extensive qualitative and quantitative experiments. The results show that RDM outperforms the previous approaches in terms of visual quality, overall harmonization, non-editing region content preservation, and text-image semantic consistency. The codes are available at https://github.com/haha-lisa/RDM-Region-Aware-Diffusion-Model.
Submitted 23 February, 2023;
originally announced February 2023.
-
Separating MAX 2-AND, MAX DI-CUT and MAX CUT
Authors:
Joshua Brakensiek,
Neng Huang,
Aaron Potechin,
Uri Zwick
Abstract:
Assuming the Unique Games Conjecture (UGC), the best approximation ratio that can be obtained in polynomial time for the MAX CUT problem is $α_{\text{CUT}}\simeq 0.87856$, obtained by the celebrated SDP-based approximation algorithm of Goemans and Williamson. The currently best approximation algorithm for MAX DI-CUT, i.e., the MAX CUT problem in directed graphs, achieves a ratio of about $0.87401$, leaving open the question whether MAX DI-CUT can be approximated as well as MAX CUT. We obtain a slightly improved algorithm for MAX DI-CUT and a new UGC-hardness result for it, showing that $0.87446\le α_{\text{DI-CUT}}\le 0.87461$, where $α_{\text{DI-CUT}}$ is the best approximation ratio that can be obtained in polynomial time for MAX DI-CUT under UGC. The new upper bound separates MAX DI-CUT from MAX CUT, resolving a question raised by Feige and Goemans.
A natural generalization of MAX DI-CUT is the MAX 2-AND problem in which each constraint is of the form $z_1\land z_2$, where $z_1$ and $z_2$ are literals, i.e., variables or their negations. (In MAX DI-CUT, each constraint is of the form $\bar{x}_1\land x_2$, where $x_1$ and $x_2$ are variables.) Austrin separated MAX 2-AND from MAX CUT by showing that $α_{\text{2AND}} < 0.87435$ and conjectured that MAX 2-AND and MAX DI-CUT have the same approximation ratio. Our new lower bound on MAX DI-CUT refutes this conjecture, completing the separation of the three problems MAX 2-AND, MAX DI-CUT and MAX CUT. We also obtain a new lower bound for MAX 2-AND, showing that $0.87414\le α_{\text{2AND}}\le 0.87435$.
Our upper bound on MAX DI-CUT is achieved via a simple, analytical proof. The lower bounds on MAX DI-CUT and MAX 2-AND (the new approximation algorithms) use experimentally-discovered distributions of rounding functions which are then verified via computer-assisted proofs.
Submitted 12 April, 2023; v1 submitted 21 December, 2022;
originally announced December 2022.
-
Inversion-Based Style Transfer with Diffusion Models
Authors:
Yuxin Zhang,
Nisha Huang,
Fan Tang,
Haibin Huang,
Chongyang Ma,
Weiming Dong,
Changsheng Xu
Abstract:
The artistic style within a painting is the means of expression, which includes not only the painting material, colors, and brushstrokes, but also the high-level attributes including semantic elements, object shapes, etc. Previous arbitrary example-guided artistic image generation methods often fail to control shape changes or convey elements. The pre-trained text-to-image synthesis diffusion probabilistic models have achieved remarkable quality, but they often require extensive textual descriptions to accurately portray attributes of a particular painting. We believe that the uniqueness of an artwork lies precisely in the fact that it cannot be adequately explained with normal language. Our key idea is to learn artistic style directly from a single painting and then guide the synthesis without providing complex textual descriptions. Specifically, we assume style as a learnable textual description of a painting. We propose an inversion-based style transfer method (InST), which can efficiently and accurately learn the key information of an image, thus capturing and transferring the artistic style of a painting. We demonstrate the quality and efficiency of our method on numerous paintings of various artists and styles. Code and models are available at https://github.com/zyxElsa/InST.
Submitted 20 March, 2023; v1 submitted 23 November, 2022;
originally announced November 2022.
-
DiffStyler: Controllable Dual Diffusion for Text-Driven Image Stylization
Authors:
Nisha Huang,
Yuxin Zhang,
Fan Tang,
Chongyang Ma,
Haibin Huang,
Yong Zhang,
Weiming Dong,
Changsheng Xu
Abstract:
Despite the impressive results of arbitrary image-guided style transfer methods, text-driven image stylization has recently been proposed for transferring a natural image into a stylized one according to textual descriptions of the target style provided by the user. Unlike the previous image-to-image transfer approaches, the text-guided stylization process provides users with a more precise and intuitive way to express the desired style. However, the huge discrepancy between cross-modal inputs/outputs makes it challenging to conduct text-driven image stylization in a typical feed-forward CNN pipeline. In this paper, we present DiffStyler, a dual diffusion processing architecture to control the balance between the content and style of the diffused results. The cross-modal style information can be easily integrated as guidance during the diffusion process step-by-step. Furthermore, we propose a content image-based learnable noise on which the reverse denoising process is based, enabling the stylization results to better preserve the structure information of the content image. We validate the proposed DiffStyler beyond the baseline methods through extensive qualitative and quantitative experiments. Code is available at https://github.com/haha-lisa/Diffstyler.
Submitted 18 December, 2023; v1 submitted 19 November, 2022;
originally announced November 2022.
-
A Spectral Analysis of Graph Neural Networks on Dense and Sparse Graphs
Authors:
Luana Ruiz,
Ningyuan Huang,
Soledad Villar
Abstract:
In this work we propose a random graph model that can produce graphs at different levels of sparsity. We analyze how sparsity affects the graph spectra, and thus the performance of graph neural networks (GNNs) in node classification on dense and sparse graphs. We compare GNNs with spectral methods known to provide consistent estimators for community detection on dense graphs, a closely related task. We show that GNNs can outperform spectral methods on sparse graphs, and illustrate these results with numerical examples on both synthetic and real graphs.
Submitted 13 September, 2023; v1 submitted 6 November, 2022;
originally announced November 2022.
-
Deep Learning is Provably Robust to Symmetric Label Noise
Authors:
Carey E. Priebe,
Ningyuan Huang,
Soledad Villar,
Cong Mu,
Li Chen
Abstract:
Deep neural networks (DNNs) are capable of perfectly fitting the training data, including memorizing noisy data. It is commonly believed that memorization hurts generalization. Therefore, many recent works propose mitigation strategies to avoid noisy data or correct memorization. In this work, we step back and ask the question: Can deep learning be robust against massive label noise without any mitigation? We provide an affirmative answer for the case of symmetric label noise: We find that certain DNNs, including under-parameterized and over-parameterized models, can tolerate massive symmetric label noise up to the information-theoretic threshold. By appealing to classical statistical theory and universal consistency of DNNs, we prove that for multiclass classification, $L_1$-consistent DNN classifiers trained under symmetric label noise can achieve Bayes optimality asymptotically if the label noise probability is less than $\frac{K-1}{K}$, where $K \ge 2$ is the number of classes. Our results show that for symmetric label noise, no mitigation is necessary for $L_1$-consistent estimators. We conjecture that for general label noise, mitigation strategies that make use of the noisy data will outperform those that ignore the noisy data.
Submitted 26 October, 2022;
originally announced October 2022.
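The threshold $\frac{K-1}{K}$ in the entry above can be checked with a small calculation. The sketch below is an illustration rather than the authors' code, and it assumes one common definition of symmetric label noise: a label is flipped with total probability p, uniformly to one of the other K-1 classes. Under that assumption the noisy class posterior is an affine function of the clean one with slope 1 - pK/(K-1), which is positive exactly when p < (K-1)/K, so the most likely class is unchanged. The function names and the example posterior are hypothetical.

```python
def noisy_posterior(eta, p):
    """Class posterior after symmetric label noise of total flip probability p.

    eta: clean class probabilities (a list summing to 1).
    Each label is kept with probability 1 - p and otherwise moved
    uniformly to one of the other K - 1 classes (assumed noise model).
    """
    k = len(eta)
    return [(1.0 - p) * e + (p / (k - 1)) * (1.0 - e) for e in eta]


def argmax(xs):
    """Index of the largest entry."""
    return max(range(len(xs)), key=lambda i: xs[i])


if __name__ == "__main__":
    clean = [0.5, 0.3, 0.2]  # hypothetical clean posterior, K = 3
    for p in (0.6, 0.7):     # threshold (K - 1) / K is 2/3 here
        noisy = noisy_posterior(clean, p)
        print(p, argmax(noisy) == argmax(clean))
```

Below the threshold the ordering of the classes is preserved; above it the slope turns negative and the ordering reverses, so the least likely clean class becomes the most likely noisy one.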
-
Draw Your Art Dream: Diverse Digital Art Synthesis with Multimodal Guided Diffusion
Authors:
Nisha Huang,
Fan Tang,
Weiming Dong,
Changsheng Xu
Abstract:
Digital art synthesis is receiving increasing attention in the multimedia community because of engaging the public with art effectively. Current digital art synthesis methods usually use single-modality inputs as guidance, thereby limiting the expressiveness of the model and the diversity of generated results. To solve this problem, we propose the multimodal guided artwork diffusion (MGAD) model, which is a diffusion-based digital artwork generation approach that utilizes multimodal prompts as guidance to control the classifier-free diffusion model. Additionally, the contrastive language-image pretraining (CLIP) model is used to unify text and image modalities. Extensive experimental results on the quality and quantity of the generated digital art paintings confirm the effectiveness of the combination of the diffusion model and multimodal guidance. Code is available at https://github.com/haha-lisa/MGAD-multimodal-guided-artwork-diffusion.
Submitted 28 September, 2022; v1 submitted 27 September, 2022;
originally announced September 2022.
-
From Local to Global: Spectral-Inspired Graph Neural Networks
Authors:
Ningyuan Huang,
Soledad Villar,
Carey E. Priebe,
Da Zheng,
Chengyue Huang,
Lin Yang,
Vladimir Braverman
Abstract:
Graph Neural Networks (GNNs) are powerful deep learning methods for Non-Euclidean data. Popular GNNs are message-passing algorithms (MPNNs) that aggregate and combine signals in a local graph neighborhood. However, shallow MPNNs tend to miss long-range signals and perform poorly on some heterophilous graphs, while deep MPNNs can suffer from issues like over-smoothing or over-squashing. To mitigate such issues, existing works typically borrow normalization techniques from training neural networks on Euclidean data or modify the graph structures. Yet these approaches are not well-understood theoretically and could increase the overall computational complexity. In this work, we draw inspiration from spectral graph embedding and propose $\texttt{PowerEmbed}$ -- a simple layer-wise normalization technique to boost MPNNs. We show $\texttt{PowerEmbed}$ can provably express the top-$k$ leading eigenvectors of the graph operator, which prevents over-smoothing and is agnostic to the graph topology; meanwhile, it produces a list of representations ranging from local features to global signals, which avoids over-squashing. We apply $\texttt{PowerEmbed}$ in a wide range of simulated and real graphs and demonstrate its competitive performance, particularly for heterophilous graphs.
Submitted 4 November, 2022; v1 submitted 24 September, 2022;
originally announced September 2022.
-
Endowing Language Models with Multimodal Knowledge Graph Representations
Authors:
Ningyuan Huang,
Yash R. Deshpande,
Yibo Liu,
Houda Alberts,
Kyunghyun Cho,
Clara Vania,
Iacer Calixto
Abstract:
We propose a method to make natural language understanding models more parameter efficient by storing knowledge in an external knowledge graph (KG) and retrieving from this KG using a dense index. Given (possibly multilingual) downstream task data, e.g., sentences in German, we retrieve entities from the KG and use their multimodal representations to improve downstream task performance. We use the recently released VisualSem KG as our external knowledge repository, which covers a subset of Wikipedia and WordNet entities, and compare a mix of tuple-based and graph-based algorithms to learn entity and relation representations that are grounded on the KG multimodal information. We demonstrate the usefulness of the learned entity representations on two downstream tasks, and show improved performance on the multilingual named entity recognition task by $0.3\%$--$0.7\%$ F1, while we achieve up to $2.5\%$ improvement in accuracy on the visual sense disambiguation task. All our code and data are available at https://github.com/iacercalixto/visualsem-kg.
Submitted 27 June, 2022;
originally announced June 2022.
-
Debiasing Learning for Membership Inference Attacks Against Recommender Systems
Authors:
Zihan Wang,
Na Huang,
Fei Sun,
Pengjie Ren,
Zhumin Chen,
Hengliang Luo,
Maarten de Rijke,
Zhaochun Ren
Abstract:
Learned recommender systems may inadvertently leak information about their training data, leading to privacy violations. We investigate privacy threats faced by recommender systems through the lens of membership inference. In such attacks, an adversary aims to infer whether a user's data is used to train the target recommender. To achieve this, previous work has used a shadow recommender to derive training data for the attack model, and then predicts the membership by calculating difference vectors between users' historical interactions and recommended items. State-of-the-art methods face two challenging problems: (1) training data for the attack model is biased due to the gap between shadow and target recommenders, and (2) hidden states in recommenders are not observational, resulting in inaccurate estimations of difference vectors. To address the above limitations, we propose a Debiasing Learning for Membership Inference Attacks against recommender systems (DL-MIA) framework that has four main components: (1) a difference vector generator, (2) a disentangled encoder, (3) a weight estimator, and (4) an attack model. To mitigate the gap between recommenders, a variational auto-encoder (VAE) based disentangled encoder is devised to identify recommender invariant and specific features. To reduce the estimation bias, we design a weight estimator, assigning a truth-level score for each difference vector to indicate estimation accuracy. We evaluate DL-MIA against both general recommenders and sequential recommenders on three real-world datasets. Experimental results show that DL-MIA effectively alleviates training and estimation biases simultaneously, and achieves state-of-the-art attack performance.
Submitted 28 June, 2022; v1 submitted 24 June, 2022;
originally announced June 2022.
-
A semi-conjugate gradient method for solving unsymmetric positive definite linear systems
Authors:
Na Huang,
Yu-Hong Dai,
Dominique Orban,
Michael A Saunders
Abstract:
The conjugate gradient (CG) method is a classic Krylov subspace method for solving symmetric positive definite linear systems. We introduce an analogous semi-conjugate gradient (SCG) method for unsymmetric positive definite linear systems. Unlike CG, SCG requires the solution of a lower triangular linear system to produce each semi-conjugate direction. We prove that SCG is theoretically equivalent to the full orthogonalization method (FOM), which is based on the Arnoldi process and converges in a finite number of steps. Because SCG's triangular system increases in size each iteration, we study a sliding window implementation (SWI) to improve efficiency, and show that the directions produced are still locally semi-conjugate. A counterexample illustrates that SWI is different from the direct incomplete orthogonalization method (DIOM), which is FOM with a sliding window. Numerical experiments from the convection-diffusion equation and other applications show that SCG is robust and that the sliding window implementation SWI allows SCG to solve large systems efficiently.
Submitted 8 June, 2022; v1 submitted 6 June, 2022;
originally announced June 2022.
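For orientation, the classic CG baseline mentioned in the abstract above can be sketched as follows. This is the textbook method for symmetric positive definite systems, not the authors' SCG or its sliding window variant; the function names `matvec`, `dot`, and `conjugate_gradient` and the dense list-of-lists matrix format are assumptions made for illustration.

```python
def matvec(A, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A, b, tol=1e-12, max_iter=None):
    """Textbook CG for a symmetric positive definite matrix A (assumed, not checked)."""
    n = len(b)
    x = [0.0] * n
    r = [float(bi) for bi in b]   # residual b - A x for the starting guess x = 0
    p = r[:]                      # first search direction is the residual
    rs = dot(r, r)
    for _ in range(max_iter if max_iter is not None else n):
        if rs ** 0.5 < tol:
            break
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)   # exact line search along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        # New direction is A-conjugate to the previous one.
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x
```

Calling `conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])` solves a small symmetric positive definite system. In exact arithmetic CG terminates in at most as many steps as there are unknowns, which is why the default iteration cap is `n`.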
-
Deep Learning with Label Noise: A Hierarchical Approach
Authors:
Li Chen,
Ningyuan Huang,
Cong Mu,
Hayden S. Helm,
Kate Lytvynets,
Weiwei Yang,
Carey E. Priebe
Abstract:
Deep neural networks are susceptible to label noise. Existing methods to improve robustness, such as meta-learning and regularization, usually require significant change to the network architecture or careful tuning of the optimization procedure. In this work, we propose a simple hierarchical approach that incorporates a label hierarchy when training the deep learning models. Our approach requires no change of the network architecture or the optimization procedure. We investigate our hierarchical network through a wide range of simulated and real datasets and various label noise types. Our hierarchical approach improves upon regular deep neural networks in learning with label noise. Combining our hierarchical approach with pre-trained models achieves state-of-the-art performance in real-world noisy datasets.
Submitted 27 May, 2022;
originally announced May 2022.
-
Time-Frequency Mask Aware Bi-directional LSTM: A Deep Learning Approach for Underwater Acoustic Signal Separation
Authors:
Jie Chen,
Chang Liu,
Jiawu Xie,
Jie An,
Nan Huang
Abstract:
Underwater acoustic signal separation is a key technique for underwater communications. Existing methods are mostly model-based and cannot accurately characterise the practical underwater acoustic communication environment. They are only suitable for binary signal separation and cannot handle multivariate signal separation. On the other hand, recurrent neural networks (RNNs) show a powerful capability for extracting the features of temporal sequences. Inspired by this, we present a data-driven approach for underwater acoustic signal separation using deep learning. We use a Bi-directional Long Short-Term Memory (Bi-LSTM) network to explore the features of the Time-Frequency (T-F) mask, and propose a T-F mask aware Bi-LSTM for signal separation. Taking advantage of the sparseness of the T-F image, the designed Bi-LSTM network is able to extract discriminative features for separation, which further improves the separation performance. In particular, this method overcomes the limitations of existing methods: it not only achieves good results in multivariate separation, but also effectively separates signals when mixed with 40dB Gaussian noise signals. The experimental results show that this method can achieve a $97\%$ guarantee ratio (PSR), and the average similarity coefficient of the multivariate signal separation is stable above 0.8 under high noise conditions.
Submitted 9 February, 2022;
originally announced February 2022.
-
A Short Tutorial on The Weisfeiler-Lehman Test And Its Variants
Authors:
Ningyuan Huang,
Soledad Villar
Abstract:
Graph neural networks are designed to learn functions on graphs. Typically, the relevant target functions are invariant with respect to actions by permutations. Therefore the design of some graph neural network architectures has been inspired by graph-isomorphism algorithms. The classical Weisfeiler-Lehman algorithm (WL) -- a graph-isomorphism test based on color refinement -- became relevant to the study of graph neural networks. The WL test can be generalized to a hierarchy of higher-order tests, known as $k$-WL. This hierarchy has been used to characterize the expressive power of graph neural networks, and to inspire the design of graph neural network architectures. A few variants of the WL hierarchy appear in the literature. The goal of this short note is pedagogical and practical: We explain the differences between the WL and folklore-WL formulations, with pointers to existing discussions in the literature. We illuminate the differences between the formulations by visualizing an example.
Submitted 1 November, 2022; v1 submitted 18 January, 2022;
originally announced January 2022.
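The color refinement procedure behind the classical (1-dimensional) WL test described in the abstract above can be sketched as follows. This is a minimal illustration, not code from the tutorial: the function names `wl_refine` and `wl_distinguishes` and the adjacency-dictionary input format are assumptions, and the higher-order $k$-WL and folklore-WL variants are not covered.

```python
from collections import Counter


def wl_refine(adj):
    """Run color refinement on a graph given as {node: list of neighbours}."""
    colors = {v: 0 for v in adj}  # every node starts with the same color
    for _ in range(len(adj)):
        # A node's signature is its own color plus the multiset of neighbour colors.
        sig = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v]))) for v in adj}
        palette = {s: i for i, s in enumerate(sorted(set(sig.values())))}
        refined = {v: palette[sig[v]] for v in adj}
        # Refinement only ever splits classes, so an unchanged class count
        # means the partition is stable.
        if len(set(refined.values())) == len(set(colors.values())):
            break
        colors = refined
    return colors


def wl_distinguishes(adj_a, adj_b):
    """True if color refinement proves the two graphs are non-isomorphic."""
    # Refine both graphs jointly (as a disjoint union) so that colors are comparable.
    union = {("a", v): [("a", u) for u in nbrs] for v, nbrs in adj_a.items()}
    union.update({("b", v): [("b", u) for u in nbrs] for v, nbrs in adj_b.items()})
    colors = wl_refine(union)
    hist_a = Counter(c for (g, _), c in colors.items() if g == "a")
    hist_b = Counter(c for (g, _), c in colors.items() if g == "b")
    return hist_a != hist_b
```

A `False` result does not imply isomorphism. A 6-cycle and two disjoint triangles, for example, are both 2-regular on six nodes, so color refinement assigns every node the same color and cannot separate them.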
-
Improved Receivers for Optical Wireless OFDM: An Information Theoretic Perspective
Authors:
Xiaozhen Liu,
Jing Zhou,
Nuo Huang,
Wenyi Zhang
Abstract:
We consider performance enhancement of asymmetrically-clipped optical orthogonal frequency division multiplexing (ACO-OFDM) and related optical OFDM schemes, which are variations of OFDM in intensity-modulated optical wireless communications. Unlike most existing studies on specific designs of improved receivers, this paper investigates information theoretic limits of all possible receivers. For independent and identically distributed complex Gaussian inputs, we obtain an exact characterization of information rate of ACO-OFDM with improved receivers for all SNRs. It is proved that the high-SNR gain of improved receivers asymptotically achieves 1/4 bits per channel use, which is equivalent to 3 dB in electrical SNR or 1.5 dB in optical SNR; as the SNR decreases, the maximum achievable SNR gain of improved receivers decreases monotonically to a non-zero low-SNR limit, corresponding to an information rate gain of 36.3%. For practically used constellations, we derive an upper bound on the gain of improved receivers. Numerical results demonstrate that the upper bound can be approached to within 1 dB in optical SNR by combining existing improved receivers and coded modulation. We also show that our information theoretic analyses can be extended to Flip-OFDM and PAM-DMT. Our results imply that, for the considered schemes, improved receivers may reduce the gap to channel capacity significantly at low-to-moderate SNR.
Submitted 4 May, 2022; v1 submitted 18 January, 2022;
originally announced January 2022.
-
On Exploring Pose Estimation as an Auxiliary Learning Task for Visible-Infrared Person Re-identification
Authors:
Yunqi Miao,
Nianchang Huang,
Xiao Ma,
Qiang Zhang,
Jungong Han
Abstract:
Visible-infrared person re-identification (VI-ReID) has been challenging due to the existence of large discrepancies between visible and infrared modalities. Most pioneering approaches reduce intra-class variations and inter-modality discrepancies by learning modality-shared and ID-related features. However, an explicit modality-shared cue, i.e., body keypoints, has not been fully exploited in VI-ReID. Additionally, existing feature learning paradigms impose constraints on either global features or partitioned feature stripes, which neglects the prediction consistency of global and part features. To address the above problems, we exploit Pose Estimation as an auxiliary learning task to assist the VI-ReID task in an end-to-end framework. By jointly training these two tasks in a mutually beneficial manner, our model learns higher quality modality-shared and ID-related features. On top of this, the learning of global features and local features is seamlessly synchronized by a Hierarchical Feature Constraint (HFC), where the former supervises the latter using the knowledge distillation strategy. Experimental results on two benchmark VI-ReID datasets show that the proposed method consistently improves on state-of-the-art methods by significant margins. Specifically, our method achieves nearly 20$\%$ mAP improvement over the state-of-the-art method on the RegDB dataset. Our intriguing findings highlight the usage of auxiliary task learning in VI-ReID.
Submitted 23 February, 2022; v1 submitted 11 January, 2022;
originally announced January 2022.
-
Map-Assisted Constellation Design for mmWave WDM with OAM in Short-Range LOS Environment
Authors:
Yuan Wang,
Chen Gong,
Nuo Huang,
Zhengyuan Xu
Abstract:
We consider a system that integrates positioning and single-user millimeter wave (mmWave) communication, where the communication part adopts wavelength division multiplexing (WDM) and orbital angular momentum (OAM). This paper addresses the multi-dimensional constellation design in a short-range line-of-sight (LOS) environment, with stable communication links. We propose a map-assisted method to quantify the system parameters based on positions and reduce real-time computing overhead. We explore the possibility of using a few patterns in the maps, and investigate its performance loss. We first investigate the features of OAM beams, and find that the link gain ratio between any two sub-channels remains unchanged at some positions. Then, we prove that a fixed constellation can be adopted for the positions where the link gain matrices are sufficiently close to be proportional. Moreover, we prove that the system can adopt a fixed power vector to generate a multidimensional constellation if the difference between the fixed power vector and the optimal power vector is small. Finally, we find that the constellation design for all receiver locations can be represented by a few constellation sets.
Submitted 11 October, 2022; v1 submitted 4 November, 2021;
originally announced November 2021.
-
Neural PPO-Clip Attains Global Optimality: A Hinge Loss Perspective
Authors:
Nai-Chieh Huang,
Ping-Chun Hsieh,
Kuo-Hao Ho,
Hsuan-Yu Yao,
Kai-Chun Hu,
Liang-Chun Ouyang,
I-Chen Wu
Abstract:
Policy optimization is a fundamental principle for designing reinforcement learning algorithms, and one example is the proximal policy optimization algorithm with a clipped surrogate objective (PPO-Clip), which has been popularly used in deep reinforcement learning due to its simplicity and effectiveness. Despite its superior empirical performance, PPO-Clip has not been justified via theoretical proof to date. In this paper, we establish the first global convergence rate of PPO-Clip under neural function approximation. We identify the fundamental challenges of analyzing PPO-Clip and address them with two core ideas: (i) We reinterpret PPO-Clip from the perspective of hinge loss, which connects policy improvement with solving a large-margin classification problem with hinge loss and offers a generalized version of the PPO-Clip objective. (ii) Based on the above viewpoint, we propose a two-step policy improvement scheme, which facilitates the convergence analysis by decoupling policy search from the complex neural policy parameterization with the help of entropic mirror descent and a regression-based policy update scheme. Moreover, our theoretical results provide the first characterization of the effect of the clipping mechanism on the convergence of PPO-Clip. Through experiments, we empirically validate the reinterpretation of PPO-Clip and the generalized objective with various classifiers on various RL benchmark tasks.
Submitted 31 August, 2022; v1 submitted 26 October, 2021;
originally announced October 2021.
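As background for the abstract above, the standard clipped surrogate objective of PPO-Clip can be sketched as follows. This is the widely used textbook form, not the hinge-loss reformulation or the two-step scheme proposed by the authors; the function name `ppo_clip_objective`, the plain-list inputs, and the default `eps=0.2` are assumptions made for illustration.

```python
def ppo_clip_objective(ratios, advantages, eps=0.2):
    """Mean clipped surrogate over a batch of samples.

    ratios:     probability ratios pi_new(a|s) / pi_old(a|s), one per sample
    advantages: advantage estimates, one per sample
    eps:        clipping range (illustrative default)
    """
    total = 0.0
    for r, a in zip(ratios, advantages):
        clipped = min(max(r, 1.0 - eps), 1.0 + eps)
        # Taking the minimum removes any incentive to move the ratio
        # outside [1 - eps, 1 + eps] in the direction that favours the objective.
        total += min(r * a, clipped * a)
    return total / len(ratios)
```

With a positive advantage the objective stops growing once the ratio exceeds `1 + eps`; with a negative advantage it stops growing once the ratio falls below `1 - eps`. This flat region is the clipping effect whose influence on convergence the paper characterizes.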
-
Pulmonary Vessel Segmentation based on Orthogonal Fused U-Net++ of Chest CT Images
Authors:
Hejie Cui,
Xinglong Liu,
Ning Huang
Abstract:
Pulmonary vessel segmentation is important for the clinical diagnosis of pulmonary diseases, but it is also challenging due to the complicated structure. In this work, we present an effective framework and refinement process of pulmonary vessel segmentation from chest computed tomographic (CT) images. The key to our approach is a 2.5D segmentation network applied from three orthogonal axes, which presents a robust and fully automated pulmonary vessel segmentation result with lower network complexity and memory usage compared to 3D networks. The slice radius is introduced to convolve the adjacent information of the center slice and the multi-planar fusion optimizes the presentation of intra- and inter-slice features. Besides, the tree-like structure of the pulmonary vessel is extracted in the post-processing step, which is used for segmentation refining and pruning. In the evaluation experiments, three fusion methods are tested and the most promising one is compared with the state-of-the-art 2D and 3D structures on 300 cases of lung images randomly selected from the LIDC dataset. Our method outperforms other network structures by a large margin and, to the best of our knowledge, achieves by far the highest average DICE score of 0.9272 and precision of 0.9310 among the pulmonary vessel segmentation models available in the literature.
Submitted 3 July, 2021;
originally announced July 2021.
-
Toward Drug-Target Interaction Prediction via Ensemble Modeling and Transfer Learning
Authors:
Po-Yu Kao,
Shu-Min Kao,
Nan-Lan Huang,
Yen-Chu Lin
Abstract:
Drug-target interaction (DTI) prediction plays a crucial role in drug discovery, and deep learning approaches have achieved state-of-the-art performance in this field. We introduce an ensemble of deep learning models (EnsembleDLM) for DTI prediction. EnsembleDLM only uses the sequence information of chemical compounds and proteins, and it aggregates the predictions from multiple deep neural networks. This approach not only achieves state-of-the-art performance in Davis and KIBA datasets but also reaches cutting-edge performance in the cross-domain applications across different bio-activity types and different protein classes. We also demonstrate that EnsembleDLM achieves a good performance (Pearson correlation coefficient and concordance index > 0.8) in the new domain with approximately 50% transfer learning data, i.e., the training set has twice as much data as the test set.
Submitted 18 November, 2021; v1 submitted 2 July, 2021;
originally announced July 2021.
-
Middle-level Fusion for Lightweight RGB-D Salient Object Detection
Authors:
Nianchang Huang,
Qiang Zhang,
Jungong Han
Abstract:
Most existing lightweight RGB-D salient object detection (SOD) models are based on a two-stream structure or a single-stream structure. The former first uses two sub-networks to extract unimodal features from RGB and depth images, respectively, and then fuses them for SOD, while the latter directly extracts multi-modal features from the input RGB-D images and then focuses on exploiting cross-level complementary information. However, two-stream structure based models inevitably require more parameters, and single-stream structure based ones cannot exploit the cross-modal complementary information well since they ignore the modality difference. To address these issues, we propose to employ the middle-level fusion structure for designing a lightweight RGB-D SOD model in this paper, which first employs two sub-networks to extract low- and middle-level unimodal features, respectively, and then fuses those extracted middle-level unimodal features for extracting corresponding high-level multi-modal features in the subsequent sub-network. Different from existing models, this structure can effectively exploit the cross-modal complementary information and significantly reduce the network's parameters simultaneously. Therefore, a novel lightweight SOD model is designed, which contains an information-aware multi-modal feature fusion (IMFF) module for effectively capturing the cross-modal complementary information and a lightweight feature-level and decision-level feature fusion (LFDF) module for aggregating the feature-level and the decision-level saliency information in different stages with fewer parameters. Our proposed model has only 3.9M parameters and runs at 33 FPS. The experimental results on several benchmark datasets verify the effectiveness and superiority of the proposed method over some state-of-the-art methods.
Submitted 5 June, 2021; v1 submitted 23 April, 2021;
originally announced April 2021.
-
Exploring Modality-shared Appearance Features and Modality-invariant Relation Features for Cross-modality Person Re-Identification
Authors:
Nianchang Huang,
Jianan Liu,
Qiang Zhang,
Jungong Han
Abstract:
Most existing cross-modality person re-identification works rely on discriminative modality-shared features for reducing cross-modality variations and intra-modality variations. Despite some initial success, such modality-shared appearance features cannot capture enough modality-invariant discriminative information due to a massive discrepancy between RGB and infrared images. To address this issue, on top of appearance features, we further capture the modality-invariant relations among different person parts (referred to as modality-invariant relation features), which are complementary to those modality-shared appearance features and help to identify persons with similar appearances but different body shapes. To this end, a Multi-level Two-streamed Modality-shared Feature Extraction (MTMFE) sub-network is designed, where the modality-shared appearance features and modality-invariant relation features are first extracted in a shared 2D feature space and a shared 3D feature space, respectively. The two features are then fused into the final modality-shared features such that both cross-modality variations and intra-modality variations can be reduced. Besides, a novel cross-modality quadruplet loss is proposed to further reduce the cross-modality variations. Experimental results on several benchmark datasets demonstrate that our proposed method exceeds state-of-the-art algorithms by a noticeable margin.
Submitted 23 April, 2021;
originally announced April 2021.
-
DeepSZ: Identification of Sunyaev-Zel'dovich Galaxy Clusters using Deep Learning
Authors:
Zhen Lin,
Nicholas Huang,
Camille Avestruz,
W. L. Kimmy Wu,
Shubhendu Trivedi,
João Caldeira,
Brian Nord
Abstract:
Galaxy clusters identified from the Sunyaev Zel'dovich (SZ) effect are a key ingredient in multi-wavelength cluster-based cosmology. We present a comparison between two methods of cluster identification: the standard Matched Filter (MF) method in SZ cluster finding and a method using Convolutional Neural Networks (CNN). We further implement and show results for a `combined' identifier. We apply the methods to simulated millimeter maps for several observing frequencies for an SPT-3G-like survey. There are some key differences between the methods. The MF method requires image pre-processing to remove point sources and a model for the noise, while the CNN method requires very little pre-processing of images. Additionally, the CNN requires tuning of hyperparameters in the model and takes as input, cutout images of the sky. Specifically, we use the CNN to classify whether or not an 8 arcmin $\times$ 8 arcmin cutout of the sky contains a cluster. We compare differences in purity and completeness. The MF signal-to-noise ratio depends on both mass and redshift. Our CNN, trained for a given mass threshold, captures a different set of clusters than the MF, some of which have SNR below the MF detection threshold. However, the CNN tends to mis-classify cutouts whose clusters are located near the edge of the cutout, which can be mitigated with staggered cutouts. We leverage the complementarity of the two methods, combining the scores from each method for identification. The purity and completeness of the MF alone are both 0.61, assuming a standard detection threshold. The purity and completeness of the CNN alone are 0.59 and 0.61. The combined classification method yields 0.60 and 0.77, a significant increase for completeness with a modest decrease in purity. We advocate for combined methods that increase the confidence of many lower signal-to-noise clusters.
Submitted 8 March, 2021; v1 submitted 25 February, 2021;
originally announced February 2021.
-
Dimensionality reduction, regularization, and generalization in overparameterized regressions
Authors:
Ningyuan Huang,
David W. Hogg,
Soledad Villar
Abstract:
Overparameterization in deep learning is powerful: Very large models fit the training data perfectly and yet often generalize well. This realization brought back the study of linear models for regression, including ordinary least squares (OLS), which, like deep learning, shows a "double-descent" behavior: (1) The risk (expected out-of-sample prediction error) can grow arbitrarily when the number of parameters $p$ approaches the number of samples $n$, and (2) the risk decreases with $p$ for $p>n$, sometimes achieving a lower value than the lowest risk for $p<n$. The divergence of the risk for OLS can be avoided with regularization. In this work, we show that for some data models it can also be avoided with a PCA-based dimensionality reduction (PCA-OLS, also known as principal component regression). We provide non-asymptotic bounds for the risk of PCA-OLS by considering the alignments of the population and empirical principal components. We show that dimensionality reduction improves robustness while OLS is arbitrarily susceptible to adversarial attacks, particularly in the overparameterized regime. We compare PCA-OLS theoretically and empirically with a wide range of projection-based methods, including random projections, partial least squares (PLS), and certain classes of linear two-layer neural networks. These comparisons are made for different data generation models to assess the sensitivity to signal-to-noise and the alignment of regression coefficients with the features. We find that methods in which the projection depends on the training data can outperform methods where the projections are chosen independently of the training data, even those with oracle knowledge of population quantities, another seemingly paradoxical phenomenon that has been identified previously. This suggests that overparameterization may not be necessary for good generalization.
Submitted 19 October, 2021; v1 submitted 23 November, 2020;
originally announced November 2020.
-
A Simple Spectral Failure Mode for Graph Convolutional Networks
Authors:
Carey E. Priebe,
Cencheng Shen,
Ningyuan Huang,
Tianyi Chen
Abstract:
Neural networks have achieved remarkable successes in machine learning tasks. This has recently been extended to graph learning using neural networks. However, there is limited theoretical work in understanding how and when they perform well, especially relative to established statistical learning techniques such as spectral embedding. In this short paper, we present a simple generative model where an unsupervised graph convolutional network fails, while the adjacency spectral embedding succeeds. Specifically, the unsupervised graph convolutional network is unable to look beyond the first eigenvector in certain approximately regular graphs, thus missing inference signals in non-leading eigenvectors. The phenomenon is demonstrated by visual illustrations and comprehensive simulations.
Submitted 11 August, 2021; v1 submitted 25 October, 2020;
originally announced October 2020.