Search | arXiv e-print repository

arXiv:2405.07551 [pdf, other]

MuMath-Code: Combining Tool-Use Large Language Models with Multi-perspective Data Augmentation for Mathematical Reasoning

Authors: Shuo Yin, Weihao You, Zhilong Ji, Guoqiang Zhong, Jinfeng Bai

Abstract: The tool-use Large Language Models (LLMs) that integrate with external Python interpreters have significantly enhanced mathematical reasoning capabilities for open-source LLMs, while tool-free methods chose another track: augmenting math reasoning data. However, a great method to integrate the above two research paths and combine their advantages remains to be explored. In this work, we firstly in… ▽ More The tool-use Large Language Models (LLMs) that integrate with external Python interpreters have significantly enhanced mathematical reasoning capabilities for open-source LLMs, while tool-free methods chose another track: augmenting math reasoning data. However, a great method to integrate the above two research paths and combine their advantages remains to be explored. In this work, we firstly include new math questions via multi-perspective data augmenting methods and then synthesize code-nested solutions to them. The open LLMs (i.e., Llama-2) are finetuned on the augmented dataset to get the resulting models, MuMath-Code ($μ$-Math-Code). During the inference phase, our MuMath-Code generates code and interacts with the external python interpreter to get the execution results. Therefore, MuMath-Code leverages the advantages of both the external tool and data augmentation. To fully leverage the advantages of our augmented data, we propose a two-stage training strategy: In Stage-1, we finetune Llama-2 on pure CoT data to get an intermediate model, which then is trained on the code-nested data in Stage-2 to get the resulting MuMath-Code. Our MuMath-Code-7B achieves 83.8 on GSM8K and 52.4 on MATH, while MuMath-Code-70B model achieves new state-of-the-art performance among open methods -- achieving 90.7% on GSM8K and 55.1% on MATH. Extensive experiments validate the combination of tool use and data augmentation, as well as our two-stage training strategy. We release the proposed dataset along with the associated code for public use. △ Less

Submitted 13 May, 2024; originally announced May 2024.

Comments: The state-of-the-art open-source tool-use LLMs for mathematical reasoning

arXiv:2404.01194 [pdf, other]

Adaptive Query Prompting for Multi-Domain Landmark Detection

Authors: Qiusen Wei, Guoheng Huang, Xiaochen Yuan, Xuhang Chen, Guo Zhong, Jianwen Huang, Jiajie Huang

Abstract: Medical landmark detection is crucial in various medical imaging modalities and procedures. Although deep learning-based methods have achieve promising performance, they are mostly designed for specific anatomical regions or tasks. In this work, we propose a universal model for multi-domain landmark detection by leveraging transformer architecture and developing a prompting component, named as Ada… ▽ More Medical landmark detection is crucial in various medical imaging modalities and procedures. Although deep learning-based methods have achieve promising performance, they are mostly designed for specific anatomical regions or tasks. In this work, we propose a universal model for multi-domain landmark detection by leveraging transformer architecture and developing a prompting component, named as Adaptive Query Prompting (AQP). Instead of embedding additional modules in the backbone network, we design a separate module to generate prompts that can be effectively extended to any other transformer network. In our proposed AQP, prompts are learnable parameters maintained in a memory space called prompt pool. The central idea is to keep the backbone frozen and then optimize prompts to instruct the model inference process. Furthermore, we employ a lightweight decoder to decode landmarks from the extracted features, namely Light-MLD. Thanks to the lightweight nature of the decoder and AQP, we can handle multiple datasets by sharing the backbone encoder and then only perform partial parameter tuning without incurring much additional cost. It has the potential to be extended to more landmark detection tasks. We conduct experiments on three widely used X-ray datasets for different medical landmark detection tasks. Our proposed Light-MLD coupled with AQP achieves SOTA performance on many metrics even without the use of elaborate structural designs or complex frameworks. △ Less

Submitted 1 April, 2024; originally announced April 2024.

arXiv:2404.01127 [pdf, other]

Medical Visual Prompting (MVP): A Unified Framework for Versatile and High-Quality Medical Image Segmentation

Authors: Yulin Chen, Guoheng Huang, Kai Huang, Zijin Lin, Guo Zhong, Shenghong Luo, Jie Deng, Jian Zhou

Abstract: Accurate segmentation of lesion regions is crucial for clinical diagnosis and treatment across various diseases. While deep convolutional networks have achieved satisfactory results in medical image segmentation, they face challenges such as loss of lesion shape information due to continuous convolution and downsampling, as well as the high cost of manually labeling lesions with varying shapes and… ▽ More Accurate segmentation of lesion regions is crucial for clinical diagnosis and treatment across various diseases. While deep convolutional networks have achieved satisfactory results in medical image segmentation, they face challenges such as loss of lesion shape information due to continuous convolution and downsampling, as well as the high cost of manually labeling lesions with varying shapes and sizes. To address these issues, we propose a novel medical visual prompting (MVP) framework that leverages pre-training and prompting concepts from natural language processing (NLP). The framework utilizes three key components: Super-Pixel Guided Prompting (SPGP) for superpixelating the input image, Image Embedding Guided Prompting (IEGP) for freezing patch embedding and merging with superpixels to provide visual prompts, and Adaptive Attention Mechanism Guided Prompting (AAGP) for pinpointing prompt content and efficiently adapting all layers. By integrating SPGP, IEGP, and AAGP, the MVP enables the segmentation network to better learn shape prompting information and facilitates mutual learning across different tasks. Extensive experiments conducted on five datasets demonstrate superior performance of this method in various challenging medical image tasks, while simplifying single-task medical segmentation models. This novel framework offers improved performance with fewer parameters and holds significant potential for accurate segmentation of lesion regions in various medical tasks, making it clinically valuable. △ Less

Submitted 1 April, 2024; originally announced April 2024.

arXiv:2312.06207 [pdf, other]

A Primer on RecoNIC: RDMA-enabled Compute Offloading on SmartNIC

Authors: Guanwen Zhong, Aditya Kolekar, Burin Amornpaisannon, Inho Choi, Haris Javaid, Mario Baldi

Abstract: Today's data centers consist of thousands of network-connected hosts, each with CPUs and accelerators such as GPUs and FPGAs. These hosts also contain network interface cards (NICs), operating at speeds of 100Gb/s or higher, that are used to communicate with each other. We propose RecoNIC, an FPGA-based RDMA-enabled SmartNIC platform that is designed for compute acceleration while minimizing the o… ▽ More Today's data centers consist of thousands of network-connected hosts, each with CPUs and accelerators such as GPUs and FPGAs. These hosts also contain network interface cards (NICs), operating at speeds of 100Gb/s or higher, that are used to communicate with each other. We propose RecoNIC, an FPGA-based RDMA-enabled SmartNIC platform that is designed for compute acceleration while minimizing the overhead associated with data copies (in CPU-centric accelerator systems) by bringing network data as close to computation as possible. Since RDMA is the defacto transport-layer protocol for improved communication in data center workloads, RecoNIC includes an RDMA offload engine for high throughput and low latency data transfers. Developers have the flexibility to design their accelerators using RTL, HLS or Vitis Networking P4 within the RecoNIC's programmable compute blocks. These compute blocks can access host memory as well as memory in remote peers through the RDMA offload engine. Furthermore, the RDMA offload engine is shared by both the host and compute blocks, which makes RecoNIC a very flexible platform. Lastly, we have open-sourced RecoNIC for the research community to enable experimentation with RDMA-based applications and use-cases. △ Less

Submitted 11 December, 2023; originally announced December 2023.

Comments: RecoNIC is available at https://github.com/Xilinx/RecoNIC

arXiv:2308.11029 [pdf, other]

doi 10.1109/TASLP.2023.3284509

RBA-GCN: Relational Bilevel Aggregation Graph Convolutional Network for Emotion Recognition

Authors: Lin Yuan, Guoheng Huang, Fenghuan Li, Xiaochen Yuan, Chi-Man Pun, Guo Zhong

Abstract: Emotion recognition in conversation (ERC) has received increasing attention from researchers due to its wide range of applications.As conversation has a natural graph structure,numerous approaches used to model ERC based on graph convolutional networks (GCNs) have yielded significant results.However,the aggregation approach of traditional GCNs suffers from the node information redundancy problem,l… ▽ More Emotion recognition in conversation (ERC) has received increasing attention from researchers due to its wide range of applications.As conversation has a natural graph structure,numerous approaches used to model ERC based on graph convolutional networks (GCNs) have yielded significant results.However,the aggregation approach of traditional GCNs suffers from the node information redundancy problem,leading to node discriminant information loss.Additionally,single-layer GCNs lack the capacity to capture long-range contextual information from the graph. Furthermore,the majority of approaches are based on textual modality or stitching together different modalities, resulting in a weak ability to capture interactions between modalities. To address these problems, we present the relational bilevel aggregation graph convolutional network (RBA-GCN), which consists of three modules: the graph generation module (GGM), similarity-based cluster building module (SCBM) and bilevel aggregation module (BiAM). First, GGM constructs a novel graph to reduce the redundancy of target node information.Then,SCBM calculates the node similarity in the target node and its structural neighborhood, where noisy information with low similarity is filtered out to preserve the discriminant information of the node. Meanwhile, BiAM is a novel aggregation method that can preserve the information of nodes during the aggregation process. This module can construct the interaction between different modalities and capture long-range contextual information based on similarity clusters. On both the IEMOCAP and MELD datasets, the weighted average F1 score of RBA-GCN has a 2.17$\sim$5.21\% improvement over that of the most advanced method.Our code is available at https://github.com/luftmenscher/RBA-GCN and our article with the same name has been published in IEEE/ACM Transactions on Audio,Speech,and Language Processing,vol.31,2023 △ Less

Submitted 31 August, 2023; v1 submitted 18 August, 2023; originally announced August 2023.

Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2325-2337,2023

arXiv:2308.01147 [pdf, other]

Contrast-augmented Diffusion Model with Fine-grained Sequence Alignment for Markup-to-Image Generation

Authors: Guojin Zhong, Jin Yuan, Pan Wang, Kailun Yang, Weili Guan, Zhiyong Li

Abstract: The recently rising markup-to-image generation poses greater challenges as compared to natural image generation, due to its low tolerance for errors as well as the complex sequence and context correlations between markup and rendered image. This paper proposes a novel model named "Contrast-augmented Diffusion Model with Fine-grained Sequence Alignment" (FSA-CDM), which introduces contrastive posit… ▽ More The recently rising markup-to-image generation poses greater challenges as compared to natural image generation, due to its low tolerance for errors as well as the complex sequence and context correlations between markup and rendered image. This paper proposes a novel model named "Contrast-augmented Diffusion Model with Fine-grained Sequence Alignment" (FSA-CDM), which introduces contrastive positive/negative samples into the diffusion model to boost performance for markup-to-image generation. Technically, we design a fine-grained cross-modal alignment module to well explore the sequence similarity between the two modalities for learning robust feature representations. To improve the generalization ability, we propose a contrast-augmented diffusion model to explicitly explore positive and negative samples by maximizing a novel contrastive variational objective, which is mathematically inferred to provide a tighter bound for the model's optimization. Moreover, the context-aware cross attention module is developed to capture the contextual information within markup language during the denoising process, yielding better noise prediction results. Extensive experiments are conducted on four benchmark datasets from different domains, and the experimental results demonstrate the effectiveness of the proposed components in FSA-CDM, significantly exceeding state-of-the-art performance by about 2%-12% DTW improvements. The code will be released at https://github.com/zgj77/FSACDM. △ Less

Submitted 2 August, 2023; originally announced August 2023.

Comments: Accepted to ACM MM 2023. The code will be released at https://github.com/zgj77/FSACDM

arXiv:2307.14132 [pdf, other]

CIF-T: A Novel CIF-based Transducer Architecture for Automatic Speech Recognition

Authors: Tian-Hao Zhang, Dinghao Zhou, Guiping Zhong, Jiaming Zhou, Baoxiang Li

Abstract: RNN-T models are widely used in ASR, which rely on the RNN-T loss to achieve length alignment between input audio and target sequence. However, the implementation complexity and the alignment-based optimization target of RNN-T loss lead to computational redundancy and a reduced role for predictor network, respectively. In this paper, we propose a novel model named CIF-Transducer (CIF-T) which inco… ▽ More RNN-T models are widely used in ASR, which rely on the RNN-T loss to achieve length alignment between input audio and target sequence. However, the implementation complexity and the alignment-based optimization target of RNN-T loss lead to computational redundancy and a reduced role for predictor network, respectively. In this paper, we propose a novel model named CIF-Transducer (CIF-T) which incorporates the Continuous Integrate-and-Fire (CIF) mechanism with the RNN-T model to achieve efficient alignment. In this way, the RNN-T loss is abandoned, thus bringing a computational reduction and allowing the predictor network a more significant role. We also introduce Funnel-CIF, Context Blocks, Unified Gating and Bilinear Pooling joint network, and auxiliary training strategy to further improve performance. Experiments on the 178-hour AISHELL-1 and 10000-hour WenetSpeech datasets show that CIF-T achieves state-of-the-art results with lower computational overhead compared to RNN-T models. △ Less

Submitted 14 December, 2023; v1 submitted 26 July, 2023; originally announced July 2023.

Comments: Accepted by ICASSP 2024

arXiv:2304.14131 [pdf, other]

doi 10.1109/TGRS.2023.3311510

TempEE: Temporal-Spatial Parallel Transformer for Radar Echo Extrapolation Beyond Auto-Regression

Authors: Shengchao Chen, Ting Shu, Huan Zhao, Guo Zhong, Xunlai Chen

Abstract: Meteorological radar reflectivity data (i.e. radar echo) significantly influences precipitation prediction. It can facilitate accurate and expeditious forecasting of short-term heavy rainfall bypassing the need for complex Numerical Weather Prediction (NWP) models. In comparison to conventional models, Deep Learning (DL)-based radar echo extrapolation algorithms exhibit higher effectiveness and ef… ▽ More Meteorological radar reflectivity data (i.e. radar echo) significantly influences precipitation prediction. It can facilitate accurate and expeditious forecasting of short-term heavy rainfall bypassing the need for complex Numerical Weather Prediction (NWP) models. In comparison to conventional models, Deep Learning (DL)-based radar echo extrapolation algorithms exhibit higher effectiveness and efficiency. Nevertheless, the development of reliable and generalized echo extrapolation algorithm is impeded by three primary challenges: cumulative error spreading, imprecise representation of sparsely distributed echoes, and inaccurate description of non-stationary motion processes. To tackle these challenges, this paper proposes a novel radar echo extrapolation algorithm called Temporal-Spatial Parallel Transformer, referred to as TempEE. TempEE avoids using auto-regression and instead employs a one-step forward strategy to prevent cumulative error spreading during the extrapolation process. Additionally, we propose the incorporation of a Multi-level Temporal-Spatial Attention mechanism to improve the algorithm's capability of capturing both global and local information while emphasizing task-related regions, including sparse echo representations, in an efficient manner. Furthermore, the algorithm extracts spatio-temporal representations from continuous echo images using a parallel encoder to model the non-stationary motion process for echo extrapolation. The superiority of our TempEE has been demonstrated in the context of the classic radar echo extrapolation task, utilizing a real-world dataset. Extensive experiments have further validated the efficacy and indispensability of various components within TempEE. △ Less

Submitted 14 September, 2023; v1 submitted 27 April, 2023; originally announced April 2023.

Comments: Have been accepted by IEEE Transactions on Geoscience and Remote Sensing, see https://ieeexplore.ieee.org/document/10238744

Journal ref: IEEE Transactions on Geoscience and Remote Sensing 61, 5108914 (2023)

arXiv:2304.04368 [pdf, other]

Locality Preserving Multiview Graph Hashing for Large Scale Remote Sensing Image Search

Authors: Wenyun Li, Guo Zhong, Xingyu Lu, Chi-Man Pun

Abstract: Hashing is very popular for remote sensing image search. This article proposes a multiview hashing with learnable parameters to retrieve the queried images for a large-scale remote sensing dataset. Existing methods always neglect that real-world remote sensing data lies on a low-dimensional manifold embedded in high-dimensional ambient space. Unlike previous methods, this article proposes to learn… ▽ More Hashing is very popular for remote sensing image search. This article proposes a multiview hashing with learnable parameters to retrieve the queried images for a large-scale remote sensing dataset. Existing methods always neglect that real-world remote sensing data lies on a low-dimensional manifold embedded in high-dimensional ambient space. Unlike previous methods, this article proposes to learn the consensus compact codes in a view-specific low-dimensional subspace. Furthermore, we have added a hyperparameter learnable module to avoid complex parameter tuning. In order to prove the effectiveness of our method, we carried out experiments on three widely used remote sensing data sets and compared them with seven state-of-the-art methods. Extensive experiments show that the proposed method can achieve competitive results compared to the other method. △ Less

Submitted 9 April, 2023; originally announced April 2023.

Comments: 5 pages,icassp accepted

arXiv:2303.13769 [pdf, other]

Unknown Sniffer for Object Detection: Don't Turn a Blind Eye to Unknown Objects

Authors: Wenteng Liang, Feng Xue, Yihao Liu, Guofeng Zhong, Anlong Ming

Abstract: The recently proposed open-world object and open-set detection have achieved a breakthrough in finding never-seen-before objects and distinguishing them from known ones. However, their studies on knowledge transfer from known classes to unknown ones are not deep enough, resulting in the scanty capability for detecting unknowns hidden in the background. In this paper, we propose the unknown sniffer… ▽ More The recently proposed open-world object and open-set detection have achieved a breakthrough in finding never-seen-before objects and distinguishing them from known ones. However, their studies on knowledge transfer from known classes to unknown ones are not deep enough, resulting in the scanty capability for detecting unknowns hidden in the background. In this paper, we propose the unknown sniffer (UnSniffer) to find both unknown and known objects. Firstly, the generalized object confidence (GOC) score is introduced, which only uses known samples for supervision and avoids improper suppression of unknowns in the background. Significantly, such confidence score learned from known objects can be generalized to unknown ones. Additionally, we propose a negative energy suppression loss to further suppress the non-object samples in the background. Next, the best box of each unknown is hard to obtain during inference due to lacking their semantic information in training. To solve this issue, we introduce a graph-based determination scheme to replace hand-designed non-maximum suppression (NMS) post-processing. Finally, we present the Unknown Object Detection Benchmark, the first publicly benchmark that encompasses precision evaluation for unknown detection to our knowledge. Experiments show that our method is far better than the existing state-of-the-art methods. △ Less

Submitted 19 April, 2023; v1 submitted 23 March, 2023; originally announced March 2023.

Comments: CVPR 2023 camera-ready; Code: https://github.com/Went-Liang/UnSniffer Project: https://xuefengbupt.github.io/project_page/unsniffer_cvpr23.html Demo: https://www.bilibili.com/video/BV1xM4y1z7Hv Supplymentary: https://xuefengbupt.github.io/project_page/pdf/supplementary_cvpr2023.pdf

arXiv:2303.03707 [pdf]

Hybrid quantum-classical convolutional neural network for phytoplankton classification

Authors: Shangshang Shi, Zhimin Wang, Ruimin Shang, Yanan Li, Jiaxin Li, Guoqiang Zhong, Yongjian Gu

Abstract: The taxonomic composition and abundance of phytoplankton, having direct impact on marine ecosystem dynamic and global environment change, are listed as essential ocean variables. Phytoplankton classification is very crucial for Phytoplankton analysis, but it is very difficult because of the huge amount and tiny volume of Phytoplankton. Machine learning is the principle way of performing phytoplank… ▽ More The taxonomic composition and abundance of phytoplankton, having direct impact on marine ecosystem dynamic and global environment change, are listed as essential ocean variables. Phytoplankton classification is very crucial for Phytoplankton analysis, but it is very difficult because of the huge amount and tiny volume of Phytoplankton. Machine learning is the principle way of performing phytoplankton image classification automatically. When carrying out large-scale research on the marine phytoplankton, the volume of data increases overwhelmingly and more powerful computational resources are required for the success of machine learning algorithms. Recently, quantum machine learning has emerged as the potential solution for large-scale data processing by harnessing the exponentially computational power of quantum computer. Here, for the first time, we demonstrate the feasibility of quantum deep neural networks for phytoplankton classification. Hybrid quantum-classical convolutional and residual neural networks are developed based on the classical architectures. These models make a proper balance between the limited function of the current quantum devices and the large size of phytoplankton images, which make it possible to perform phytoplankton classification on the near-term quantum computers. Better performance is obtained by the quantum-enhanced models against the classical counterparts. In particular, quantum models converge much faster than classical ones. The present quantum models are versatile, and can be applied for various tasks of image classification in the field of marine science. △ Less

Submitted 7 March, 2023; originally announced March 2023.

Comments: 20 pages, 13 figures

arXiv:2302.03244 [pdf, other]

Quantum Recurrent Neural Networks for Sequential Learning

Authors: Yanan Li, Zhimin Wang, Rongbing Han, Shangshang Shi, Jiaxin Li, Ruimin Shang, Haiyong Zheng, Guoqiang Zhong, Yongjian Gu

Abstract: Quantum neural network (QNN) is one of the promising directions where the near-term noisy intermediate-scale quantum (NISQ) devices could find advantageous applications against classical resources. Recurrent neural networks are the most fundamental networks for sequential learning, but up to now there is still a lack of canonical model of quantum recurrent neural network (QRNN), which certainly re… ▽ More Quantum neural network (QNN) is one of the promising directions where the near-term noisy intermediate-scale quantum (NISQ) devices could find advantageous applications against classical resources. Recurrent neural networks are the most fundamental networks for sequential learning, but up to now there is still a lack of canonical model of quantum recurrent neural network (QRNN), which certainly restricts the research in the field of quantum deep learning. In the present work, we propose a new kind of QRNN which would be a good candidate as the canonical QRNN model, where, the quantum recurrent blocks (QRBs) are constructed in the hardware-efficient way, and the QRNN is built by stacking the QRBs in a staggered way that can greatly reduce the algorithm's requirement with regard to the coherent time of quantum devices. That is, our QRNN is much more accessible on NISQ devices. Furthermore, the performance of the present QRNN model is verified concretely using three different kinds of classical sequential data, i.e., meteorological indicators, stock price, and text categorization. The numerical experiments show that our QRNN achieves much better performance in prediction (classification) accuracy against the classical RNN and state-of-the-art QNN models for sequential learning, and can predict the changing details of temporal sequence data. The practical circuit structure and superior performance indicate that the present QRNN is a promising learning model to find quantum advantageous applications in the near term. △ Less

Submitted 6 February, 2023; originally announced February 2023.

arXiv:2203.15613 [pdf, other]

Dynamic Latency for CTC-Based Streaming Automatic Speech Recognition With Emformer

Authors: Jingyu Sun, Guiping Zhong, Dinghao Zhou, Baoxiang Li

Abstract: An inferior performance of the streaming automatic speech recognition models versus non-streaming model is frequently seen due to the absence of future context. In order to improve the performance of the streaming model and reduce the computational complexity, a frame-level model using efficient augment memory transformer block and dynamic latency training method is employed for streaming automati… ▽ More An inferior performance of the streaming automatic speech recognition models versus non-streaming model is frequently seen due to the absence of future context. In order to improve the performance of the streaming model and reduce the computational complexity, a frame-level model using efficient augment memory transformer block and dynamic latency training method is employed for streaming automatic speech recognition in this paper. The long-range history context is stored into the augment memory bank as a complement to the limited history context used in the encoder. Key and value are cached by a cache mechanism and reused for next chunk to reduce computation. Afterwards, a dynamic latency training method is proposed to obtain better performance and support low and high latency inference simultaneously. Our experiments are conducted on benchmark 960h LibriSpeech data set. With an average latency of 640ms, our model achieves a relative WER reduction of 6.0% on test-clean and 3.0% on test-other versus the truncate chunk-wise Transformer. △ Less

Submitted 29 March, 2022; originally announced March 2022.

Comments: 5 pages, 2 figures, submitted to interspeech 2022

arXiv:2203.15609 [pdf, other]

Locality Matters: A Locality-Biased Linear Attention for Automatic Speech Recognition

Authors: Jingyu Sun, Guiping Zhong, Dinghao Zhou, Baoxiang Li, Yiran Zhong

Abstract: Conformer has shown a great success in automatic speech recognition (ASR) on many public benchmarks. One of its crucial drawbacks is the quadratic time-space complexity with respect to the input sequence length, which prohibits the model to scale-up as well as process longer input audio sequences. To solve this issue, numerous linear attention methods have been proposed. However, these methods oft… ▽ More Conformer has shown a great success in automatic speech recognition (ASR) on many public benchmarks. One of its crucial drawbacks is the quadratic time-space complexity with respect to the input sequence length, which prohibits the model to scale-up as well as process longer input audio sequences. To solve this issue, numerous linear attention methods have been proposed. However, these methods often have limited performance on ASR as they treat tokens equally in modeling, neglecting the fact that the neighbouring tokens are often more connected than the distanced tokens. In this paper, we take this fact into account and propose a new locality-biased linear attention for Conformer. It not only achieves higher accuracy than the vanilla Conformer, but also enjoys linear space-time computational complexity. To be specific, we replace the softmax attention with a locality-biased linear attention (LBLA) mechanism in Conformer blocks. The LBLA contains a kernel function to ensure the linear complexities and a cosine reweighing matrix to impose more weights on neighbouring tokens. Extensive experiments on the LibriSpeech corpus show that by introducing this locality bias to the Conformer, our method achieves a lower word error rate with more than 22% inference speed. △ Less

Submitted 29 March, 2022; originally announced March 2022.

Comments: 5 pages, 2 figures, submitted to interspeech 2022

arXiv:2006.10250 [pdf, other]

Progressively Unfreezing Perceptual GAN

Authors: Jinxuan Sun, Yang Chen, Junyu Dong, Guoqiang Zhong

Abstract: Generative adversarial networks (GANs) are widely used in image generation tasks, yet the generated images are usually lack of texture details. In this paper, we propose a general framework, called Progressively Unfreezing Perceptual GAN (PUPGAN), which can generate images with fine texture details. Particularly, we propose an adaptive perceptual discriminator with a pre-trained perceptual feature… ▽ More Generative adversarial networks (GANs) are widely used in image generation tasks, yet the generated images are usually lack of texture details. In this paper, we propose a general framework, called Progressively Unfreezing Perceptual GAN (PUPGAN), which can generate images with fine texture details. Particularly, we propose an adaptive perceptual discriminator with a pre-trained perceptual feature extractor, which can efficiently measure the discrepancy between multi-level features of the generated and real images. In addition, we propose a progressively unfreezing scheme for the adaptive perceptual discriminator, which ensures a smooth transfer process from a large scale classification task to a specified image generation task. The qualitative and quantitative experiments with comparison to the classical baselines on three image generation tasks, i.e. single image super-resolution, paired image-to-image translation and unpaired image-to-image translation demonstrate the superiority of PUPGAN over the compared approaches. △ Less

Submitted 17 June, 2020; originally announced June 2020.

arXiv:1909.13411 [pdf, other]

SymmetricNet: A mesoscale eddy detection method based on multivariate fusion data

Authors: Zhenlin Fan, Guoqiang Zhong

Abstract: Mesoscale eddies play a significant role in marine energy transport, marine biological environment and marine climate. Due to their huge impact on the ocean, mesoscale eddy detection has become a hot research area in recent years. Therefore, more and more people are entering the field of mesoscale eddy detection. However, the existing detection methods mainly based on traditional detection methods… ▽ More Mesoscale eddies play a significant role in marine energy transport, marine biological environment and marine climate. Due to their huge impact on the ocean, mesoscale eddy detection has become a hot research area in recent years. Therefore, more and more people are entering the field of mesoscale eddy detection. However, the existing detection methods mainly based on traditional detection methods typically only use Sea Surface Height (SSH) as a variable to detect, resulting in inaccurate performance. In this paper, we propose a mesoscale eddy detection method based on multivariate fusion data to solve this problem. We not only use the SSH variable, but also add the two variables: Sea Surface Temperature (SST) and velocity of flow, achieving a multivariate information fusion input. We design a novel symmetric network, which merges low-level feature maps from the downsampling pathway and high-level feature maps from the upsampling pathway by lateral connection. In addition, we apply dilated convolutions to network structure to increase the receptive field and obtain more contextual information in the case of constant parameter. In the end, we demonstrate the effectiveness of our method on dataset provided by us, achieving the test set performance of 97.06% , greatly improved the performance of previous methods of mesoscale eddy detection. △ Less

Submitted 29 September, 2019; originally announced September 2019.

arXiv:1909.07827 [pdf, other]

Weak Edge Identification Nets for Ocean Front Detection

Authors: Qingyang Li, Guoqiang Zhong, Cui Xie

Abstract: The ocean front has an important impact in many areas, it is meaningful to obtain accurate ocean front positioning, therefore, ocean front detection is a very important task. However, the traditional edge detection algorithm does not detect the weak edge information of the ocean front very well. In response to this problem, we collected relevant ocean front gradient images and found relevant exper… ▽ More The ocean front has an important impact in many areas, it is meaningful to obtain accurate ocean front positioning, therefore, ocean front detection is a very important task. However, the traditional edge detection algorithm does not detect the weak edge information of the ocean front very well. In response to this problem, we collected relevant ocean front gradient images and found relevant experts to calibrate the ocean front data to obtain groundtruth, and proposed a weak edge identification nets(WEIN) for ocean front detection. Whether it is qualitative or quantitative, our methods perform best. The method uses a welltrained deep learning model to accurately extract the ocean front from the ocean front gradient image. The detection network is divided into multiple stages, and the final output is a multi-stage output image fusion. The method uses the stochastic gradient descent and the correlation loss function to obtain a good ocean front image output. △ Less

Submitted 17 September, 2019; originally announced September 2019.

arXiv:1812.01353 [pdf, other]

Structured Semantic Model supported Deep Neural Network for Click-Through Rate Prediction

Authors: Chenglei Niu, Guojing Zhong, Ying Liu, Yandong Zhang, Yongsheng Sun, Ailong He, Zhaoji Chen

Abstract: With the rapid development of online advertising and recommendation systems, click-through rate prediction is expected to play an increasingly important role.Recently many DNN-based models which follow a similar Embedding&MLP paradigm have been proposed, and have achieved good result in image/voice and nlp fields. In these methods the Wide&Deep model announced by Google plays a key role.Most model… ▽ More With the rapid development of online advertising and recommendation systems, click-through rate prediction is expected to play an increasingly important role.Recently many DNN-based models which follow a similar Embedding&MLP paradigm have been proposed, and have achieved good result in image/voice and nlp fields. In these methods the Wide&Deep model announced by Google plays a key role.Most models first map large scale sparse input features into low-dimensional vectors which are transformed to fixed-length vectors, then concatenated together before being fed into a multilayer perceptron (MLP) to learn non-linear relations among input features. The number of trainable variables normally grow dramatically the number of feature fields and the embedding dimension grow. It is a big challenge to get state-of-the-art result through training deep neural network and embedding together, which falls into local optimal or overfitting easily. In this paper, we propose an Structured Semantic Model (SSM) to tackles this challenge by designing a orthogonal base convolution and pooling model which adaptively learn the multi-scale base semantic representation between features supervised by the click label.The output of SSM are then used in the Wide&Deep for CTR prediction.Experiments on two public datasets as well as real Weibo production dataset with over 1 billion samples have demonstrated the effectiveness of our proposed approach with superior performance comparing to state-of-the-art methods. △ Less

Submitted 29 April, 2019; v1 submitted 4 December, 2018; originally announced December 2018.

arXiv:1811.02132 [pdf, other]

Student's t-Generative Adversarial Networks

Authors: Jinxuan Sun, Guoqiang Zhong, Yang Chen, Yongbin Liu, Tao Li, Zhongwen Guo

Abstract: Generative Adversarial Networks (GANs) have a great performance in image generation, but they need a large scale of data to train the entire framework, and often result in nonsensical results. We propose a new method referring to conditional GAN, which equipments the latent noise with mixture of Student's t-distribution with attention mechanism in addition to class information. Student's t-distrib… ▽ More Generative Adversarial Networks (GANs) have a great performance in image generation, but they need a large scale of data to train the entire framework, and often result in nonsensical results. We propose a new method referring to conditional GAN, which equipments the latent noise with mixture of Student's t-distribution with attention mechanism in addition to class information. Student's t-distribution has long tails that can provide more diversity to the latent noise. Meanwhile, the discriminator in our model implements two tasks simultaneously, judging whether the images come from the true data distribution, and identifying the class of each generated images. The parameters of the mixture model can be learned along with those of GANs. Moreover, we mathematically prove that any multivariate Student's t-distribution can be obtained by a linear transformation of a normal multivariate Student's t-distribution. Experiments comparing the proposed method with typical GAN, DeliGAN and DCGAN indicate that, our method has a great performance on generating diverse and legible objects with limited data. △ Less

Submitted 5 November, 2018; originally announced November 2018.

arXiv:1810.13155 [pdf, other]

Structure Learning of Deep Neural Networks with Q-Learning

Authors: Guoqiang Zhong, Wencong Jiao, Wei Gao

Abstract: Recently, with convolutional neural networks gaining significant achievements in many challenging machine learning fields, hand-crafted neural networks no longer satisfy our requirements as designing a network will cost a lot, and automatically generating architectures has attracted increasingly more attention and focus. Some research on auto-generated networks has achieved promising results. Howe… ▽ More Recently, with convolutional neural networks gaining significant achievements in many challenging machine learning fields, hand-crafted neural networks no longer satisfy our requirements as designing a network will cost a lot, and automatically generating architectures has attracted increasingly more attention and focus. Some research on auto-generated networks has achieved promising results. However, they mainly aim at picking a series of single layers such as convolution or pooling layers one by one. There are many elegant and creative designs in the carefully hand-crafted neural networks, such as Inception-block in GoogLeNet, residual block in residual network and dense block in dense convolutional network. Based on reinforcement learning and taking advantages of the superiority of these networks, we propose a novel automatic process to design a multi-block neural network, whose architecture contains multiple types of blocks mentioned above, with the purpose to do structure learning of deep neural networks and explore the possibility whether different blocks can be composed together to form a well-behaved neural network. The optimal network is created by the Q-learning agent who is trained to sequentially pick different types of blocks. To verify the validity of our proposed method, we use the auto-generated multi-block neural network to conduct experiments on image benchmark datasets MNIST, SVHN and CIFAR-10 image classification task with restricted computational resources. The results demonstrate that our method is very effective, achieving comparable or better performance than hand-crafted networks and advanced auto-generated neural networks. △ Less

Submitted 31 October, 2018; originally announced October 2018.

arXiv:1810.12754 [pdf, other]

Recurrent Attention Unit

Authors: Guoqiang Zhong, Guohua Yue, Xiao Ling

Abstract: Recurrent Neural Network (RNN) has been successfully applied in many sequence learning problems. Such as handwriting recognition, image description, natural language processing and video motion analysis. After years of development, researchers have improved the internal structure of the RNN and introduced many variants. Among others, Gated Recurrent Unit (GRU) is one of the most widely used RNN mo… ▽ More Recurrent Neural Network (RNN) has been successfully applied in many sequence learning problems. Such as handwriting recognition, image description, natural language processing and video motion analysis. After years of development, researchers have improved the internal structure of the RNN and introduced many variants. Among others, Gated Recurrent Unit (GRU) is one of the most widely used RNN model. However, GRU lacks the capability of adaptively paying attention to certain regions or locations, so that it may cause information redundancy or loss during leaning. In this paper, we propose a RNN model, called Recurrent Attention Unit (RAU), which seamlessly integrates the attention mechanism into the interior of GRU by adding an attention gate. The attention gate can enhance GRU's ability to remember long-term memory and help memory cells quickly discard unimportant content. RAU is capable of extracting information from the sequential data by adaptively selecting a sequence of regions or locations and pay more attention to the selected regions during learning. Extensive experiments on image classification, sentiment classification and language modeling show that RAU consistently outperforms GRU and other baseline methods. △ Less

Submitted 30 October, 2018; originally announced October 2018.

arXiv:1810.12752 [pdf, other]

Long Short-Term Attention

Authors: Guoqiang Zhong, Xin Lin, Kang Chen, Qingyang Li, Kaizhu Huang

Abstract: Attention is an important cognition process of humans, which helps humans concentrate on critical information during their perception and learning. However, although many machine learning models can remember information of data, they have no the attention mechanism. For example, the long short-term memory (LSTM) network is able to remember sequential information, but it cannot pay special attentio… ▽ More Attention is an important cognition process of humans, which helps humans concentrate on critical information during their perception and learning. However, although many machine learning models can remember information of data, they have no the attention mechanism. For example, the long short-term memory (LSTM) network is able to remember sequential information, but it cannot pay special attention to part of the sequences. In this paper, we present a novel model called long short-term attention (LSTA), which seamlessly integrates the attention mechanism into the inner cell of LSTM. More than processing long short term dependencies, LSTA can focus on important information of the sequences with the attention mechanism. Extensive experiments demonstrate that LSTA outperforms LSTM and related models on the sequence learning tasks. △ Less

Submitted 4 September, 2019; v1 submitted 30 October, 2018; originally announced October 2018.

arXiv:1810.10687 [pdf, other]

Structure Learning of Deep Networks via DNA Computing Algorithm

Authors: Guoqiang Zhong, Tao Li, Wenxue Liu, Yang Chen

Abstract: Convolutional Neural Network (CNN) has gained state-of-the-art results in many pattern recognition and computer vision tasks. However, most of the CNN structures are manually designed by experienced researchers. Therefore, auto- matically building high performance networks becomes an important problem. In this paper, we introduce the idea of using DNA computing algorithm to automatically learn hig… ▽ More Convolutional Neural Network (CNN) has gained state-of-the-art results in many pattern recognition and computer vision tasks. However, most of the CNN structures are manually designed by experienced researchers. Therefore, auto- matically building high performance networks becomes an important problem. In this paper, we introduce the idea of using DNA computing algorithm to automatically learn high-performance architectures. In DNA computing algorithm, we use short DNA strands to represent layers and long DNA strands to represent overall networks. We found that most of the learned models perform similarly, and only those performing worse during the first runs of training will perform worse finally than others. The indicates that: 1) Using DNA computing algorithm to learn deep architectures is feasible; 2) Local minima should not be a problem of deep networks; 3) We can use early stop to kill the models with the bad performance just after several runs of training. In our experiments, an accuracy 99.73% was obtained on the MNIST data set and an accuracy 95.10% was obtained on the CIFAR-10 data set. △ Less

Submitted 24 October, 2018; originally announced October 2018.

arXiv:1810.07550 [pdf]

The Newton Scheme for Deep Learning

Authors: Junqing Qiu, Guoren Zhong, Yihua Lu, Kun Xin, Huihuan Qian, Xi Zhu

Abstract: We introduce a neural network (NN) strictly governed by Newton's Law, with the nature required basis functions derived from the fundamental classic mechanics. Then, by classifying the training model as a quick procedure of 'force pattern' recognition, we developed the Newton physics-based NS scheme. Once the force pattern is confirmed, the neuro network simply does the checking of the 'pattern sta… ▽ More We introduce a neural network (NN) strictly governed by Newton's Law, with the nature required basis functions derived from the fundamental classic mechanics. Then, by classifying the training model as a quick procedure of 'force pattern' recognition, we developed the Newton physics-based NS scheme. Once the force pattern is confirmed, the neuro network simply does the checking of the 'pattern stability' instead of the continuous fitting by computational resource consuming big data-driven processing. In the given physics's law system, once the field is confirmed, the mathematics bases for the force field description actually are not diverged but denumerable, which can save the function representations from the exhaustible available mathematics bases. In this work, we endorsed Newton's Law into the deep learning technology and proposed Newton Scheme (NS). Under NS, the user first identifies the path pattern, like the constant acceleration movement.The object recognition technology first loads mass information, then, the NS finds the matched physical pattern and describe and predict the trajectory of the movements with nearly zero error. We compare the major contribution of this NS with the TCN, GRU and other physics inspired 'FIND-PDE' methods to demonstrate fundamental and extended applications of how the NS works for the free-falling, pendulum and curve soccer balls.The NS methodology provides more opportunity for the future deep learning advances. △ Less

Submitted 16 October, 2018; originally announced October 2018.

Comments: 7 pages, 10 figures

arXiv:1807.03923 [pdf, other]

Generative Adversarial Networks with Decoder-Encoder Output Noise

Authors: Guoqiang Zhong, Wei Gao, Yongbin Liu, Youzhao Yang

Abstract: In recent years, research on image generation methods has been developing fast. The auto-encoding variational Bayes method (VAEs) was proposed in 2013, which uses variational inference to learn a latent space from the image database and then generates images using the decoder. The generative adversarial networks (GANs) came out as a promising framework, which uses adversarial training to improve t… ▽ More In recent years, research on image generation methods has been developing fast. The auto-encoding variational Bayes method (VAEs) was proposed in 2013, which uses variational inference to learn a latent space from the image database and then generates images using the decoder. The generative adversarial networks (GANs) came out as a promising framework, which uses adversarial training to improve the generative ability of the generator. However, the generated pictures by GANs are generally blurry. The deep convolutional generative adversarial networks (DCGANs) were then proposed to leverage the quality of generated images. Since the input noise vectors are randomly sampled from a Gaussian distribution, the generator has to map from a whole normal distribution to the images. This makes DCGANs unable to reflect the inherent structure of the training data. In this paper, we propose a novel deep model, called generative adversarial networks with decoder-encoder output noise (DE-GANs), which takes advantage of both the adversarial training and the variational Bayesain inference to improve the performance of image generation. DE-GANs use a pre-trained decoder-encoder architecture to map the random Gaussian noise vectors to informative ones and pass them to the generator of the adversarial networks. Since the decoder-encoder architecture is trained by the same images as the generators, the output vectors could carry the intrinsic distribution information of the original images. Moreover, the loss function of DE-GANs is different from GANs and DCGANs. A hidden-space loss function is added to the adversarial loss function to enhance the robustness of the model. Extensive empirical results show that DE-GANs can accelerate the convergence of the adversarial training process and improve the quality of the generated images. △ Less

Submitted 10 July, 2018; originally announced July 2018.

arXiv:1804.08758 [pdf, other]

Switchable Temporal Propagation Network

Authors: Sifei Liu, Guangyu Zhong, Shalini De Mello, Jinwei Gu, Varun Jampani, Ming-Hsuan Yang, Jan Kautz

Abstract: Videos contain highly redundant information between frames. Such redundancy has been extensively studied in video compression and encoding, but is less explored for more advanced video processing. In this paper, we propose a learnable unified framework for propagating a variety of visual properties of video images, including but not limited to color, high dynamic range (HDR), and segmentation info… ▽ More Videos contain highly redundant information between frames. Such redundancy has been extensively studied in video compression and encoding, but is less explored for more advanced video processing. In this paper, we propose a learnable unified framework for propagating a variety of visual properties of video images, including but not limited to color, high dynamic range (HDR), and segmentation information, where the properties are available for only a few key-frames. Our approach is based on a temporal propagation network (TPN), which models the transition-related affinity between a pair of frames in a purely data-driven manner. We theoretically prove two essential factors for TPN: (a) by regularizing the global transformation matrix as orthogonal, the "style energy" of the property can be well preserved during propagation; (b) such regularization can be achieved by the proposed switchable TPN with bi-directional training on pairs of frames. We apply the switchable TPN to three tasks: colorizing a gray-scale video based on a few color key-frames, generating an HDR video from a low dynamic range (LDR) video and a few HDR frames, and propagating a segmentation mask from the first frame in videos. Experimental results show that our approach is significantly more accurate and efficient than the state-of-the-art methods. △ Less

Submitted 4 May, 2018; v1 submitted 23 April, 2018; originally announced April 2018.

arXiv:1804.00706 [pdf, other]

doi 10.1145/3301278

Synergy: A HW/SW Framework for High Throughput CNNs on Embedded Heterogeneous SoC

Authors: Guanwen Zhong, Akshat Dubey, Tan Cheng, Tulika Mitra

Abstract: Convolutional Neural Networks (CNN) have been widely deployed in diverse application domains. There has been significant progress in accelerating both their training and inference using high-performance GPUs, FPGAs, and custom ASICs for datacenter-scale environments. The recent proliferation of mobile and IoT devices have necessitated real-time, energy-efficient deep neural network inference on em… ▽ More Convolutional Neural Networks (CNN) have been widely deployed in diverse application domains. There has been significant progress in accelerating both their training and inference using high-performance GPUs, FPGAs, and custom ASICs for datacenter-scale environments. The recent proliferation of mobile and IoT devices have necessitated real-time, energy-efficient deep neural network inference on embedded-class, resource-constrained platforms. In this context, we present {\em Synergy}, an automated, hardware-software co-designed, pipelined, high-throughput CNN inference framework on embedded heterogeneous system-on-chip (SoC) architectures (Xilinx Zynq). {\em Synergy} leverages, through multi-threading, all the available on-chip resources, which includes the dual-core ARM processor along with the FPGA and the NEON SIMD engines as accelerators. Moreover, {\em Synergy} provides a unified abstraction of the heterogeneous accelerators (FPGA and NEON) and can adapt to different network configurations at runtime without changing the underlying hardware accelerator architecture by balancing workload across accelerators through work-stealing. {\em Synergy} achieves 7.3X speedup, averaged across seven CNN models, over a well-optimized software-only solution. {\em Synergy} demonstrates substantially better throughput and energy-efficiency compared to the contemporary CNN implementations on the same SoC architecture. △ Less

Submitted 28 March, 2018; originally announced April 2018.

Comments: 34 pages, submitted to ACM Transactions on Embedded Computing Systems (TECS)

ACM Class: C.1.3

Journal ref: TECS, 18 (2019) 13-39

arXiv:1801.10281 [pdf, other]

Learning Video-Story Composition via Recurrent Neural Network

Authors: Guangyu Zhong, Yi-Hsuan Tsai, Sifei Liu, Zhixun Su, Ming-Hsuan Yang

Abstract: In this paper, we propose a learning-based method to compose a video-story from a group of video clips that describe an activity or experience. We learn the coherence between video clips from real videos via the Recurrent Neural Network (RNN) that jointly incorporates the spatial-temporal semantics and motion dynamics to generate smooth and relevant compositions. We further rearrange the results g… ▽ More In this paper, we propose a learning-based method to compose a video-story from a group of video clips that describe an activity or experience. We learn the coherence between video clips from real videos via the Recurrent Neural Network (RNN) that jointly incorporates the spatial-temporal semantics and motion dynamics to generate smooth and relevant compositions. We further rearrange the results generated by the RNN to make the overall video-story compatible with the storyline structure via a submodular ranking optimization process. Experimental results on the video-story dataset show that the proposed algorithm outperforms the state-of-the-art approach. △ Less

Submitted 30 January, 2018; originally announced January 2018.

arXiv:1710.01020 [pdf, other]

Learning Affinity via Spatial Propagation Networks

Authors: Sifei Liu, Shalini De Mello, Jinwei Gu, Guangyu Zhong, Ming-Hsuan Yang, Jan Kautz

Abstract: In this paper, we propose spatial propagation networks for learning the affinity matrix for vision tasks. We show that by constructing a row/column linear propagation model, the spatially varying transformation matrix exactly constitutes an affinity matrix that models dense, global pairwise relationships of an image. Specifically, we develop a three-way connection for the linear propagation model,… ▽ More In this paper, we propose spatial propagation networks for learning the affinity matrix for vision tasks. We show that by constructing a row/column linear propagation model, the spatially varying transformation matrix exactly constitutes an affinity matrix that models dense, global pairwise relationships of an image. Specifically, we develop a three-way connection for the linear propagation model, which (a) formulates a sparse transformation matrix, where all elements can be the output from a deep CNN, but (b) results in a dense affinity matrix that effectively models any task-specific pairwise similarity matrix. Instead of designing the similarity kernels according to image features of two points, we can directly output all the similarities in a purely data-driven manner. The spatial propagation network is a generic framework that can be applied to many affinity-related tasks, including but not limited to image matting, segmentation and colorization, to name a few. Essentially, the model can learn semantically-aware affinity values for high-level vision tasks due to the powerful learning capability of the deep neural network classifier. We validate the framework on the task of refinement for image segmentation boundaries. Experiments on the HELEN face parsing and PASCAL VOC-2012 semantic segmentation tasks show that the spatial propagation network provides a general, effective and efficient solution for generating high-quality segmentation results. △ Less

Submitted 3 October, 2017; originally announced October 2017.

Comments: A long version of NIPS 2017

arXiv:1708.02191 [pdf, other]

Unsupervised Domain Adaptation for Face Recognition in Unlabeled Videos

Authors: Kihyuk Sohn, Sifei Liu, Guangyu Zhong, Xiang Yu, Ming-Hsuan Yang, Manmohan Chandraker

Abstract: Despite rapid advances in face recognition, there remains a clear gap between the performance of still image-based face recognition and video-based face recognition, due to the vast difference in visual quality between the domains and the difficulty of curating diverse large-scale video datasets. This paper addresses both of those challenges, through an image to video feature-level domain adaptati… ▽ More Despite rapid advances in face recognition, there remains a clear gap between the performance of still image-based face recognition and video-based face recognition, due to the vast difference in visual quality between the domains and the difficulty of curating diverse large-scale video datasets. This paper addresses both of those challenges, through an image to video feature-level domain adaptation approach, to learn discriminative video frame representations. The framework utilizes large-scale unlabeled video data to reduce the gap between different domains while transferring discriminative knowledge from large-scale labeled still images. Given a face recognition network that is pretrained in the image domain, the adaptation is achieved by (i) distilling knowledge from the network to a video adaptation network through feature matching, (ii) performing feature restoration through synthetic data augmentation and (iii) learning a domain-invariant feature through a domain adversarial discriminator. We further improve performance through a discriminator-guided feature fusion that boosts high-quality frames while eliminating those degraded by video domain-specific factors. Experiments on the YouTube Faces and IJB-A datasets demonstrate that each module contributes to our feature-level domain adaptation framework and substantially improves video face recognition performance to achieve state-of-the-art accuracy. We demonstrate qualitatively that the network learns to suppress diverse artifacts in videos such as pose, illumination or occlusion without being explicitly trained for them. △ Less

Submitted 7 August, 2017; originally announced August 2017.

Comments: accepted for publication at International Conference on Computer Vision (ICCV) 2017

arXiv:1705.06861 [pdf, other]

doi 10.1109/LGRS.2017.2733548

Prediction of Sea Surface Temperature using Long Short-Term Memory

Authors: Qin Zhang, Hui Wang, Junyu Dong, Guoqiang Zhong, Xin Sun

Abstract: This letter adopts long short-term memory(LSTM) to predict sea surface temperature(SST), which is the first attempt, to our knowledge, to use recurrent neural network to solve the problem of SST prediction, and to make one week and one month daily prediction. We formulate the SST prediction problem as a time series regression problem. LSTM is a special kind of recurrent neural network, which intro… ▽ More This letter adopts long short-term memory(LSTM) to predict sea surface temperature(SST), which is the first attempt, to our knowledge, to use recurrent neural network to solve the problem of SST prediction, and to make one week and one month daily prediction. We formulate the SST prediction problem as a time series regression problem. LSTM is a special kind of recurrent neural network, which introduces gate mechanism into vanilla RNN to prevent the vanished or exploding gradient problem. It has strong ability to model the temporal relationship of time series data and can handle the long-term dependency problem well. The proposed network architecture is composed of two kinds of layers: LSTM layer and full-connected dense layer. LSTM layer is utilized to model the time series relationship. Full-connected layer is utilized to map the output of LSTM layer to a final prediction. We explore the optimal setting of this architecture by experiments and report the accuracy of coastal seas of China to confirm the effectiveness of the proposed method. In addition, we also show its online updated characteristics. △ Less

Submitted 19 May, 2017; originally announced May 2017.

Comments: 5 page, 5 figures

arXiv:1703.09784 [pdf, other]

Perception Driven Texture Generation

Authors: Yanhai Gan, Huifang Chi, Ying Gao, Jun Liu, Guoqiang Zhong, Junyu Dong

Abstract: This paper investigates a novel task of generating texture images from perceptual descriptions. Previous work on texture generation focused on either synthesis from examples or generation from procedural models. Generating textures from perceptual attributes have not been well studied yet. Meanwhile, perceptual attributes, such as directionality, regularity and roughness are important factors for… ▽ More This paper investigates a novel task of generating texture images from perceptual descriptions. Previous work on texture generation focused on either synthesis from examples or generation from procedural models. Generating textures from perceptual attributes have not been well studied yet. Meanwhile, perceptual attributes, such as directionality, regularity and roughness are important factors for human observers to describe a texture. In this paper, we propose a joint deep network model that combines adversarial training and perceptual feature regression for texture generation, while only random noise and user-defined perceptual attributes are required as input. In this model, a preliminary trained convolutional neural network is essentially integrated with the adversarial framework, which can drive the generated textures to possess given perceptual attributes. An important aspect of the proposed model is that, if we change one of the input perceptual features, the corresponding appearance of the generated textures will also be changed. We design several experiments to validate the effectiveness of the proposed method. The results show that the proposed method can produce high quality texture images with desired perceptual properties. △ Less

Submitted 23 March, 2017; originally announced March 2017.

Comments: 7 pages, 4 figures, icme2017

arXiv:1611.08331 [pdf, ps, other]

An Overview on Data Representation Learning: From Traditional Feature Learning to Recent Deep Learning

Authors: Guoqiang Zhong, Li-Na Wang, Junyu Dong

Abstract: Since about 100 years ago, to learn the intrinsic structure of data, many representation learning approaches have been proposed, including both linear ones and nonlinear ones, supervised ones and unsupervised ones. Particularly, deep architectures are widely applied for representation learning in recent years, and have delivered top results in many tasks, such as image classification, object detec… ▽ More Since about 100 years ago, to learn the intrinsic structure of data, many representation learning approaches have been proposed, including both linear ones and nonlinear ones, supervised ones and unsupervised ones. Particularly, deep architectures are widely applied for representation learning in recent years, and have delivered top results in many tasks, such as image classification, object detection and speech recognition. In this paper, we review the development of data representation learning methods. Specifically, we investigate both traditional feature learning algorithms and state-of-the-art deep learning models. The history of data representation learning is introduced, while available resources (e.g. online course, tutorial and book information) and toolboxes are provided. Finally, we conclude this paper with remarks and some interesting research directions on data representation learning. △ Less

Submitted 24 November, 2016; originally announced November 2016.

Comments: About 20 pages. Submitted to Journal of Finance and Data Science as an invited paper

MSC Class: 68T05

arXiv:1507.06105 [pdf, other]

Banzhaf Random Forests

Authors: Jianyuan Sun, Guoqiang Zhong, Junyu Dong, Yajuan Cai

Abstract: Random forests are a type of ensemble method which makes predictions by combining the results of several independent trees. However, the theory of random forests has long been outpaced by their application. In this paper, we propose a novel random forests algorithm based on cooperative game theory. Banzhaf power index is employed to evaluate the power of each feature by traversing possible feature… ▽ More Random forests are a type of ensemble method which makes predictions by combining the results of several independent trees. However, the theory of random forests has long been outpaced by their application. In this paper, we propose a novel random forests algorithm based on cooperative game theory. Banzhaf power index is employed to evaluate the power of each feature by traversing possible feature coalitions. Unlike the previously used information gain rate of information theory, which simply chooses the most informative feature, the Banzhaf power index can be considered as a metric of the importance of each feature on the dependency among a group of features. More importantly, we have proved the consistency of the proposed algorithm, named Banzhaf random forests (BRF). This theoretical analysis takes a step towards narrowing the gap between the theory and practice of random forests for classification problems. Experiments on several UCI benchmark data sets show that BRF is competitive with state-of-the-art classifiers and dramatically outperforms previous consistent random forests. Particularly, it is much more efficient than previous consistent random forests. △ Less

Submitted 22 July, 2015; originally announced July 2015.

Comments: arXiv admin note: text overlap with arXiv:1302.4853 by other authors

arXiv:1507.04437 [pdf, ps, other]

A Deep Hashing Learning Network

Authors: Guoqiang Zhong, Pan Yang, Sijiang Wang, Junyu Dong

Abstract: Hashing-based methods seek compact and efficient binary codes that preserve the neighborhood structure in the original data space. For most existing hashing methods, an image is first encoded as a vector of hand-crafted visual feature, followed by a hash projection and quantization step to get the compact binary vector. Most of the hand-crafted features just encode the low-level information of the… ▽ More Hashing-based methods seek compact and efficient binary codes that preserve the neighborhood structure in the original data space. For most existing hashing methods, an image is first encoded as a vector of hand-crafted visual feature, followed by a hash projection and quantization step to get the compact binary vector. Most of the hand-crafted features just encode the low-level information of the input, the feature may not preserve the semantic similarities of images pairs. Meanwhile, the hashing function learning process is independent with the feature representation, so the feature may not be optimal for the hashing projection. In this paper, we propose a supervised hashing method based on a well designed deep convolutional neural network, which tries to learn hashing code and compact representations of data simultaneously. The proposed model learn the binary codes by adding a compact sigmoid layer before the loss layer. Experiments on several image data sets show that the proposed model outperforms other state-of-the-art methods. △ Less

Submitted 15 July, 2015; originally announced July 2015.

Comments: 7 pages, 5 figures

arXiv:1505.03703 [pdf, other]

A PCA-Based Convolutional Network

Authors: Yanhai Gan, Jun Liu, Junyu Dong, Guoqiang Zhong

Abstract: In this paper, we propose a novel unsupervised deep learning model, called PCA-based Convolutional Network (PCN). The architecture of PCN is composed of several feature extraction stages and a nonlinear output stage. Particularly, each feature extraction stage includes two layers: a convolutional layer and a feature pooling layer. In the convolutional layer, the filter banks are simply learned by… ▽ More In this paper, we propose a novel unsupervised deep learning model, called PCA-based Convolutional Network (PCN). The architecture of PCN is composed of several feature extraction stages and a nonlinear output stage. Particularly, each feature extraction stage includes two layers: a convolutional layer and a feature pooling layer. In the convolutional layer, the filter banks are simply learned by PCA. In the nonlinear output stage, binary hashing is applied. For the higher convolutional layers, the filter banks are learned from the feature maps that were obtained in the previous stage. To test PCN, we conducted extensive experiments on some challenging tasks, including handwritten digits recognition, face recognition and texture classification. The results show that PCN performs competitive with or even better than state-of-the-art deep learning models. More importantly, since there is no back propagation for supervised finetuning, PCN is much more efficient than existing deep networks. △ Less

Submitted 14 May, 2015; originally announced May 2015.

Comments: 8 pages,5 figures

arXiv:1306.2663 [pdf, other]

Large Margin Low Rank Tensor Analysis

Authors: Guoqiang Zhong, Mohamed Cheriet

Abstract: Other than vector representations, the direct objects of human cognition are generally high-order tensors, such as 2D images and 3D textures. From this fact, two interesting questions naturally arise: How does the human brain represent these tensor perceptions in a "manifold" way, and how can they be recognized on the "manifold"? In this paper, we present a supervised model to learn the intrinsic… ▽ More Other than vector representations, the direct objects of human cognition are generally high-order tensors, such as 2D images and 3D textures. From this fact, two interesting questions naturally arise: How does the human brain represent these tensor perceptions in a "manifold" way, and how can they be recognized on the "manifold"? In this paper, we present a supervised model to learn the intrinsic structure of the tensors embedded in a high dimensional Euclidean space. With the fixed point continuation procedures, our model automatically and jointly discovers the optimal dimensionality and the representations of the low dimensional embeddings. This makes it an effective simulation of the cognitive process of human brain. Furthermore, the generalization of our model based on similarity between the learned low dimensional embeddings can be viewed as counterpart of recognition of human brain. Experiments on applications for object recognition and face recognition demonstrate the superiority of our proposed model over state-of-the-art approaches. △ Less

Submitted 11 June, 2013; originally announced June 2013.

Comments: 30 pages

MSC Class: 57-04

Showing 1–37 of 37 results for author: Zhong, G