Search | arXiv e-print repository

Self-supervised Topic Taxonomy Discovery in the Box Embedding Space

Authors: Yuyin Lu, Hegang Chen, Pengbo Mao, Yanghui Rao, Haoran Xie, Fu Lee Wang, Qing Li

Abstract: Topic taxonomy discovery aims at uncovering topics of different abstraction levels and constructing hierarchical relations between them. Unfortunately, most prior work can hardly model the semantic scopes of words and topics because it relies on the Euclidean embedding space assumption. What's worse, such methods infer asymmetric hierarchical relations from symmetric distances between topic embeddings. As a result, existing methods suffer from low-quality topics at high abstraction levels and inaccurate hierarchical relations. To alleviate these problems, this paper develops a Box embedding-based Topic Model (BoxTM) that maps words and topics into the box embedding space, where an asymmetric metric is defined to properly infer hierarchical relations among topics. Additionally, BoxTM explicitly infers upper-level topics based on correlation between specific topics through recursive clustering on topic boxes. Finally, extensive experiments validate the high quality of the topic taxonomy learned by BoxTM.

Submitted 27 August, 2024; originally announced August 2024.

Comments: to be published in TACL

arXiv:2408.00754 [pdf, other]

Coarse Correspondence Elicit 3D Spacetime Understanding in Multimodal Language Model

Authors: Benlin Liu, Yuhao Dong, Yiqin Wang, Yongming Rao, Yansong Tang, Wei-Chiu Ma, Ranjay Krishna

Abstract: Multimodal language models (MLLMs) are increasingly being implemented in real-world environments, necessitating their ability to interpret 3D spaces and comprehend temporal dynamics. Despite their potential, current top models within our community still fall short in adequately understanding spatial and temporal dimensions. We introduce Coarse Correspondence, a simple, training-free, effective, and general-purpose visual prompting method to elicit 3D and temporal understanding in multimodal LLMs. Our method uses a lightweight tracking model to find object correspondences between frames in a video or between sets of image viewpoints. It selects the most frequent object instances and visualizes them with markers with unique IDs in the image. With this simple approach, we achieve state-of-the-art results on 3D understanding benchmarks including ScanQA (+20.5%) and a subset of OpenEQA (+9.7%), and on long-form video benchmarks such as EgoSchema (+6.0%). We also curate a small diagnostic dataset to evaluate whether MLLMs can reason about space from a described viewpoint other than the camera viewpoint. Again, Coarse Correspondence improves spatial perspective-taking abilities but we highlight that MLLMs struggle with this task. Together, we demonstrate that our simple prompting method can significantly aid downstream tasks that require 3D or temporal reasoning.

Submitted 1 August, 2024; originally announced August 2024.

Comments: project page: https://coarse-correspondence.github.io

arXiv:2407.18121 [pdf, other]

Efficient Inference of Vision Instruction-Following Models with Elastic Cache

Authors: Zuyan Liu, Benlin Liu, Jiahui Wang, Yuhao Dong, Guangyi Chen, Yongming Rao, Ranjay Krishna, Jiwen Lu

Abstract: In the field of instruction-following large vision-language models (LVLMs), the efficient deployment of these models faces challenges, notably due to the high memory demands of their key-value (KV) caches. Conventional cache management strategies for LLMs focus on cache eviction, which often fails to address the specific needs of multimodal instruction-following models. Recognizing this gap, in this paper, we introduce Elastic Cache, a novel approach that benefits from applying distinct acceleration methods for instruction encoding and output generation stages. We investigate the metrics of importance in different stages and propose an importance-driven cache merging strategy to prune redundant caches. Instead of discarding less important caches, our strategy identifies important key/value vectors as anchor points. Surrounding less important caches are then merged with these anchors, enhancing the preservation of contextual information in the KV caches while yielding an arbitrary acceleration ratio. For instruction encoding, we utilize the frequency to evaluate the importance of caches. Regarding output generation, we prioritize tokens based on their distance with an offset, by which both the initial and most recent tokens are retained. Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation across various tasks. Code is available at https://github.com/liuzuyan/ElasticCache

Submitted 25 July, 2024; originally announced July 2024.

Comments: Accepted to ECCV 2024

arXiv:2407.16883 [pdf, other]

A Standardized Machine-readable Dataset Documentation Format for Responsible AI

Authors: Nitisha Jain, Mubashara Akhtar, Joan Giner-Miguelez, Rajat Shinde, Joaquin Vanschoren, Steffen Vogler, Sujata Goswami, Yuhan Rao, Tim Santos, Luis Oala, Michalis Karamousadakis, Manil Maskey, Pierre Marcenac, Costanza Conforti, Michael Kuchnik, Lora Aroyo, Omar Benjelloun, Elena Simperl

Abstract: Data is critical to advancing AI technologies, yet its quality and documentation remain significant challenges, leading to adverse downstream effects (e.g., potential biases) in AI applications. This paper addresses these issues by introducing Croissant-RAI, a machine-readable metadata format designed to enhance the discoverability, interoperability, and trustworthiness of AI datasets. Croissant-RAI extends the Croissant metadata format and builds upon existing responsible AI (RAI) documentation frameworks, offering a standardized set of attributes and practices to facilitate community-wide adoption. Leveraging established web-publishing practices, such as Schema.org, Croissant-RAI enables dataset users to easily find and utilize RAI metadata regardless of the platform on which the datasets are published. Furthermore, it is seamlessly integrated into major data search engines, repositories, and machine learning frameworks, streamlining the reading and writing of responsible AI metadata within practitioners' existing workflows. Croissant-RAI was developed through a community-led effort. It has been designed to be adaptable to evolving documentation requirements and is supported by a Python library and a visual editor.

Submitted 4 June, 2024; originally announced July 2024.

Comments: 10 pages, appendix

arXiv:2407.08349 [pdf]

Spine Vision X-Ray Image based GUI Planning of Pedicle Screws Using Enhanced YOLOv5 for Vertebrae Segmentation

Authors: Yashwanth Rao, Gaurisankar S, Durga R, Aparna Purayath, Vivek Maik, Manojkumar Lakshmanan, Mohanasankar Sivaprakasm

Abstract: In this paper, we propose an innovative Graphical User Interface (GUI) aimed at improving preoperative planning and intra-operative guidance for precise spinal screw placement through vertebrae segmentation. The methodology encompasses both front-end and back-end computations. The front end comprises a GUI that allows surgeons to precisely adjust the placement of screws on X-Ray images, thereby improving the simulation of surgical screw insertion in the patient's spine. On the other hand, the back-end processing involves several steps, including acquiring spinal X-ray images, performing pre-processing techniques to reduce noise, and training a neural network model to achieve real-time segmentation of the vertebrae. The integration of vertebral segmentation in the GUI ensures precise screw placement, reducing complications like nerve injury and ultimately improving surgical outcomes. The Spine-Vision provides a comprehensive solution with innovative features like synchronous AP-LP planning, accurate screw positioning via vertebrae segmentation, effective screw visualization, and dynamic position adjustments. This X-ray image-based GUI workflow emerges as a valuable tool, enhancing precision and safety in spinal screw placement and planning procedures.

Submitted 11 July, 2024; originally announced July 2024.

arXiv:2405.17934 [pdf, other]

Proof of Quality: A Costless Paradigm for Trustless Generative AI Model Inference on Blockchains

Authors: Zhenjie Zhang, Yuyang Rao, Hao Xiao, Xiaokui Xiao, Yin Yang

Abstract: Generative AI models, such as GPT-4 and Stable Diffusion, have demonstrated powerful and disruptive capabilities in natural language and image tasks. However, deploying these models in decentralized environments remains challenging. Unlike traditional centralized deployment, systematically guaranteeing the integrity of AI model services in fully decentralized environments, particularly on trustless blockchains, is both crucial and difficult. In this paper, we present a new inference paradigm called proof of quality (PoQ) to enable the deployment of arbitrarily large generative models on blockchain architecture. Unlike traditional approaches based on validating inference procedures, such as ZKML or OPML, our PoQ paradigm focuses on the outcome quality of model inference. Using lightweight BERT-based cross-encoders as our underlying quality evaluation model, we design and implement PQML, the first practical protocol for real-world NLP generative model inference on blockchains, tailored for popular open-source models such as Llama 3 and Mixtral. Our analysis demonstrates that our protocol is robust against adversarial but rational participants in ecosystems, where lazy or dishonest behavior results in fewer benefits compared to well-behaving participants. The computational overhead of validating the quality evaluation is minimal, allowing quality validators to complete the quality check within a second, even using only a CPU. Preliminary simulation results show that PoQ consensus is generated in milliseconds, 1,000 times faster than any existing scheme.

Submitted 30 May, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

Comments: 12 pages, 5 figures

arXiv:2404.19613 [pdf]

High-throughput discovery of metal oxides with high thermoelectric performance via interpretable feature engineering on small data

Authors: Shengluo Ma, Yongchao Rao, Xiang Huang, Shenghong Ju

Abstract: In this work, we propose a data-driven screening framework combining interpretable machine learning with high-throughput calculations to identify a series of metal oxides that exhibit both high-temperature tolerance and high power factors. To address the weak generalization ability of models trained on small data sets of power factors at high temperatures, we employ symbolic regression for feature creation, which enhances the robustness of the model while preserving the physical meaning of features. 33 candidate metal oxides are finally targeted for high-temperature thermoelectric applications from a pool of 48,694 compounds in the Materials Project database. The Boltzmann transport theory is utilized to perform electrical transport properties calculations at 1,000 K. The relaxation time is approximated by employing constant electron-phonon coupling based on the deformation potential theory. Considering band degeneracy, the electron group velocity is obtained using the momentum matrix element method, yielding 28 materials with power factors greater than 50 $\mu$W cm$^{-1}$ K$^{-2}$. The high-throughput framework we propose is instrumental in the selection of metal oxides for high-temperature thermoelectric applications. Furthermore, our data-driven analysis and transport calculations suggest that metal oxides rich in elements such as cerium (Ce), tin (Sn), and lead (Pb) tend to exhibit high power factors at high temperatures.

Submitted 30 April, 2024; originally announced April 2024.

arXiv:2404.15010 [pdf, other]

X-3D: Explicit 3D Structure Modeling for Point Cloud Recognition

Authors: Shuofeng Sun, Yongming Rao, Jiwen Lu, Haibin Yan

Abstract: Numerous prior studies predominantly emphasize constructing relation vectors for individual neighborhood points, generating dynamic kernels for each vector, and embedding these into high-dimensional spaces to capture implicit local structures. However, we contend that such an implicit high-dimensional structure modeling approach inadequately represents the local geometric structure of point clouds due to the absence of explicit structural information. Hence, we introduce X-3D, an explicit 3D structure modeling approach. X-3D functions by capturing the explicit local structural information within the input 3D space and employing it to produce dynamic kernels with shared weights for all neighborhood points within the current local region. This modeling approach introduces an effective geometric prior and significantly diminishes the disparity between the local structure of the embedding space and the original input point cloud, thereby improving the extraction of local features. Experiments show that our method can be used on a variety of methods and achieves state-of-the-art performance on segmentation, classification, and detection tasks with lower extra computational cost, such as 90.7% on ScanObjectNN for classification, 79.2% on S3DIS 6 fold and 74.3% on S3DIS Area 5 for segmentation, 76.3% on ScanNetV2 for segmentation and 64.5% mAP, 46.9% mAP on SUN RGB-D and 69.0% mAP, 51.1% mAP on ScanNetV2. Our code is available at https://github.com/sunshuofeng/X-3D.

Submitted 23 April, 2024; originally announced April 2024.

Journal ref: The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024

arXiv:2403.13276 [pdf]

All-magnonic repeater based on bistability

Authors: Qi Wang, Roman Verba, Kristyna Davidkova, Bjorn Heinz, Shixian Tian, Yiheng Rao, Mengying Guo, Xueyu Guo, Carsten Dubs, Philipp Pirro, Andrii V. Chumak

Abstract: Bistability, a universal phenomenon found in diverse fields such as biology, chemistry, and physics, describes a scenario in which a system has two stable equilibrium states and resets to one of the two states. The ability to switch between these two states is the basis for a wide range of applications, particularly in memory and logic operations. Here, we present a universal approach to achieve bistable switching in magnonics, the field processing data using spin waves. As an exemplary application, we use magnonic bistability to experimentally demonstrate the still missing magnonic repeater. A pronounced bistable window is observed in a 1 μm wide magnonic conduit under an external rf drive characterized by two magnonic stable states defined as low and high spin-wave amplitudes. The switching between these two states is realized by another propagating spin wave sent into the rf driven region. This magnonic bistable switching is used to design the magnonic repeater, which receives the original decayed and distorted spin wave and regenerates a new spin wave with amplified amplitude and normalized phase. Our magnonic repeater is proposed to be installed at the inputs of each magnonic logic gate to overcome the spin-wave amplitude degradation and phase distortion during previous propagation and achieve integrated magnonic circuits or magnonic neuromorphic networks.

Submitted 19 March, 2024; originally announced March 2024.

Comments: 13 pages, 4 figures

arXiv:2403.12966 [pdf, other]

Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models

Authors: Zuyan Liu, Yuhao Dong, Yongming Rao, Jie Zhou, Jiwen Lu

Abstract: In the realm of vision-language understanding, the proficiency of models in interpreting and reasoning over visual content has become a cornerstone for numerous applications. However, it is challenging for the visual encoder in Large Vision-Language Models (LVLMs) to extract useful features tailored to questions that aid the language model's response. Furthermore, a common practice among existing LVLMs is to utilize lower-resolution images, which restricts the ability for visual recognition. Our work introduces the Chain-of-Spot (CoS) method, which we describe as Interactive Reasoning, a novel approach that enhances feature extraction by focusing on key regions of interest (ROI) within the image, corresponding to the posed questions or instructions. This technique allows LVLMs to access more detailed visual information without altering the original image resolution, thereby offering multi-granularity image features. By integrating Chain-of-Spot with instruct-following LLaVA-1.5 models, the process of image reasoning consistently improves performance across a wide range of multimodal datasets and benchmarks without bells and whistles and achieves new state-of-the-art results. Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content, paving the way for more sophisticated visual instruction-following applications. Code and models are available at https://github.com/dongyh20/Chain-of-Spot

Submitted 21 March, 2024; v1 submitted 19 March, 2024; originally announced March 2024.

Comments: Project Page: https://sites.google.com/view/chain-of-spot/

arXiv:2312.13286 [pdf, other]

Generative Multimodal Models are In-Context Learners

Authors: Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, Xinlong Wang

Abstract: The human ability to easily solve multimodal tasks in context (i.e., with only a few demonstrations or simple instructions) is what current multimodal systems have largely struggled to imitate. In this work, we demonstrate that the task-agnostic in-context learning capabilities of large multimodal models can be significantly enhanced by effective scaling-up. We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences with a unified autoregressive objective. Emu2 exhibits strong multimodal in-context learning abilities, even emerging to solve tasks that require on-the-fly reasoning, such as visual prompting and object-grounded generation. The model sets a new record on multiple multimodal understanding tasks in few-shot settings. When instruction-tuned to follow specific instructions, Emu2 further achieves new state-of-the-art on challenging tasks such as question answering benchmarks for large multimodal models and open-ended subject-driven generation. These achievements demonstrate that Emu2 can serve as a base model and general-purpose interface for a wide range of multimodal tasks. Code and models are publicly available to facilitate future research.

Submitted 7 May, 2024; v1 submitted 20 December, 2023; originally announced December 2023.

Comments: Accepted to CVPR 2024. Project page: https://baaivision.github.io/emu2

arXiv:2312.10898 [pdf]

Replica symmetry breaking in 1D Rayleigh scattering system: theory and validations

Authors: Yifei Qi, Longqun Ni, Zhenyu Ye, Jiaojiao Zhang, Xingyu Bao, Pan Wang, Yunjiang Rao, Ernesto P. Raposo, Anderson S. L. Gomes, Zinan Wang

Abstract: Spin glass theory, as a paradigm for describing disordered magnetic systems, constitutes a prominent subject of study within statistical physics. Replica symmetry breaking (RSB), one of the pivotal concepts for understanding spin glass theory, means that, under identical conditions, disordered systems can yield distinct states with nontrivial correlations. Random fiber laser (RFL) based on Rayleigh scattering (RS) is a complex disordered system, owing to the disorder and stochasticity of RS. In this work, for the first time, we elaborate a precise theoretical model for studying the photonic phase transition via the platform of RS-based RFL, in which we clearly reveal that, apart from the pump power, the photon phase variation in RFL is also analogous to the temperature term in spin glass phase transition, leading to a novel insight into the intrinsic mechanisms of photonic phase transition. In addition, based on this model and real-time high-fidelity detection of spectral evolution, we theoretically predict and experimentally observe the mode-asymmetric characteristics of photonic phase transition in RS-based RFL. This finding contributes to a deeper understanding of the photonic RSB regime and the dynamics of RS-based RFL.

Submitted 17 December, 2023; originally announced December 2023.

Comments: 15 pages, 9 figures

arXiv:2312.06655 [pdf, other]

Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior

Authors: Fangfu Liu, Diankun Wu, Yi Wei, Yongming Rao, Yueqi Duan

Abstract: Recently, 3D content creation from text prompts has demonstrated remarkable progress by utilizing 2D and 3D diffusion models. While 3D diffusion models ensure great multi-view consistency, their ability to generate high-quality and diverse 3D assets is hindered by the limited 3D data. In contrast, 2D diffusion models find a distillation approach that achieves excellent generalization and rich details without any 3D data. However, 2D lifting methods suffer from inherent view-agnostic ambiguity thereby leading to serious multi-face Janus issues, where text prompts fail to provide sufficient guidance to learn coherent 3D results. Instead of retraining a costly viewpoint-aware model, we study how to fully exploit easily accessible coarse 3D knowledge to enhance the prompts and guide 2D lifting optimization for refinement. In this paper, we propose Sherpa3D, a new text-to-3D framework that achieves high-fidelity, generalizability, and geometric consistency simultaneously. Specifically, we design a pair of guiding strategies derived from the coarse 3D prior generated by the 3D diffusion model: a structural guidance for geometric fidelity and a semantic guidance for 3D coherence. Employing the two types of guidance, the 2D diffusion model enriches the 3D content with diversified and high-quality results. Extensive experiments show the superiority of our Sherpa3D over the state-of-the-art text-to-3D methods in terms of quality and 3D consistency.

Submitted 11 December, 2023; originally announced December 2023.

Comments: Project page: https://liuff19.github.io/Sherpa3D/

arXiv:2312.04784 [pdf, other]

Reality's Canvas, Language's Brush: Crafting 3D Avatars from Monocular Video

Authors: Yuchen Rao, Eduardo Perez Pellitero, Benjamin Busam, Yiren Zhou, Jifei Song

Abstract: Recent advancements in 3D avatar generation excel with multi-view supervision for photorealistic models. However, monocular counterparts lag in quality despite broader applicability. We propose ReCaLaB to close this gap. ReCaLaB is a fully-differentiable pipeline that learns high-fidelity 3D human avatars from just a single RGB video. A pose-conditioned deformable NeRF is optimized to volumetrically represent a human subject in canonical T-pose. The canonical representation is then leveraged to efficiently associate neural textures using 2D-3D correspondences. This enables the separation of diffused color generation and lighting correction branches that jointly compose an RGB prediction. The design allows control of intermediate results for human pose, body shape, texture, and lighting with text prompts. An image-conditioned diffusion model thereby helps to animate the appearance and pose of the 3D avatar to create video sequences with previously unseen human motion. Extensive experiments show that ReCaLaB outperforms previous monocular approaches in terms of image quality for image synthesis tasks. Moreover, natural language offers an intuitive user interface for creative manipulation of 3D human avatars.

Submitted 24 March, 2024; v1 submitted 7 December, 2023; originally announced December 2023.

Comments: Video link: https://youtu.be/Oz83z1es2J4

arXiv:2309.11857 [pdf, other]

TCOVIS: Temporally Consistent Online Video Instance Segmentation

Authors: Junlong Li, Bingyao Yu, Yongming Rao, Jie Zhou, Jiwen Lu

Abstract: In recent years, significant progress has been made in video instance segmentation (VIS), with many offline and online methods achieving state-of-the-art performance. While offline methods have the advantage of producing temporally consistent predictions, they are not suitable for real-time scenarios. Conversely, online methods are more practical, but maintaining temporal consistency remains a challenging task. In this paper, we propose a novel online method for video instance segmentation, called TCOVIS, which fully exploits the temporal information in a video clip. The core of our method consists of a global instance assignment strategy and a spatio-temporal enhancement module, which improve the temporal consistency of the features from two aspects. Specifically, we perform global optimal matching between the predictions and ground truth across the whole video clip, and supervise the model with the global optimal objective. We also capture the spatial feature and aggregate it with the semantic feature between frames, thus realizing the spatio-temporal enhancement. We evaluate our method on four widely adopted VIS benchmarks, namely YouTube-VIS 2019/2021/2022 and OVIS, and achieve state-of-the-art performance on all benchmarks without bells-and-whistles. For instance, on YouTube-VIS 2021, TCOVIS achieves 49.5 AP and 61.3 AP with ResNet-50 and Swin-L backbones, respectively. Code is available at https://github.com/jun-long-li/TCOVIS.

Submitted 21 September, 2023; originally announced September 2023.

Comments: 11 pages, 4 figures. This paper has been accepted for ICCV 2023

arXiv:2309.05437 [pdf, ps, other]

Generation of three-dimensional cluster entangled state

Authors: Chan Roh, Geunhee Gwak, Young-Do Yoon, Young-Sik Ra

Abstract: Measurement-based quantum computing is a promising paradigm of quantum computation, where universal computing is achieved through a sequence of local measurements. The backbone of this approach is the preparation of multipartite entanglement, known as cluster states. While a cluster state with two-dimensional (2D) connectivity is required for universality, a three-dimensional (3D) cluster state is necessary for additionally achieving fault tolerance. However, the challenge of making 3D connectivity has limited cluster state generation up to 2D. Here we demonstrate deterministic generation of a 3D cluster state based on the photonic continuous-variable platform. To realize 3D connectivity, we harness a crucial advantage of time-frequency modes of ultrafast quantum light: an arbitrary complex mode basis can be accessed directly, enabling connectivity as desired. We demonstrate the versatility of our method by generating cluster states with 1D, 2D, and 3D connectivities. For their complete characterization, we develop a quantum state tomography method for multimode Gaussian states. Moreover, we verify the cluster state generation by nullifier measurements as well as full inseparability tests. Our work paves the way toward fault-tolerant and universal measurement-based quantum computing.

Submitted 16 January, 2024; v1 submitted 11 September, 2023; originally announced September 2023.

arXiv:2308.14912 [pdf, other]

Multi-wavelength observations of a B-class flare using XSM, AIA, and XRT

Authors: Yamini K. Rao, B. Mondal, Giulio Del Zanna, N. P. S. Mithun, S. V. Vadawale, K. K. Reeves, Helen E. Mason, Anil Bhardwaj

Abstract: We present multi-wavelength observations by Chandrayaan-2/XSM, SDO/AIA and Hinode/XRT of a B-class flare observed on 25th February, 2021, originating from an active region (AR 12804) near the North-West limb. The microflare lasts for approximately 30 minutes and is composed of hot loops reaching temperatures of 10 MK. We report excellent agreement (within 20 percent) for the average effective temperatures obtained at the flare peak from all three instruments, which have different temperature sensitivities. The XRT filter combination of Be-thin and Be-med provides an excellent opportunity to measure the high temperatures in such microflare events. The elemental abundances during the evolution of the microflare are also studied and observed to drop towards photospheric values at the flare peak time, compared to coronal values during the rise and decay phase. This is consistent with previous XSM studies.

Submitted 28 August, 2023; originally announced August 2023.

Comments: 18 pages, 18 figures, accepted by ApJ

arXiv:2308.05221 [pdf, other]

Alexa, play with robot: Introducing the First Alexa Prize SimBot Challenge on Embodied AI

Authors: Hangjie Shi, Leslie Ball, Govind Thattai, Desheng Zhang, Lucy Hu, Qiaozi Gao, Suhaila Shakiah, Xiaofeng Gao, Aishwarya Padmakumar, Bofei Yang, Cadence Chung, Dinakar Guthy, Gaurav Sukhatme, Karthika Arumugam, Matthew Wen, Osman Ipek, Patrick Lange, Rohan Khanna, Shreyas Pansare, Vasu Sharma, Chao Zhang, Cris Flagg, Daniel Pressel, Lavina Vaz, Luke Dai , et al. (17 additional authors not shown)

Abstract: The Alexa Prize program has empowered numerous university students to explore, experiment, and showcase their talents in building conversational agents through challenges like the SocialBot Grand Challenge and the TaskBot Challenge. As conversational agents increasingly appear in multimodal and embodied contexts, it is important to explore the affordances of conversational interaction augmented with computer vision and physical embodiment. This paper describes the SimBot Challenge, a new challenge in which university teams compete to build robot assistants that complete tasks in a simulated physical environment. This paper provides an overview of the SimBot Challenge, which included both online and offline challenge phases. We describe the infrastructure and support provided to the teams including Alexa Arena, the simulated environment, and the ML toolkit provided to teams to accelerate their building of vision and language models. We summarize the approaches the participating teams took to overcome research challenges and extract key lessons learned. Finally, we provide analysis of the performance of the competing SimBots during the competition.

Submitted 9 August, 2023; originally announced August 2023.

arXiv:2307.14971 [pdf, other]

Take-A-Photo: 3D-to-2D Generative Pre-training of Point Cloud Models

Authors: Ziyi Wang, Xumin Yu, Yongming Rao, Jie Zhou, Jiwen Lu

Abstract: With the overwhelming trend of mask image modeling led by MAE, generative pre-training has shown a remarkable potential to boost the performance of fundamental models in 2D vision. However, in 3D vision, the over-reliance on Transformer-based backbones and the unordered nature of point clouds have restricted the further development of generative pre-training. In this paper, we propose a novel 3D-to-2D generative pre-training method that is adaptable to any point cloud model. We propose to generate view images from different instructed poses via the cross-attention mechanism as the pre-training scheme. Generating view images has more precise supervision than its point cloud counterpart, thus assisting 3D backbones to have a finer comprehension of the geometrical structure and stereoscopic relations of the point cloud. Experimental results have proved the superiority of our proposed 3D-to-2D generative pre-training over previous pre-training methods. Our method is also effective in boosting the performance of architecture-oriented approaches, achieving state-of-the-art performance when fine-tuning on ScanObjectNN classification and ShapeNetPart segmentation tasks. Code is available at https://github.com/wangzy22/TAP.

Submitted 7 September, 2023; v1 submitted 27 July, 2023; originally announced July 2023.

Comments: Accepted to ICCV 2023, project page: https://tap.ivg-research.xyz

arXiv:2306.11175 [pdf]

Developing Digital Twins for Earth Systems: Purpose, Requisites, and Benefits

Authors: Yuhan Rao, Rob Redmon, Kirstine Dale, Sue E. Haupt, Aaron Hopkinson, Ann Bostrom, Sid Boukabara, Thomas Geenen, David M. Hall, Benjamin D. Smith, Dev Niyogi, V. Ramaswamy, Eric A. Kihn

Abstract: The accelerated change in our planet due to human activities has led to grand societal challenges including health crises, intensified extreme weather events, food security, environmental injustice, etc. Digital twin systems combined with emerging technologies such as artificial intelligence and edge computing provide opportunities to support planning and decision-making to address these challenges. Digital twins for Earth systems (DT4ESs) are defined as the digital representation of the complex integrated Earth system including both natural processes and human activities. They have the potential to enable a diverse range of users to explore what-if scenarios across spatial and temporal scales to improve our understanding, prediction, mitigation, and adaptation to grand societal challenges. The 4th NOAA AI Workshop convened around 100 members who are developing or interested in participating in the development of DT4ES to discuss a shared community vision and path forward on fostering a future ecosystem of interoperable DT4ES. This paper summarizes the workshop discussions around DT4ES. We first defined the foundational features of a viable digital twin for Earth systems that can be used to guide the development of various use cases of DT4ES. Finally, we made practical recommendations for the community on different aspects of collaboration in order to enable a future ecosystem of interoperable DT4ES, including equity-centered use case development, community-driven investigation of interoperability for DT4ES, trust-oriented co-development, and developing a community of practice.

Submitted 19 June, 2023; originally announced June 2023.

Comments: This whitepaper is an outcome of the 4th NOAA AI Workshop

arXiv:2305.06852 [pdf, other]

doi 10.1126/sciadv.adi5261

Recovering quantum entanglement after its certification

Authors: Hyeon-Jin Kim, Ji-Hyeok Jung, Kyung-Jun Lee, Young-Sik Ra

Abstract: Entanglement is a crucial quantum resource with broad applications in quantum information science. For harnessing entanglement in practice, it is a prerequisite to certify the entanglement of a given quantum state. However, the certification process itself destroys the entanglement, thereby precluding further exploitation of the entanglement. Resolving this conflict, here we present a protocol that certifies the entanglement of a quantum state without complete destruction, and then, probabilistically recovers the original entanglement to provide useful entanglement for further quantum applications. We experimentally demonstrate this protocol in a photonic quantum system, and highlight its usefulness for selecting high-quality entanglement from a realistic entanglement source. Moreover, our study reveals various tradeoff relations among the physical quantities involved in the protocol. Our results show how entanglement certification can be made compatible with subsequent quantum applications, and more importantly, be beneficial to sort entanglement for better performance in quantum technologies.

Submitted 11 May, 2023; originally announced May 2023.

Journal ref: Sci. Adv. 9, eadi5261 (2023)

arXiv:2303.04060 [pdf, other]

doi 10.1103/PhysRevB.108.085413

Large modulation of thermal transport in 2D semimetal triphosphides by doping-induced electron-phonon coupling

Authors: Yongchao Rao, C. Y. Zhao, Lei Shen, Shenghong Ju

Abstract: Recent studies demonstrate that novel 2D triphosphides semiconductors possess high carrier mobility and promising thermoelectric performance, while the carrier transport behaviors in 2D semimetal triphosphides have never been elucidated before. Herein, using the first-principles calculations and Boltzmann transport theory, we reveal that the electron-phonon coupling can be significant and thus greatly inhibits the electron and phonon transport in electron-doped BP3 and CP3. The intrinsic heat transport capacity of flexural acoustic phonon modes in the wrinkle structure is largely suppressed arising from the strong out-of-plane phonon scatterings, leading to the low phonon thermal conductivity of 1.36 and 5.33 W/(mK) for BP3 and CP3 at room temperature, and at high doping level, the enhanced scattering from electron diminishes the phonon thermal conductivity by 71% and 54% for BP3 and CP3, respectively. Instead, electron thermal conductivity shows nonmonotonic variations with the increase of doping concentration, stemming from the competition between electron-phonon scattering rates and electron group velocity. It is worth noting that the heavy-doping effect induced strong scattering from phonon largely suppresses the electron transport and reduces electron thermal conductivity to the magnitude of phonon thermal conductivity. This work sheds light on the electron and phonon transport properties in semimetal triphosphides monolayer and provides an efficient avenue for the modulation of carrier transport by doping-induced electron-phonon coupling effect.

Submitted 7 March, 2023; originally announced March 2023.

Journal ref: Phys. Rev. B 108, 085413, 2023

arXiv:2303.02153 [pdf, other]

Unleashing Text-to-Image Diffusion Models for Visual Perception

Authors: Wenliang Zhao, Yongming Rao, Zuyan Liu, Benlin Liu, Jie Zhou, Jiwen Lu

Abstract: Diffusion models (DMs) have become the new trend of generative models and have demonstrated a powerful ability of conditional synthesis. Among those, text-to-image diffusion models pre-trained on large-scale image-text pairs are highly controllable by customizable prompts. Unlike the unconditional generative models that focus on low-level attributes and details, text-to-image diffusion models contain more high-level knowledge thanks to the vision-language pre-training. In this paper, we propose VPD (Visual Perception with a pre-trained Diffusion model), a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks. Instead of using the pre-trained denoising autoencoder in a diffusion-based pipeline, we simply use it as a backbone and aim to study how to take full advantage of the learned knowledge. Specifically, we prompt the denoising decoder with proper textual inputs and refine the text features with an adapter, leading to a better alignment to the pre-trained stage and making the visual contents interact with the text prompts. We also propose to utilize the cross-attention maps between the visual features and the text features to provide explicit guidance. Compared with other pre-training methods, we show that vision-language pre-trained diffusion models can be adapted faster to downstream visual perception tasks using the proposed VPD. Extensive experiments on semantic segmentation, referring image segmentation and depth estimation demonstrate the effectiveness of our method. Notably, VPD attains 0.254 RMSE on NYUv2 depth estimation and 73.3% oIoU on RefCOCO-val referring image segmentation, establishing new records on these two benchmarks. Code is available at https://github.com/wl-zhao/VPD

Submitted 3 March, 2023; originally announced March 2023.

Comments: project page: https://vpd.ivg-research.xyz

arXiv:2303.01586 [pdf, other]

Alexa Arena: A User-Centric Interactive Platform for Embodied AI

Authors: Qiaozi Gao, Govind Thattai, Suhaila Shakiah, Xiaofeng Gao, Shreyas Pansare, Vasu Sharma, Gaurav Sukhatme, Hangjie Shi, Bofei Yang, Desheng Zheng, Lucy Hu, Karthika Arumugam, Shui Hu, Matthew Wen, Dinakar Guthy, Cadence Chung, Rohan Khanna, Osman Ipek, Leslie Ball, Kate Bland, Heather Rocker, Yadunandana Rao, Michael Johnston, Reza Ghanadan, Arindam Mandal , et al. (2 additional authors not shown)

Abstract: We introduce Alexa Arena, a user-centric simulation platform for Embodied AI (EAI) research. Alexa Arena provides a variety of multi-room layouts and interactable objects, for the creation of human-robot interaction (HRI) missions. With user-friendly graphics and control mechanisms, Alexa Arena supports the development of gamified robotic tasks readily accessible to general human users, thus opening a new venue for high-efficiency HRI data collection and EAI system evaluation. Along with the platform, we introduce a dialog-enabled instruction-following benchmark and provide baseline results for it. We make Alexa Arena publicly available to facilitate research in building generalizable and assistive embodied agents.

Submitted 7 June, 2023; v1 submitted 2 March, 2023; originally announced March 2023.

arXiv:2302.04867 [pdf, other]

UniPC: A Unified Predictor-Corrector Framework for Fast Sampling of Diffusion Models

Authors: Wenliang Zhao, Lujia Bai, Yongming Rao, Jie Zhou, Jiwen Lu

Abstract: Diffusion probabilistic models (DPMs) have demonstrated a very promising ability in high-resolution image synthesis. However, sampling from a pre-trained DPM is time-consuming due to the multiple evaluations of the denoising network, making it more and more important to accelerate the sampling of DPMs. Despite recent progress in designing fast samplers, existing methods still cannot generate satisfying images in many applications where fewer steps (e.g., $<$10) are favored. In this paper, we develop a unified corrector (UniC) that can be applied after any existing DPM sampler to increase the order of accuracy without extra model evaluations, and derive a unified predictor (UniP) that supports arbitrary order as a byproduct. Combining UniP and UniC, we propose a unified predictor-corrector framework called UniPC for the fast sampling of DPMs, which has a unified analytical form for any order and can significantly improve the sampling quality over previous methods, especially in extremely few steps. We evaluate our methods through extensive experiments including both unconditional and conditional sampling using pixel-space and latent-space DPMs. Our UniPC can achieve 3.87 FID on CIFAR10 (unconditional) and 7.51 FID on ImageNet 256$\times$256 (conditional) with only 10 function evaluations. Code is available at https://github.com/wl-zhao/UniPC.

Submitted 17 October, 2023; v1 submitted 9 February, 2023; originally announced February 2023.

Comments: Accepted by NeurIPS 2023. Project page: https://unipc.ivg-research.xyz

arXiv:2302.00665 [pdf, ps, other]

Necessary and sufficient conditions for posterior propriety for generalized linear mixed models

Authors: Yalin Rao, Vivekananda Roy

Abstract: Generalized linear mixed models (GLMMs) are commonly used to analyze correlated discrete or continuous response data. In Bayesian GLMMs, the often-used improper priors may yield undesirable improper posterior distributions. Thus, verifying posterior propriety is crucial for valid applications of Bayesian GLMMs with improper priors. Here, we consider the popular improper uniform prior on the regression coefficients and several proper or improper priors, including the widely used gamma and power priors on the variance components of the random effects. We also construct an approximate Jeffreys' prior for objective Bayesian analysis of GLMMs. We derive necessary and sufficient conditions for posterior propriety for Bayesian GLMMs where the response variables have distributions from the exponential family. For the two most widely used GLMMs, namely, the binomial and Poisson GLMMs, we further refine our results by providing easily verifiable conditions compared to the currently available results. Finally, we use examples involving one-way and two-way random effects models to demonstrate the theoretical results derived here.

Submitted 1 February, 2023; originally announced February 2023.

arXiv:2301.07950 [pdf]

doi 10.1002/adma.202207777

A Monolithic Graphene-Functionalized Microlaser for Multispecies Gas Detection

Authors: Yanhong Guo, Zhaoyu Li, Ning An, Yongzheng Guo, Yuchen Wang, Yusen Yuan, Hao Zhang, Teng Tan, Caihao Wu, Bo Peng, Giancarlo Soavi, Yunjiang Rao, Baicheng Yao

Abstract: Optical microcavity enhanced light-matter interaction offers a powerful tool to develop fast and precise sensing techniques, spurring applications in the detection of biochemical targets ranging from cells and nanoparticles to large molecules. However, the intrinsic inertness of such pristine microresonators limits their spread in new fields such as gas detection. Here, a functionalized microlaser sensor is realized by depositing graphene in an erbium-doped over-modal microsphere. By using a 980 nm pump, multiple laser lines excited in different mode families of the microresonator are co-generated in a single device. The interference between these splitting mode lasers produces beat notes in the electrical domain (0.2-1.1 MHz) with sub-kHz accuracy, thanks to the graphene-induced intracavity backward scattering. This allows for multispecies gas identification from a mixture, and ultrasensitive gas detection down to individual molecules.

Submitted 19 January, 2023; originally announced January 2023.

Journal ref: Advanced Materials 34 (2022) 2207777

arXiv:2301.04545 [pdf, other]

AdaPoinTr: Diverse Point Cloud Completion with Adaptive Geometry-Aware Transformers

Authors: Xumin Yu, Yongming Rao, Ziyi Wang, Jiwen Lu, Jie Zhou

Abstract: In this paper, we present a new method that reformulates point cloud completion as a set-to-set translation problem and design a new model, called PoinTr, which adopts a Transformer encoder-decoder architecture for point cloud completion. By representing the point cloud as a set of unordered groups of points with position embeddings, we convert the input data to a sequence of point proxies and employ the Transformers for generation. To facilitate Transformers to better leverage the inductive bias about 3D geometric structures of point clouds, we further devise a geometry-aware block that models the local geometric relationships explicitly. The migration of Transformers enables our model to better learn structural knowledge and preserve detailed information for point cloud completion. Taking a step towards more complicated and diverse situations, we further propose AdaPoinTr by developing an adaptive query generation mechanism and designing a novel denoising task during point cloud completion. Coupling these two techniques enables us to train the model efficiently and effectively: we reduce training time (by 15x or more) and improve completion performance (over 20%). We also show our method can be extended to the scene-level point cloud completion scenario by designing a new geometry-enhanced semantic scene completion framework. Extensive experiments on the existing and newly-proposed datasets demonstrate the effectiveness of our method, which attains 6.53 CD on PCN, 0.81 CD on ShapeNet-55 and 0.392 MMD on real-world KITTI, surpassing other work by a large margin and establishing a new state of the art on various benchmarks. Most notably, AdaPoinTr can achieve such promising performance with higher throughputs and fewer FLOPs compared with the previous best methods in practice. The code and datasets are available at https://github.com/yuxumin/PoinTr

Submitted 11 January, 2023; originally announced January 2023.

Comments: Extension of our ICCV 2021 work: arXiv:2108.08839. Code is available at https://github.com/yuxumin/PoinTr

arXiv:2212.14343 [pdf, ps, other]

doi 10.1103/PhysRevResearch.5.043057

Continuous-Variable Nonclassicality Detection under Coarse-Grained Measurement

Authors: Chan Roh, Young-Do Yoon, Jiyong Park, Young-Sik Ra

Abstract: Coarse graining is a common imperfection of realistic quantum measurement, obstructing the direct observation of quantum features. Under highly coarse-grained measurement, we experimentally detect the continuous-variable nonclassicality of both Gaussian and non-Gaussian states. Remarkably, we find that this coarse-grained measurement outperforms the conventional fine-grained measurement for nonclassicality detection: it detects nonclassicality beyond the reach of the variance criterion, and furthermore, it exhibits stronger statistical significance than the high-order moments method. Our work shows the usefulness of coarse-grained measurement by providing a reliable and efficient way of nonclassicality detection for quantum technologies.

Submitted 29 December, 2022; originally announced December 2022.

Journal ref: Phys. Rev. Research 5, 043057 (2023)

arXiv:2212.04638 [pdf, other]

FLAG3D: A 3D Fitness Activity Dataset with Language Instruction

Authors: Yansong Tang, Jinpeng Liu, Aoyang Liu, Bin Yang, Wenxun Dai, Yongming Rao, Jiwen Lu, Jie Zhou, Xiu Li

Abstract: With the continuously thriving popularity around the world, fitness activity analytics has become an emerging research topic in computer vision. While a variety of new tasks and algorithms have been proposed recently, there is a growing hunger for data resources involving high-quality data, fine-grained labels, and diverse environments. In this paper, we present FLAG3D, a large-scale 3D fitness activity dataset with language instruction containing 180K sequences of 60 categories. FLAG3D features the following three aspects: 1) accurate and dense 3D human pose captured from advanced MoCap system to handle the complex activity and large movement, 2) detailed and professional language instruction to describe how to perform a specific activity, 3) versatile video resources from a high-tech MoCap system, rendering software, and cost-effective smartphones in natural environments. Extensive experiments and in-depth analysis show that FLAG3D contributes great research value for various challenges, such as cross-domain human action recognition, dynamic human mesh recovery, and language-guided human action generation. Our dataset and source code are publicly available at https://andytang15.github.io/FLAG3D.

Submitted 19 April, 2023; v1 submitted 8 December, 2022; originally announced December 2022.

Comments: Accepted to CVPR2023

arXiv:2211.05872 [pdf, other]

doi 10.1088/1748-0221/18/03/P03024

Intensity Limit in Compact H$^-$ and H$_2^+$ Cyclotrons

Authors: Thomas Planche, Richard A. Baartman, Hui Wen Koay, Yi-Nong Rao, Lige Zhang

Abstract: Compact H$^-$ cyclotrons are used all across the globe to produce medical isotopes. Machines with external ion sources have demonstrated average extracted currents on the order of a few mA, although reported operational numbers are typically around 1\,mA or below. To explore the possibility of extracting even more current from such cyclotrons, it is important to understand the mechanisms that drive intensity limits and how they scale. In this paper we review some of the key aspects of the beam dynamics in the central region of compact cyclotrons, including rf electric focusing and space charge effects. We derive the scaling of the phase acceptance with the rf gap voltage, harmonic number, etc. We also explore the scaling with different types of ions such as H$^-$, H$_2^+$ and H$_3^+$. We discuss the impact of mechanical erosion of the central region electrodes. Throughout the paper, we use examples and experimental data from two compact H$^-$ cyclotrons for reference: the TR-30 series and the TRIUMF 500\,MeV machine.

Submitted 4 January, 2023; v1 submitted 10 November, 2022; originally announced November 2022.

arXiv:2211.01504 [pdf, other]

doi 10.1088/1748-0221/18/03/P03021

Redundant Field Survey Data of Cyclotron with Imperfect Median Plane

Authors: Lige Zhang, Yi-Nong Rao

Abstract: An accurate and detailed field map is important for cyclotron beam dynamics studies. During the long history of cyclotron studies, many techniques have been developed by cyclotron pioneers for the treatment of median plane field map. In this paper, we take the TRIUMF 500 MeV cyclotron as an example to study the asymmetric field resulting from the imperfect median plane symmetry. The ``Gordon approach'' and a highly accurate compact finite differentiation method are used to investigate the historical field survey data. The redundancy in the survey data is revealed by the expansion method, which also makes it possible to correct the error in the measurement. Finally, both the azimuthal field $B_θ$ and the axial gradient of the axial field $dB_z/dz$ in the median plane are corrected using the radial field map $B_r$. The influence of the correction is examined by recalculating the equilibrium orbit properties of the TRIUMF cyclotron. The result shows significantly increased vertical centering errors of the closed orbits. Further simulation study suggests that these centering errors can be reduced to below 1.5 cm by adjusting the trim coils' $B_r$ field within the output limits of our trim coils' power supplies. The error in the measurement field data may explain why the calculated trim coils' settings during the cyclotron commissioning in 1974 encountered difficulty.

Submitted 2 November, 2022; originally announced November 2022.

arXiv:2210.03364 [pdf, other]

doi 10.3847/1538-4357/ac98b4

Soft X-ray Spectral Diagnostics of Multi-thermal Plasma in Solar Flares with Chandrayaan-2 XSM

Authors: N. P. S. Mithun, Santosh V. Vadawale, Giulio Del Zanna, Yamini K. Rao, Bhuwan Joshi, Aveek Sarkar, Biswajit Mondal, P. Janardhan, Anil Bhardwaj, Helen E. Mason

Abstract: Spectroscopic observations in X-ray wavelengths provide excellent diagnostics of the temperature distribution in solar flare plasma. The Solar X-ray Monitor (XSM) onboard the Chandrayaan-2 mission provides broad-band disk integrated soft X-ray solar spectral measurements in the energy range of 1-15 keV with high spectral resolution and time cadence. In this study, we analyse X-ray spectra of three representative GOES C-class flares obtained with the XSM to investigate the evolution of various plasma parameters during the course of the flares. Using the soft X-ray spectra consisting of the continuum and well-resolved line complexes of major elements like Mg, Si, and Fe, we investigate the validity of the isothermal and multi-thermal assumptions on the high temperature components of the flaring plasma. We show that the soft X-ray spectra during the impulsive phase of the high intensity flares are inconsistent with isothermal models and are best fitted with double peaked differential emission measure distributions where the temperature of the hotter component rises faster than that of the cooler component. The two distinct temperature components observed in DEM models during the impulsive phase of the flares suggest the presence of the directly heated plasma in the corona and evaporated plasma from the chromospheric footpoints. We also find that the abundances of low FIP elements Mg, Si, and Fe reduce from near coronal to near photospheric values during the rising phase of the flare and recover back to coronal values during the decay phase, which is also consistent with the chromospheric evaporation scenario.

Submitted 7 October, 2022; originally announced October 2022.

Comments: Accepted for publication in ApJ

arXiv:2210.01253 [pdf, other]

PLOT: Prompt Learning with Optimal Transport for Vision-Language Models

Authors: Guangyi Chen, Weiran Yao, Xiangchen Song, Xinyue Li, Yongming Rao, Kun Zhang

Abstract: With the increasing attention to large vision-language models such as CLIP, there has been a significant amount of effort dedicated to building efficient prompts. Unlike conventional methods of only learning one single prompt, we propose to learn multiple comprehensive prompts to describe diverse characteristics of categories such as intrinsic attributes or extrinsic contexts. However, directly matching each prompt to the same visual feature is problematic, as it pushes the prompts to converge to one point. To solve this problem, we propose to apply optimal transport to match the vision and text modalities. Specifically, we first model images and the categories with visual and textual feature sets. Then, we apply a two-stage optimization strategy to learn the prompts. In the inner loop, we optimize the optimal transport distance to align visual features and prompts by the Sinkhorn algorithm, while in the outer loop, we learn the prompts by this distance from the supervised data. Extensive experiments are conducted on the few-shot recognition task and the improvement demonstrates the superiority of our method. The code is available at https://github.com/CHENGY12/PLOT.

Submitted 9 February, 2023; v1 submitted 3 October, 2022; originally announced October 2022.

Comments: ICLR 2023, Spotlight

arXiv:2209.05555 [pdf]

An Embedding-Based Grocery Search Model at Instacart

Authors: Yuqing Xie, Taesik Na, Xiao Xiao, Saurav Manchanda, Young Rao, Zhihong Xu, Guanghua Shu, Esther Vasiete, Tejaswi Tenneti, Haixun Wang

Abstract: The key to e-commerce search is how to best utilize the large yet noisy log data. In this paper, we present our embedding-based model for grocery search at Instacart. The system learns query and product representations with a two-tower transformer-based encoder architecture. To tackle the cold-start problem, we focus on content-based features. To train the model efficiently on noisy data, we propose a self-adversarial learning method and a cascade training method. On an offline human evaluation dataset, we achieve 10% relative improvement in RECALL@20, and for online A/B testing, we achieve 4.1% cart-adds per search (CAPS) and 1.5% gross merchandise value (GMV) improvement. We describe how we train and deploy the embedding based search model and give a detailed analysis of the effectiveness of our method.

Submitted 12 September, 2022; originally announced September 2022.

Comments: Accepted by SIGIR eCom, July 15, 2022

arXiv:2208.02812 [pdf, other]

P2P: Tuning Pre-trained Image Models for Point Cloud Analysis with Point-to-Pixel Prompting

Authors: Ziyi Wang, Xumin Yu, Yongming Rao, Jie Zhou, Jiwen Lu

Abstract: Nowadays, pre-training big models on large-scale datasets has become a crucial topic in deep learning. The pre-trained models with high representation ability and transferability achieve a great success and dominate many downstream tasks in natural language processing and 2D vision. However, it is non-trivial to promote such a pretraining-tuning paradigm to the 3D vision, given the limited training data that are relatively inconvenient to collect. In this paper, we provide a new perspective of leveraging pre-trained 2D knowledge in 3D domain to tackle this problem, tuning pre-trained image models with the novel Point-to-Pixel prompting for point cloud analysis at a minor parameter cost. Following the principle of prompting engineering, we transform point clouds into colorful images with geometry-preserved projection and geometry-aware coloring to adapt to pre-trained image models, whose weights are kept frozen during the end-to-end optimization of point cloud analysis tasks. We conduct extensive experiments to demonstrate that cooperating with our proposed Point-to-Pixel Prompting, better pre-trained image model will lead to consistently better performance in 3D vision. Enjoying prosperous development from image pre-training field, our method attains 89.3% accuracy on the hardest setting of ScanObjectNN, surpassing conventional point cloud models with much fewer trainable parameters. Our framework also exhibits very competitive performance on ModelNet classification and ShapeNet Part Segmentation. Code is available at https://github.com/wangzy22/P2P.

Submitted 12 October, 2022; v1 submitted 4 August, 2022; originally announced August 2022.

Comments: Accepted to NeurIPS 2022, project page: https://p2p.ivg-research.xyz

arXiv:2207.14284 [pdf, other]

HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions

Authors: Yongming Rao, Wenliang Zhao, Yansong Tang, Jie Zhou, Ser-Nam Lim, Jiwen Lu

Abstract: Recent progress in vision Transformers exhibits great success in various tasks driven by the new spatial modeling mechanism based on dot-product self-attention. In this paper, we show that the key ingredients behind the vision Transformers, namely input-adaptive, long-range and high-order spatial interactions, can also be efficiently implemented with a convolution-based framework. We present the Recursive Gated Convolution ($\textit{g}^\textit{n}$Conv) that performs high-order spatial interactions with gated convolutions and recursive designs. The new operation is highly flexible and customizable, which is compatible with various variants of convolution and extends the two-order interactions in self-attention to arbitrary orders without introducing significant extra computation. $\textit{g}^\textit{n}$Conv can serve as a plug-and-play module to improve various vision Transformers and convolution-based models. Based on the operation, we construct a new family of generic vision backbones named HorNet. Extensive experiments on ImageNet classification, COCO object detection and ADE20K semantic segmentation show that HorNet outperforms Swin Transformers and ConvNeXt by a significant margin with similar overall architecture and training configurations. HorNet also shows favorable scalability to more training data and larger model sizes. Apart from the effectiveness in visual encoders, we also show $\textit{g}^\textit{n}$Conv can be applied to task-specific decoders and consistently improve dense prediction performance with less computation. Our results demonstrate that $\textit{g}^\textit{n}$Conv can be a new basic module for visual modeling that effectively combines the merits of both vision Transformers and CNNs. Code is available at https://github.com/raoyongming/HorNet

Submitted 11 October, 2022; v1 submitted 28 July, 2022; originally announced July 2022.

Comments: project page: https://hornet.ivg-research.xyz

arXiv:2207.06879 [pdf, ps, other]

doi 10.3847/1538-4357/ac7a9a

Multi-wavelength observations by XSM, Hinode and SDO of an active region. Chemical abundances and temperatures

Authors: G. Del Zanna, B. Mondal, Y. K. Rao, N. P. S. Mithun, S. V. Vadawale, K. K. Reeves, H. E. Mason, A. Sarkar, P. Janardhan, A. Bhardwaj

Abstract: We have reviewed the first year of observations of the Solar X-ray Monitor (XSM) onboard Chandrayaan-2, and the available multi-wavelength observations to complement the XSM data, focusing on Solar Dynamics Observatory AIA and Hinode XRT, EIS observations. XSM has provided disk-integrated solar spectra in the 1--15 keV energy range, observing a large number of microflares. We present an analysis of multi-wavelength observations of AR 12759 during its disk crossing. We use a new radiometric calibration of EIS to find that the quiescent AR core emission during its disk crossing has a distribution of temperatures and chemical abundances that does not change significantly over time. An analysis of the XSM spectra confirms the EIS results, and shows that the low First Ionization Potential (FIP) elements are enhanced, compared to their photospheric values. The frequent microflares produced by the AR did not affect the abundances of the quiescent AR core. We also present an analysis of one of the flares it produced, SOL2020-04-09T09:32. The XSM analysis indicates isothermal temperatures reaching 6 MK. The lack of very high-T emission is confirmed by AIA. We find excellent agreement between the observed XSM spectrum and the one predicted using an AIA DEM analysis. In contrast, the XRT Al-Poly / Be-thin filter ratio gives lower temperatures for the quiescent and flaring phases. We show that this is due to the sensitivity of this ratio to low temperatures, as the XRT filter ratios predicted with a DEM analysis based on EIS and AIA give values in good agreement with the observed ones.

Submitted 14 July, 2022; originally announced July 2022.

Comments: accepted for publication

arXiv:2207.02409 [pdf]

Sub-monolayer Biolasers: Lower Gain, Higher Sensitivity

Authors: C. Gong, X. Yang, S. J. Tang, Q. Q. Zhang, Y. Wang, Y. L. Liu, Y. C. Chen, G. D. Peng, X. Fan, Y. F. Xiao, Y. J. Rao, Y. Gong

Abstract: Biomarker detection is the key to identifying health risks. However, designing sensitive biosensors in a single-use mode for disease diagnosis remains a major challenge. Here, we report sub-monolayer biolasers with remarkable repeatability for ultrasensitive and disposable biomarker detection. The biolaser sensors are designed by employing the telecom optical fibers as distributed optical microcavities and pushing the gain molecules down to the sub-monolayer level. We observe a status transition from the monolayer biolaser to the sub-monolayer biolaser by tuning the specific conjugation. By reducing the fluorophores down to the threshold density (~3.2 x 10^-13 mol/cm^2), we demonstrate an ultimate sensitivity of sub-monolayer biolaser with six orders of magnitude enhancement compared with the monolayer biolasers. We further achieved ultrasensitive immunoassay for Parkinson's disease biomarker, alpha-synuclein, with a lower limit of detection of 0.32 pM in serum. This biosensor with massive fabrication capability at ultralow cost provides a general method for the ultrasensitive disposable biodetection of disease biomarkers.

Submitted 5 July, 2022; originally announced July 2022.

Comments: 27 pages, 15 figures

MSC Class: 78A70

arXiv:2207.01580 [pdf, other]

Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks

Authors: Yongming Rao, Zuyan Liu, Wenliang Zhao, Jie Zhou, Jiwen Lu

Abstract: In this paper, we present a new approach for model acceleration by exploiting spatial sparsity in visual data. We observe that the final prediction in vision Transformers is only based on a subset of the most informative tokens, which is sufficient for accurate image recognition. Based on this observation, we propose a dynamic token sparsification framework to prune redundant tokens progressively and dynamically based on the input to accelerate vision Transformers. Specifically, we devise a lightweight prediction module to estimate the importance score of each token given the current features. The module is added to different layers to prune redundant tokens hierarchically. While the framework is inspired by our observation of the sparse attention in vision Transformers, we find the idea of adaptive and asymmetric computation can be a general solution for accelerating various architectures. We extend our method to hierarchical models including CNNs and hierarchical vision Transformers as well as more complex dense prediction tasks that require structured feature maps by formulating a more generic dynamic spatial sparsification framework with progressive sparsification and asymmetric computation for different spatial locations. By applying lightweight fast paths to less informative features and using more expressive slow paths to more important locations, we can maintain the structure of feature maps while significantly reducing the overall computations. Extensive experiments demonstrate the effectiveness of our framework on various modern architectures and different visual recognition tasks. Our results clearly demonstrate that dynamic spatial sparsification offers a new and more effective dimension for model acceleration. Code is available at https://github.com/raoyongming/DynamicViT

Submitted 2 June, 2023; v1 submitted 4 July, 2022; originally announced July 2022.

Comments: Accepted to T-PAMI. Journal version of our NeurIPS 2021 work: arXiv:2106.02034. Code is available at https://github.com/raoyongming/DynamicViT

arXiv:2206.11228 [pdf, other]

Adversarially trained neural representations may already be as robust as corresponding biological neural representations

Authors: Chong Guo, Michael J. Lee, Guillaume Leclerc, Joel Dapello, Yug Rao, Aleksander Madry, James J. DiCarlo

Abstract: Visual systems of primates are the gold standard of robust perception. There is thus a general belief that mimicking the neural representations that underlie those systems will yield artificial visual systems that are adversarially robust. In this work, we develop a method for performing adversarial visual attacks directly on primate brain activity. We then leverage this method to demonstrate that the above-mentioned belief might not be well founded. Specifically, we report that the biological neurons that make up visual systems of primates exhibit susceptibility to adversarial perturbations that is comparable in magnitude to existing (robustly trained) artificial neural networks.

Submitted 19 June, 2022; originally announced June 2022.

Comments: 10 pages, 6 figures, ICML2022

arXiv:2206.04916 [pdf, other]

PatchComplete: Learning Multi-Resolution Patch Priors for 3D Shape Completion on Unseen Categories

Authors: Yuchen Rao, Yinyu Nie, Angela Dai

Abstract: While 3D shape representations enable powerful reasoning in many visual and perception applications, learning 3D shape priors tends to be constrained to the specific categories trained on, leading to an inefficient learning process, particularly for general applications with unseen categories. Thus, we propose PatchComplete, which learns effective shape priors based on multi-resolution local patches, which are often more general than full shapes (e.g., chairs and tables often both share legs) and thus enable geometric reasoning about unseen class categories. To learn these shared substructures, we learn multi-resolution patch priors across all train categories, which are then associated to input partial shape observations by attention across the patch priors, and finally decoded into a complete shape reconstruction. Such patch-based priors avoid overfitting to specific train categories and enable reconstruction on entirely unseen categories at test time. We demonstrate the effectiveness of our approach on synthetic ShapeNet data as well as challenging real-scanned objects from ScanNet, which include noise and clutter, improving over state of the art in novel-category shape completion by 19.3% in chamfer distance on ShapeNet, and 9.0% for ScanNet.

Submitted 12 October, 2022; v1 submitted 10 June, 2022; originally announced June 2022.

Comments: Video link: https://www.youtube.com/watch?v=Ch1rvw2D_Kc ; Project page: https://yuchenrao.github.io/projects/patchComplete/patchComplete.html ; Accepted to NeurIPS'22

arXiv:2206.03385 [pdf]

doi 10.1038/s41467-022-30901-8

Nonlinear co-generation of graphene plasmons for optoelectronic logic operations

Authors: Y. Li, N. An, Z. Lu, Y. Wang, B. Chang, T. Tan, X. Guo, X. Xu, J. He, H. Xia, Z. Wu, Y. Su, Y. Liu, Y. Rao, G. Soavi, B. Yao

Abstract: Surface plasmons in graphene provide a compelling strategy for advanced photonic technologies thanks to their tight confinement, fast response and tunability. Recent advances in the field of all optical generation of graphene plasmons in planar waveguides offer a promising method for high speed signal processing in nanoscale integrated optoelectronic devices. Here, we use two counter propagating frequency combs with temporally synchronized pulses to demonstrate deterministic all optical generation and electrical control of multiple plasmon polaritons, excited via difference frequency generation (DFG). Electrical tuning of a hybrid graphene fibre device offers a precise control over the DFG phase matching, leading to tunable responses of the graphene plasmons at different frequencies across a broadband (0 - 50 THz) and provides a powerful tool for high speed logic operations. Our results offer insights for plasmonics on hybrid photonic devices based on layered materials and pave the way to high speed integrated optoelectronic computing circuits.

Submitted 7 June, 2022; originally announced June 2022.

Journal ref: Nat. Commun. 13, 3138 (2022)

arXiv:2205.13490 [pdf, other]

SemAffiNet: Semantic-Affine Transformation for Point Cloud Segmentation

Authors: Ziyi Wang, Yongming Rao, Xumin Yu, Jie Zhou, Jiwen Lu

Abstract: Conventional point cloud semantic segmentation methods usually employ an encoder-decoder architecture, where mid-level features are locally aggregated to extract geometric information. However, the over-reliance on these class-agnostic local geometric representations may raise confusion between local parts from different categories that are similar in appearance or spatially adjacent. To address this issue, we argue that mid-level features can be further enhanced with semantic information, and propose semantic-affine transformation that transforms features of mid-level points belonging to different categories with class-specific affine parameters. Based on this technique, we propose SemAffiNet for point cloud semantic segmentation, which utilizes the attention mechanism in the Transformer module to implicitly and explicitly capture global structural knowledge within local parts for overall comprehension of each category. We conduct extensive experiments on the ScanNetV2 and NYUv2 datasets, and evaluate semantic-affine transformation on various 3D point cloud and 2D image segmentation baselines, where both qualitative and quantitative results demonstrate the superiority and generalization ability of our proposed approach. Code is available at https://github.com/wangzy22/SemAffiNet.

Submitted 26 May, 2022; originally announced May 2022.

Comments: Accepted to CVPR 2022

arXiv:2204.03646 [pdf, other]

FineDiving: A Fine-grained Dataset for Procedure-aware Action Quality Assessment

Authors: Jinglin Xu, Yongming Rao, Xumin Yu, Guangyi Chen, Jie Zhou, Jiwen Lu

Abstract: Most existing action quality assessment methods rely on the deep features of an entire video to predict the score, which is less reliable due to the non-transparent inference process and poor interpretability. We argue that understanding both high-level semantics and internal temporal structures of actions in competitive sports videos is the key to making predictions accurate and interpretable. Towards this goal, we construct a new fine-grained dataset, called FineDiving, developed on diverse diving events with detailed annotations on action procedures. We also propose a procedure-aware approach for action quality assessment, learned by a new Temporal Segmentation Attention module. Specifically, we propose to parse pairwise query and exemplar action instances into consecutive steps with diverse semantic and temporal correspondences. The procedure-aware cross-attention is proposed to learn embeddings between query and exemplar steps to discover their semantic, spatial, and temporal correspondences, and further serve for fine-grained contrastive regression to derive a reliable scoring mechanism. Extensive experiments demonstrate that our approach achieves substantial improvements over state-of-the-art methods with better interpretability. The dataset and code are available at https://github.com/xujinglin/FineDiving.

Submitted 7 April, 2022; originally announced April 2022.

Comments: Computer Vision and Pattern Recognition 2022 (Oral presentation)

arXiv:2204.03636 [pdf, other]

SurroundDepth: Entangling Surrounding Views for Self-Supervised Multi-Camera Depth Estimation

Authors: Yi Wei, Linqing Zhao, Wenzhao Zheng, Zheng Zhu, Yongming Rao, Guan Huang, Jiwen Lu, Jie Zhou

Abstract: Depth estimation from images serves as the fundamental step of 3D perception for autonomous driving and is an economical alternative to expensive depth sensors like LiDAR. The temporal photometric constraints enable self-supervised depth estimation without labels, further facilitating its application. However, most existing methods predict the depth solely based on each monocular image and ignore the correlations among multiple surrounding cameras, which are typically available for modern self-driving vehicles. In this paper, we propose a SurroundDepth method to incorporate the information from multiple surrounding views to predict depth maps across cameras. Specifically, we employ a joint network to process all the surrounding views and propose a cross-view transformer to effectively fuse the information from multiple views. We apply cross-view self-attention to efficiently enable the global interactions between multi-camera feature maps. Different from self-supervised monocular depth estimation, we are able to predict real-world scales given multi-camera extrinsic matrices. To achieve this goal, we adopt the two-frame structure-from-motion to extract scale-aware pseudo depths to pretrain the models. Further, instead of predicting the ego-motion of each individual camera, we estimate a universal ego-motion of the vehicle and transfer it to each view to achieve multi-view ego-motion consistency. In experiments, our method achieves the state-of-the-art performance on the challenging multi-camera depth estimation datasets DDAD and nuScenes.

Submitted 20 September, 2022; v1 submitted 7 April, 2022; originally announced April 2022.

Comments: Accepted to CoRL 2022. Project page: https://surrounddepth.ivg-research.xyz Code: https://github.com/weiyithu/SurroundDepth

arXiv:2203.16613 [pdf, other]

doi 10.1063/5.0087730

High thermoelectric performance in metastable phase of silicon: a first-principles study

Authors: Yongchao Rao, C. Y. Zhao, Shenghong Ju

Abstract: In this work, both thermal and electrical transport properties of diamond-cubic Si (Si-I) and the metastable R8 phase of Si (Si-XII) are comparatively studied by using first-principles calculations combined with Boltzmann transport theory. The metastable Si-XII shows one order of magnitude lower lattice thermal conductivity than stable Si-I from 300 to 500 K, attributed to the stronger phonon scattering in three-phonon scattering processes of Si-XII. For the electronic transport properties, although Si-XII with its smaller band gap (0.22 eV) shows a lower Seebeck coefficient, the electrical conductivities of anisotropic $n$-type Si-XII show considerable values along the $x$ axis due to the small effective masses of electrons along this direction. The peaks of the thermoelectric figure of merit ($ZT$) in $n$-type Si-XII are higher than those of $p$-type ones along the same direction. Owing to the lower lattice thermal conductivity and favorable electrical conductivity, Si-XII exhibits larger optimal $ZT$ compared with Si-I in both $p$- and $n$-type doping. For $n$-type Si-XII, the optimal $ZT$ values at 300, 400, and 500 K can reach 0.24, 0.43, and 0.63 along the $x$ axis at carrier concentrations of $2.6\times10^{19}$, $4.1\times10^{19}$, and $4.8\times10^{19}$ cm$^{-3}$, respectively. The reported results indicate that metastable Si could be integrated into thermoelectric power generators.

Submitted 30 March, 2022; originally announced March 2022.

Journal ref: Applied Physics Letters 120, 163901, 2022

arXiv:2203.14956 [pdf, other]

LiDAR Distillation: Bridging the Beam-Induced Domain Gap for 3D Object Detection

Authors: Yi Wei, Zibu Wei, Yongming Rao, Jiaxin Li, Jie Zhou, Jiwen Lu

Abstract: In this paper, we propose LiDAR Distillation to bridge the domain gap induced by different LiDAR beams for 3D object detection. In many real-world applications, the LiDAR points used by mass-produced robots and vehicles usually have fewer beams than those in large-scale public datasets. Moreover, as the LiDARs are upgraded to other product models with different beam counts, it becomes challenging to utilize the labeled data captured by previous versions' high-resolution sensors. Despite the recent progress on domain adaptive 3D detection, most methods struggle to eliminate the beam-induced domain gap. We find that it is essential to align the point cloud density of the source domain with that of the target domain during the training process. Inspired by this discovery, we propose a progressive framework to mitigate the beam-induced domain shift. In each iteration, we first generate low-beam pseudo LiDAR by downsampling the high-beam point clouds. Then the teacher-student framework is employed to distill rich information from the data with more beams. Extensive experiments on Waymo, nuScenes and KITTI datasets with three different LiDAR-based detectors demonstrate the effectiveness of our LiDAR Distillation. Notably, our approach adds no computation cost at inference.

Submitted 14 August, 2022; v1 submitted 28 March, 2022; originally announced March 2022.

Comments: Accepted to ECCV 2022. Code is available at https://github.com/weiyithu/LiDAR-Distillation

arXiv:2203.14101

A Roadmap for Big Model

Authors: Sha Yuan, Hanyu Zhao, Shuai Zhao, Jiahong Leng, Yangxiao Liang, Xiaozhi Wang, Jifan Yu, Xin Lv, Zhou Shao, Jiaao He, Yankai Lin, Xu Han, Zhenghao Liu, Ning Ding, Yongming Rao, Yizhao Gao, Liang Zhang, Ming Ding, Cong Fang, Yisen Wang, Mingsheng Long, Jing Zhang, Yinpeng Dong, Tianyu Pang, Peng Cui , et al. (75 additional authors not shown)

Abstract: With the rapid development of deep learning, training Big Models (BMs) for multiple downstream tasks has become a popular paradigm. Researchers have achieved various outcomes in the construction of BMs and the application of BMs in many fields. At present, there is a lack of research work that sorts out the overall progress of BMs and guides the follow-up research. In this paper, we cover not only the BM technologies themselves but also the prerequisites for BM training and applications with BMs, dividing the BM review into four parts: Resource, Models, Key Technologies and Application. We introduce 16 specific BM-related topics in those four parts: Data, Knowledge, Computing System, Parallel Training System, Language Model, Vision Model, Multi-modal Model, Theory&Interpretability, Commonsense Reasoning, Reliability&Security, Governance, Evaluation, Machine Translation, Text Generation, Dialogue and Protein Research. In each topic, we clearly summarize the current studies and propose some future research directions. At the end of this paper, we conclude with a more general view of the further development of BMs.

Submitted 20 April, 2022; v1 submitted 26 March, 2022; originally announced March 2022.

Comments: This report has been withdrawn by the authors due to critical issues in Section 2.3.1 of Article 2

arXiv:2203.13777 [pdf, other]

Stochastic Trajectory Prediction via Motion Indeterminacy Diffusion

Authors: Tianpei Gu, Guangyi Chen, Junlong Li, Chunze Lin, Yongming Rao, Jie Zhou, Jiwen Lu

Abstract: Human behavior is inherently indeterminate, which requires the pedestrian trajectory prediction system to model the multi-modality of future motion states. Unlike existing stochastic trajectory prediction methods which usually use a latent variable to represent multi-modality, we explicitly simulate the process of human motion variation from indeterminate to determinate. In this paper, we present a new framework to formulate the trajectory prediction task as a reverse process of motion indeterminacy diffusion (MID), in which we progressively discard indeterminacy from all the walkable areas until reaching the desired trajectory. This process is learned with a parameterized Markov chain conditioned on the observed trajectories. We can adjust the length of the chain to control the degree of indeterminacy and balance the diversity and determinacy of the predictions. Specifically, we encode the historical behavior information and the social interactions as a state embedding and devise a Transformer-based diffusion model to capture the temporal dependencies of trajectories. Extensive experiments on the human trajectory prediction benchmarks including the Stanford Drone and ETH/UCY datasets demonstrate the superiority of our method. Code is available at https://github.com/gutianpei/MID.

Submitted 25 March, 2022; originally announced March 2022.

Comments: Accepted to CVPR 2022

Showing 1–50 of 142 results for author: Rao, Y