Search | arXiv e-print repository

P2P: Part-to-Part Motion Cues Guide a Strong Tracking Framework for LiDAR Point Clouds

Authors: Jiahao Nie, Fei Xie, Sifan Zhou, Xueyi Zhou, Dong-Kyu Chae, Zhiwei He

Abstract: 3D single object tracking (SOT) methods based on appearance matching has long suffered from insufficient appearance information incurred by incomplete, textureless and semantically deficient LiDAR point clouds. While motion paradigm exploits motion cues instead of appearance matching for tracking, it incurs complex multi-stage processing and segmentation module. In this paper, we first provide in-… ▽ More 3D single object tracking (SOT) methods based on appearance matching has long suffered from insufficient appearance information incurred by incomplete, textureless and semantically deficient LiDAR point clouds. While motion paradigm exploits motion cues instead of appearance matching for tracking, it incurs complex multi-stage processing and segmentation module. In this paper, we first provide in-depth explorations on motion paradigm, which proves that (\textbf{i}) it is feasible to directly infer target relative motion from point clouds across consecutive frames; (\textbf{ii}) fine-grained information comparison between consecutive point clouds facilitates target motion modeling. We thereby propose to perform part-to-part motion modeling for consecutive point clouds and introduce a novel tracking framework, termed \textbf{P2P}. The novel framework fuses each corresponding part information between consecutive point clouds, effectively exploring detailed information changes and thus modeling accurate target-related motion cues. Following this framework, we present P2P-point and P2P-voxel models, incorporating implicit and explicit part-to-part motion modeling by point- and voxel-based representation, respectively. Without bells and whistles, P2P-voxel sets a new state-of-the-art performance ($\sim$\textbf{89\%}, \textbf{72\%} and \textbf{63\%} precision on KITTI, NuScenes and Waymo Open Dataset, respectively). Moreover, under the same point-based representation, P2P-point outperforms the previous motion tracker M$^2$Track by \textbf{3.3\%} and \textbf{6.7\%} on the KITTI and NuScenes, while running at a considerably high speed of \textbf{107 Fps} on a single RTX3090 GPU. The source code and pre-trained models are available at \url{https://github.com/haooozi/P2P}. △ Less

Submitted 8 July, 2024; v1 submitted 6 July, 2024; originally announced July 2024.

Comments: The source code and pre-trained models are available at https://github.com/haooozi/P2P

arXiv:2407.04994 [pdf, other]

The Solution for Language-Enhanced Image New Category Discovery

Authors: Haonan Xu, Dian Chao, Xiangyu Wu, Zhonghua Wan, Yang Yang

Abstract: Treating texts as images, combining prompts with textual labels for prompt tuning, and leveraging the alignment properties of CLIP have been successfully applied in zero-shot multi-label image recognition. Nonetheless, relying solely on textual labels to store visual information is insufficient for representing the diversity of visual objects. In this paper, we propose reversing the training proce… ▽ More Treating texts as images, combining prompts with textual labels for prompt tuning, and leveraging the alignment properties of CLIP have been successfully applied in zero-shot multi-label image recognition. Nonetheless, relying solely on textual labels to store visual information is insufficient for representing the diversity of visual objects. In this paper, we propose reversing the training process of CLIP and introducing the concept of Pseudo Visual Prompts. These prompts are initialized for each object category and pre-trained on large-scale, low-cost sentence data generated by large language models. This process mines the aligned visual information in CLIP and stores it in class-specific visual prompts. We then employ contrastive learning to transfer the stored visual information to the textual labels, enhancing their visual representation capacity. Additionally, we introduce a dual-adapter module that simultaneously leverages knowledge from the original CLIP and new learning knowledge derived from downstream datasets. Benefiting from the pseudo visual prompts, our method surpasses the state-of-the-art not only on clean annotated text data but also on pseudo text data generated by large language models. △ Less

Submitted 6 July, 2024; originally announced July 2024.

arXiv:2407.01907 [pdf, other]

The Solution for the ICCV 2023 Perception Test Challenge 2023 -- Task 6 -- Grounded videoQA

Authors: Hailiang Zhang, Dian Chao, Zhihao Guan, Yang Yang

Abstract: In this paper, we introduce a grounded video question-answering solution. Our research reveals that the fixed official baseline method for video question answering involves two main steps: visual grounding and object tracking. However, a significant challenge emerges during the initial step, where selected frames may lack clearly identifiable target objects. Furthermore, single images cannot addre… ▽ More In this paper, we introduce a grounded video question-answering solution. Our research reveals that the fixed official baseline method for video question answering involves two main steps: visual grounding and object tracking. However, a significant challenge emerges during the initial step, where selected frames may lack clearly identifiable target objects. Furthermore, single images cannot address questions like "Track the container from which the person pours the first time." To tackle this issue, we propose an alternative two-stage approach:(1) First, we leverage the VALOR model to answer questions based on video information.(2) concatenate the answered questions with their respective answers. Finally, we employ TubeDETR to generate bounding boxes for the targets. △ Less

Submitted 1 July, 2024; originally announced July 2024.

arXiv:2404.03528 [pdf, other]

BanglaAutoKG: Automatic Bangla Knowledge Graph Construction with Semantic Neural Graph Filtering

Authors: Azmine Toushik Wasi, Taki Hasan Rafi, Raima Islam, Dong-Kyu Chae

Abstract: Knowledge Graphs (KGs) have proven essential in information processing and reasoning applications because they link related entities and give context-rich information, supporting efficient information retrieval and knowledge discovery; presenting information flow in a very effective manner. Despite being widely used globally, Bangla is relatively underrepresented in KGs due to a lack of comprehens… ▽ More Knowledge Graphs (KGs) have proven essential in information processing and reasoning applications because they link related entities and give context-rich information, supporting efficient information retrieval and knowledge discovery; presenting information flow in a very effective manner. Despite being widely used globally, Bangla is relatively underrepresented in KGs due to a lack of comprehensive datasets, encoders, NER (named entity recognition) models, POS (part-of-speech) taggers, and lemmatizers, hindering efficient information processing and reasoning applications in the language. Addressing the KG scarcity in Bengali, we propose BanglaAutoKG, a pioneering framework that is able to automatically construct Bengali KGs from any Bangla text. We utilize multilingual LLMs to understand various languages and correlate entities and relations universally. By employing a translation dictionary to identify English equivalents and extracting word features from pre-trained BERT models, we construct the foundational KG. To reduce noise and align word embeddings with our goal, we employ graph-based polynomial filters. Lastly, we implement a GNN-based semantic filter, which elevates contextual understanding and trims unnecessary edges, culminating in the formation of the definitive KG. Empirical findings and case studies demonstrate the universal effectiveness of our model, capable of autonomously constructing semantically enriched KGs from any text. △ Less

Submitted 5 June, 2024; v1 submitted 4 April, 2024; originally announced April 2024.

Comments: 7 pages, 3 figures. Accepted to LREC-COLING 2024. Read in ACL Anthology: https://aclanthology.org/2024.lrec-main.189/

Journal ref: The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

arXiv:2404.01104 [pdf, other]

SentiCSE: A Sentiment-aware Contrastive Sentence Embedding Framework with Sentiment-guided Textual Similarity

Authors: Jaemin Kim, Yohan Na, Kangmin Kim, Sang Rak Lee, Dong-Kyu Chae

Abstract: Recently, sentiment-aware pre-trained language models (PLMs) demonstrate impressive results in downstream sentiment analysis tasks. However, they neglect to evaluate the quality of their constructed sentiment representations; they just focus on improving the fine-tuning performance, which overshadows the representation quality. We argue that without guaranteeing the representation quality, their d… ▽ More Recently, sentiment-aware pre-trained language models (PLMs) demonstrate impressive results in downstream sentiment analysis tasks. However, they neglect to evaluate the quality of their constructed sentiment representations; they just focus on improving the fine-tuning performance, which overshadows the representation quality. We argue that without guaranteeing the representation quality, their downstream performance can be highly dependent on the supervision of the fine-tuning data rather than representation quality. This problem would make them difficult to foray into other sentiment-related domains, especially where labeled data is scarce. We first propose Sentiment-guided Textual Similarity (SgTS), a novel metric for evaluating the quality of sentiment representations, which is designed based on the degree of equivalence in sentiment polarity between two sentences. We then propose SentiCSE, a novel Sentiment-aware Contrastive Sentence Embedding framework for constructing sentiment representations via combined word-level and sentence-level objectives, whose quality is guaranteed by SgTS. Qualitative and quantitative comparison with the previous sentiment-aware PLMs shows the superiority of our work. Our code is available at: https://github.com/nayohan/SentiCSE △ Less

Submitted 1 April, 2024; originally announced April 2024.

Comments: 14 pages, 8 figures

MSC Class: 68T50 ACM Class: I.2.7

Journal ref: LREC-COLING2024

arXiv:2403.17639 [pdf, other]

High-Resolution Image Translation Model Based on Grayscale Redefinition

Authors: Xixian Wu, Dian Chao, Yang Yang

Abstract: Image-to-image translation is a technique that focuses on transferring images from one domain to another while maintaining the essential content representations. In recent years, image-to-image translation has gained significant attention and achieved remarkable advancements due to its diverse applications in computer vision and image processing tasks. In this work, we propose an innovative method… ▽ More Image-to-image translation is a technique that focuses on transferring images from one domain to another while maintaining the essential content representations. In recent years, image-to-image translation has gained significant attention and achieved remarkable advancements due to its diverse applications in computer vision and image processing tasks. In this work, we propose an innovative method for image translation between different domains. For high-resolution image translation tasks, we use a grayscale adjustment method to achieve pixel-level translation. For other tasks, we utilize the Pix2PixHD model with a coarse-to-fine generator, multi-scale discriminator, and improved loss to enhance the image translation performance. On the other hand, to tackle the issue of sparse training data, we adopt model weight initialization from other task to optimize the performance of the current task. △ Less

Submitted 1 April, 2024; v1 submitted 26 March, 2024; originally announced March 2024.

arXiv:2403.17342 [pdf, other]

The Solution for the ICCV 2023 1st Scientific Figure Captioning Challenge

Authors: Dian Chao, Xin Song, Shupeng Zhong, Boyuan Wang, Xiangyu Wu, Chen Zhu, Yang Yang

Abstract: In this paper, we propose a solution for improving the quality of captions generated for figures in papers. We adopt the approach of summarizing the textual content in the paper to generate image captions. Throughout our study, we encounter discrepancies in the OCR information provided in the official dataset. To rectify this, we employ the PaddleOCR toolkit to extract OCR information from all ima… ▽ More In this paper, we propose a solution for improving the quality of captions generated for figures in papers. We adopt the approach of summarizing the textual content in the paper to generate image captions. Throughout our study, we encounter discrepancies in the OCR information provided in the official dataset. To rectify this, we employ the PaddleOCR toolkit to extract OCR information from all images. Moreover, we observe that certain textual content in the official paper pertains to images that are not relevant for captioning, thereby introducing noise during caption generation. To mitigate this issue, we leverage LLaMA to extract image-specific information by querying the textual content based on image mentions, effectively filtering out extraneous information. Additionally, we recognize a discrepancy between the primary use of maximum likelihood estimation during text generation and the evaluation metrics such as ROUGE employed to assess the quality of generated captions. To bridge this gap, we integrate the BRIO model framework, enabling a more coherent alignment between the generation and evaluation processes. Our approach ranked first in the final test with a score of 4.49. △ Less

Submitted 25 March, 2024; originally announced March 2024.

arXiv:2403.17210 [pdf, other]

CADGL: Context-Aware Deep Graph Learning for Predicting Drug-Drug Interactions

Authors: Azmine Toushik Wasi, Taki Hasan Rafi, Raima Islam, Serbetar Karlo, Dong-Kyu Chae

Abstract: Examining Drug-Drug Interactions (DDIs) is a pivotal element in the process of drug development. DDIs occur when one drug's properties are affected by the inclusion of other drugs. Detecting favorable DDIs has the potential to pave the way for creating and advancing innovative medications applicable in practical settings. However, existing DDI prediction models continue to face challenges related… ▽ More Examining Drug-Drug Interactions (DDIs) is a pivotal element in the process of drug development. DDIs occur when one drug's properties are affected by the inclusion of other drugs. Detecting favorable DDIs has the potential to pave the way for creating and advancing innovative medications applicable in practical settings. However, existing DDI prediction models continue to face challenges related to generalization in extreme cases, robust feature extraction, and real-life application possibilities. We aim to address these challenges by leveraging the effectiveness of context-aware deep graph learning by introducing a novel framework named CADGL. Based on a customized variational graph autoencoder (VGAE), we capture critical structural and physio-chemical information using two context preprocessors for feature extraction from two different perspectives: local neighborhood and molecular context, in a heterogeneous graphical structure. Our customized VGAE consists of a graph encoder, a latent information encoder, and an MLP decoder. CADGL surpasses other state-of-the-art DDI prediction models, excelling in predicting clinically valuable novel DDIs, supported by rigorous case studies. △ Less

Submitted 27 March, 2024; v1 submitted 25 March, 2024; originally announced March 2024.

Comments: 8 Pages, 4 Figures; In review

arXiv:2403.12984 [pdf, other]

When SMILES have Language: Drug Classification using Text Classification Methods on Drug SMILES Strings

Authors: Azmine Toushik Wasi, Šerbetar Karlo, Raima Islam, Taki Hasan Rafi, Dong-Kyu Chae

Abstract: Complex chemical structures, like drugs, are usually defined by SMILES strings as a sequence of molecules and bonds. These SMILES strings are used in different complex machine learning-based drug-related research and representation works. Escaping from complex representation, in this work, we pose a single question: What if we treat drug SMILES as conventional sentences and engage in text classifi… ▽ More Complex chemical structures, like drugs, are usually defined by SMILES strings as a sequence of molecules and bonds. These SMILES strings are used in different complex machine learning-based drug-related research and representation works. Escaping from complex representation, in this work, we pose a single question: What if we treat drug SMILES as conventional sentences and engage in text classification for drug classification? Our experiments affirm the possibility with very competitive scores. The study explores the notion of viewing each atom and bond as sentence components, employing basic NLP methods to categorize drug types, proving that complex problems can also be solved with simpler perspectives. The data and code are available here: https://github.com/azminewasi/Drug-Classification-NLP. △ Less

Submitted 27 March, 2024; v1 submitted 3 March, 2024; originally announced March 2024.

Comments: 7 pages, 2 figures, 5 tables, Accepted (invited to present) to the The Second Tiny Papers Track at ICLR 2024 (https://openreview.net/forum?id=VUYCyH8fCw)

Journal ref: The Second Tiny Papers Track at {ICLR} 2024, Tiny Papers @ {ICLR} 2024, Vienna Austria, May 11, 2024

arXiv:2402.16040 [pdf, other]

EHRNoteQA: An LLM Benchmark for Real-World Clinical Practice Using Discharge Summaries

Authors: Sunjun Kweon, Jiyoun Kim, Heeyoung Kwak, Dongchul Cha, Hangyul Yoon, Kwanghyun Kim, Jeewon Yang, Seunghyun Won, Edward Choi

Abstract: Discharge summaries in Electronic Health Records (EHRs) are crucial for clinical decision-making, but their length and complexity make information extraction challenging, especially when dealing with accumulated summaries across multiple patient admissions. Large Language Models (LLMs) show promise in addressing this challenge by efficiently analyzing vast and complex data. Existing benchmarks, ho… ▽ More Discharge summaries in Electronic Health Records (EHRs) are crucial for clinical decision-making, but their length and complexity make information extraction challenging, especially when dealing with accumulated summaries across multiple patient admissions. Large Language Models (LLMs) show promise in addressing this challenge by efficiently analyzing vast and complex data. Existing benchmarks, however, fall short in properly evaluating LLMs' capabilities in this context, as they typically focus on single-note information or limited topics, failing to reflect the real-world inquiries required by clinicians. To bridge this gap, we introduce EHRNoteQA, a novel benchmark built on the MIMIC-IV EHR, comprising 962 different QA pairs each linked to distinct patients' discharge summaries. Every QA pair is initially generated using GPT-4 and then manually reviewed and refined by three clinicians to ensure clinical relevance. EHRNoteQA includes questions that require information across multiple discharge summaries and covers eight diverse topics, mirroring the complexity and diversity of real clinical inquiries. We offer EHRNoteQA in two formats: open-ended and multi-choice question answering, and propose a reliable evaluation method for each. We evaluate 27 LLMs using EHRNoteQA and examine various factors affecting the model performance (e.g., the length and number of discharge summaries). Furthermore, to validate EHRNoteQA as a reliable proxy for expert evaluations in clinical practice, we measure the correlation between the LLM performance on EHRNoteQA, and the LLM performance manually evaluated by clinicians. Results show that LLM performance on EHRNoteQA have higher correlation with clinician-evaluated performance (Spearman: 0.78, Kendall: 0.62) compared to other benchmarks, demonstrating its practical relevance in evaluating LLMs in clinical settings. △ Less

Submitted 27 June, 2024; v1 submitted 25 February, 2024; originally announced February 2024.

Comments: Under Review

arXiv:2401.11204 [pdf, other]

Towards Category Unification of 3D Single Object Tracking on Point Clouds

Authors: Jiahao Nie, Zhiwei He, Xudong Lv, Xueyi Zhou, Dong-Kyu Chae, Fei Xie

Abstract: Category-specific models are provenly valuable methods in 3D single object tracking (SOT) regardless of Siamese or motion-centric paradigms. However, such over-specialized model designs incur redundant parameters, thus limiting the broader applicability of 3D SOT task. This paper first introduces unified models that can simultaneously track objects across all categories using a single network with… ▽ More Category-specific models are provenly valuable methods in 3D single object tracking (SOT) regardless of Siamese or motion-centric paradigms. However, such over-specialized model designs incur redundant parameters, thus limiting the broader applicability of 3D SOT task. This paper first introduces unified models that can simultaneously track objects across all categories using a single network with shared model parameters. Specifically, we propose to explicitly encode distinct attributes associated to different object categories, enabling the model to adapt to cross-category data. We find that the attribute variances of point cloud objects primarily occur from the varying size and shape (e.g., large and square vehicles v.s. small and slender humans). Based on this observation, we design a novel point set representation learning network inheriting transformer architecture, termed AdaFormer, which adaptively encodes the dynamically varying shape and size information from cross-category data in a unified manner. We further incorporate the size and shape prior derived from the known template targets into the model's inputs and learning objective, facilitating the learning of unified representation. Equipped with such designs, we construct two category-unified models SiamCUT and MoCUT.Extensive experiments demonstrate that SiamCUT and MoCUT exhibit strong generalization and training stability. Furthermore, our category-unified models outperform the category-specific counterparts by a significant margin (e.g., on KITTI dataset, 12% and 3% performance gains on the Siamese and motion paradigms). Our code will be available. △ Less

Submitted 20 January, 2024; originally announced January 2024.

Comments: Accepted by ICLR2024 (poster)

arXiv:2312.03011 [pdf, other]

InstructBooth: Instruction-following Personalized Text-to-Image Generation

Authors: Daewon Chae, Nokyung Park, Jinkyu Kim, Kimin Lee

Abstract: Personalizing text-to-image models using a limited set of images for a specific object has been explored in subject-specific image generation. However, existing methods often face challenges in aligning with text prompts due to overfitting to the limited training images. In this work, we introduce InstructBooth, a novel method designed to enhance image-text alignment in personalized text-to-image… ▽ More Personalizing text-to-image models using a limited set of images for a specific object has been explored in subject-specific image generation. However, existing methods often face challenges in aligning with text prompts due to overfitting to the limited training images. In this work, we introduce InstructBooth, a novel method designed to enhance image-text alignment in personalized text-to-image models without sacrificing the personalization ability. Our approach first personalizes text-to-image models with a small number of subject-specific images using a unique identifier. After personalization, we fine-tune personalized text-to-image models using reinforcement learning to maximize a reward that quantifies image-text alignment. Additionally, we propose complementary techniques to increase the synergy between these two processes. Our method demonstrates superior image-text alignment compared to existing baselines, while maintaining high personalization ability. In human evaluations, InstructBooth outperforms them when considering all comprehensive factors. Our project page is at https://sites.google.com/view/instructbooth. △ Less

Submitted 15 February, 2024; v1 submitted 4 December, 2023; originally announced December 2023.

arXiv:2310.02692 [pdf, other]

Clustering-based Image-Text Graph Matching for Domain Generalization

Authors: Nokyung Park, Daewon Chae, Jeongyong Shim, Sangpil Kim, Eun-Sol Kim, Jinkyu Kim

Abstract: Learning domain-invariant visual representations is important to train a model that can generalize well to unseen target task domains. Recent works demonstrate that text descriptions contain high-level class-discriminative information and such auxiliary semantic cues can be used as effective pivot embedding for domain generalization problem. However, they use pivot embedding in global manner (i.e.… ▽ More Learning domain-invariant visual representations is important to train a model that can generalize well to unseen target task domains. Recent works demonstrate that text descriptions contain high-level class-discriminative information and such auxiliary semantic cues can be used as effective pivot embedding for domain generalization problem. However, they use pivot embedding in global manner (i.e., aligning an image embedding with sentence-level text embedding), not fully utilizing the semantic cues of given text description. In this work, we advocate for the use of local alignment between image regions and corresponding textual descriptions. To this end, we first represent image and text inputs with graphs. We subsequently cluster nodes in those graphs and match the graph-based image node features into textual graphs. This matching process is conducted globally and locally, tightly aligning visual and textual semantic sub-structures. We experiment with large-scale public datasets, such as CUB-DG and DomainBed, and our model achieves matched or better state-of-the-art performance on these datasets. Our code will be publicly available upon publication. △ Less

Submitted 15 April, 2024; v1 submitted 4 October, 2023; originally announced October 2023.

arXiv:2306.08402 [pdf, other]

Fairness and Privacy-Preserving in Federated Learning: A Survey

Authors: Taki Hasan Rafi, Faiza Anan Noor, Tahmid Hussain, Dong-Kyu Chae

Abstract: Federated learning (FL) as distributed machine learning has gained popularity as privacy-aware Machine Learning (ML) systems have emerged as a technique that prevents privacy leakage by building a global model and by conducting individualized training of decentralized edge clients on their own private data. The existing works, however, employ privacy mechanisms such as Secure Multiparty Computing… ▽ More Federated learning (FL) as distributed machine learning has gained popularity as privacy-aware Machine Learning (ML) systems have emerged as a technique that prevents privacy leakage by building a global model and by conducting individualized training of decentralized edge clients on their own private data. The existing works, however, employ privacy mechanisms such as Secure Multiparty Computing (SMC), Differential Privacy (DP), etc. Which are immensely susceptible to interference, massive computational overhead, low accuracy, etc. With the increasingly broad deployment of FL systems, it is challenging to ensure fairness and maintain active client participation in FL systems. Very few works ensure reasonably satisfactory performances for the numerous diverse clients and fail to prevent potential bias against particular demographics in FL systems. The current efforts fail to strike a compromise between privacy, fairness, and model performance in FL systems and are vulnerable to a number of additional problems. In this paper, we provide a comprehensive survey stating the basic concepts of FL, the existing privacy challenges, techniques, and relevant works concerning privacy in FL. We also provide an extensive overview of the increasing fairness challenges, existing fairness notions, and the limited works that attempt both privacy and fairness in FL. By comprehensively describing the existing FL systems, we present the potential future directions pertaining to the challenges of privacy-preserving and fairness-aware FL systems. △ Less

Submitted 14 July, 2023; v1 submitted 14 June, 2023; originally announced June 2023.

Comments: 23 pages; 2 figures

arXiv:2305.01486 [pdf, other]

ARBEx: Attentive Feature Extraction with Reliability Balancing for Robust Facial Expression Learning

Authors: Azmine Toushik Wasi, Karlo Šerbetar, Raima Islam, Taki Hasan Rafi, Dong-Kyu Chae

Abstract: In this paper, we introduce a framework ARBEx, a novel attentive feature extraction framework driven by Vision Transformer with reliability balancing to cope against poor class distributions, bias, and uncertainty in the facial expression learning (FEL) task. We reinforce several data pre-processing and refinement methods along with a window-based cross-attention ViT to squeeze the best of the dat… ▽ More In this paper, we introduce a framework ARBEx, a novel attentive feature extraction framework driven by Vision Transformer with reliability balancing to cope against poor class distributions, bias, and uncertainty in the facial expression learning (FEL) task. We reinforce several data pre-processing and refinement methods along with a window-based cross-attention ViT to squeeze the best of the data. We also employ learnable anchor points in the embedding space with label distributions and multi-head self-attention mechanism to optimize performance against weak predictions with reliability balancing, which is a strategy that leverages anchor points, attention scores, and confidence values to enhance the resilience of label predictions. To ensure correct label classification and improve the models' discriminative power, we introduce anchor loss, which encourages large margins between anchor points. Additionally, the multi-head self-attention mechanism, which is also trainable, plays an integral role in identifying accurate labels. This approach provides critical elements for improving the reliability of predictions and has a substantial positive effect on final prediction capabilities. Our adaptive model can be integrated with any deep neural network to forestall challenges in various recognition tasks. Our strategy outperforms current state-of-the-art methodologies, according to extensive experiments conducted in a variety of contexts. △ Less

Submitted 14 July, 2023; v1 submitted 2 May, 2023; originally announced May 2023.

Comments: 12 pages, 7 figures. Code: https://github.com/takihasan/ARBEx

arXiv:2303.14787 [pdf, other]

A Generalized Look at Federated Learning: Survey and Perspectives

Authors: Taki Hasan Rafi, Faiza Anan Noor, Tahmid Hussain, Dong-Kyu Chae, Zhaohui Yang

Abstract: Federated learning (FL) refers to a distributed machine learning framework involving learning from several decentralized edge clients without sharing local dataset. This distributed strategy prevents data leakage and enables on-device training as it updates the global model based on the local model updates. Despite offering several advantages, including data privacy and scalability, FL poses chall… ▽ More Federated learning (FL) refers to a distributed machine learning framework involving learning from several decentralized edge clients without sharing local dataset. This distributed strategy prevents data leakage and enables on-device training as it updates the global model based on the local model updates. Despite offering several advantages, including data privacy and scalability, FL poses challenges such as statistical and system heterogeneity of data in federated networks, communication bottlenecks, privacy and security issues. This survey contains a systematic summarization of previous work, studies, and experiments on FL and presents a list of possibilities for FL across a range of applications and use cases. Other than that, various challenges of implementing FL and promising directions revolving around the corresponding challenges are provided. △ Less

Submitted 26 March, 2023; originally announced March 2023.

Comments: 9 pages, 2 figures

arXiv:2209.09475 [pdf, other]

Revisiting Image Pyramid Structure for High Resolution Salient Object Detection

Authors: Taehun Kim, Kunhee Kim, Joonyeong Lee, Dongmin Cha, Jiho Lee, Daijin Kim

Abstract: Salient object detection (SOD) has been in the spotlight recently, yet has been studied less for high-resolution (HR) images. Unfortunately, HR images and their pixel-level annotations are certainly more labor-intensive and time-consuming compared to low-resolution (LR) images and annotations. Therefore, we propose an image pyramid-based SOD framework, Inverse Saliency Pyramid Reconstruction Netwo… ▽ More Salient object detection (SOD) has been in the spotlight recently, yet has been studied less for high-resolution (HR) images. Unfortunately, HR images and their pixel-level annotations are certainly more labor-intensive and time-consuming compared to low-resolution (LR) images and annotations. Therefore, we propose an image pyramid-based SOD framework, Inverse Saliency Pyramid Reconstruction Network (InSPyReNet), for HR prediction without any of HR datasets. We design InSPyReNet to produce a strict image pyramid structure of saliency map, which enables to ensemble multiple results with pyramid-based image blending. For HR prediction, we design a pyramid blending method which synthesizes two different image pyramids from a pair of LR and HR scale from the same image to overcome effective receptive field (ERF) discrepancy. Our extensive evaluations on public LR and HR SOD benchmarks demonstrate that InSPyReNet surpasses the State-of-the-Art (SotA) methods on various SOD metrics and boundary accuracy. △ Less

Submitted 16 November, 2022; v1 submitted 20 September, 2022; originally announced September 2022.

Comments: 27 pages, 15 figures, 7 tables. To appear in the 16th Asian Conference on Computer Vision (ACCV2022), December 4-8, 2022, Macau SAR, China. DOI will be added soon. Results on DIS5K are added in appendices which will not be in the published version

arXiv:2204.09442 [pdf, ps, other]

DAM-GAN : Image Inpainting using Dynamic Attention Map based on Fake Texture Detection

Authors: Dongmin Cha, Daijin Kim

Abstract: Deep neural advancements have recently brought remarkable image synthesis performance to the field of image inpainting. The adaptation of generative adversarial networks (GAN) in particular has accelerated significant progress in high-quality image reconstruction. However, although many notable GAN-based networks have been proposed for image inpainting, still pixel artifacts or color inconsistency… ▽ More Deep neural advancements have recently brought remarkable image synthesis performance to the field of image inpainting. The adaptation of generative adversarial networks (GAN) in particular has accelerated significant progress in high-quality image reconstruction. However, although many notable GAN-based networks have been proposed for image inpainting, still pixel artifacts or color inconsistency occur in synthesized images during the generation process, which are usually called fake textures. To reduce pixel inconsistency disorder resulted from fake textures, we introduce a GAN-based model using dynamic attention map (DAM-GAN). Our proposed DAM-GAN concentrates on detecting fake texture and products dynamic attention maps to diminish pixel inconsistency from the feature maps in the generator. Evaluation results on CelebA-HQ and Places2 datasets with other image inpainting approaches show the superiority of our network. △ Less

Submitted 20 April, 2022; originally announced April 2022.

arXiv:2101.06371 [pdf, other]

NNStreamer: Efficient and Agile Development of On-Device AI Systems

Authors: MyungJoo Ham, Jijoong Moon, Geunsik Lim, Jaeyun Jung, Hyoungjoo Ahn, Wook Song, Sangjung Woo, Parichay Kapoor, Dongju Chae, Gichan Jang, Yongjoo Ahn, Jihoon Lee

Abstract: We propose NNStreamer, a software system that handles neural networks as filters of stream pipelines, applying the stream processing paradigm to deep neural network applications. A new trend with the wide-spread of deep neural network applications is on-device AI. It is to process neural networks on mobile devices or edge/IoT devices instead of cloud servers. Emerging privacy issues, data transmis… ▽ More We propose NNStreamer, a software system that handles neural networks as filters of stream pipelines, applying the stream processing paradigm to deep neural network applications. A new trend with the wide-spread of deep neural network applications is on-device AI. It is to process neural networks on mobile devices or edge/IoT devices instead of cloud servers. Emerging privacy issues, data transmission costs, and operational costs signify the need for on-device AI, especially if we deploy a massive number of devices. NNStreamer efficiently handles neural networks with complex data stream pipelines on devices, significantly improving the overall performance with minimal efforts. Besides, NNStreamer simplifies implementations and allows reusing off-the-shelf media filters directly, which reduces developmental costs significantly. We are already deploying NNStreamer for a wide range of products and platforms, including the Galaxy series and various consumer electronic devices. The experimental results suggest a reduction in developmental costs and enhanced performance of pipeline architectures and NNStreamer. It is an open-source project incubated by Linux Foundation AI, available to the public and applicable to various hardware and software platforms. △ Less

Submitted 15 January, 2021; originally announced January 2021.

Comments: IEEE/ACM ICSE 2021 SEIP (preprint)

arXiv:2003.09085 [pdf, other]

Small-Object Detection in Remote Sensing Images with End-to-End Edge-Enhanced GAN and Object Detector Network

Authors: Jakaria Rabbi, Nilanjan Ray, Matthias Schubert, Subir Chowdhury, Dennis Chao

Abstract: The detection performance of small objects in remote sensing images is not satisfactory compared to large objects, especially in low-resolution and noisy images. A generative adversarial network (GAN)-based model called enhanced super-resolution GAN (ESRGAN) shows remarkable image enhancement performance, but reconstructed images miss high-frequency edge information. Therefore, object detection pe… ▽ More The detection performance of small objects in remote sensing images is not satisfactory compared to large objects, especially in low-resolution and noisy images. A generative adversarial network (GAN)-based model called enhanced super-resolution GAN (ESRGAN) shows remarkable image enhancement performance, but reconstructed images miss high-frequency edge information. Therefore, object detection performance degrades for small objects on recovered noisy and low-resolution remote sensing images. Inspired by the success of edge enhanced GAN (EEGAN) and ESRGAN, we apply a new edge-enhanced super-resolution GAN (EESRGAN) to improve the image quality of remote sensing images and use different detector networks in an end-to-end manner where detector loss is backpropagated into the EESRGAN to improve the detection performance. We propose an architecture with three components: ESRGAN, Edge Enhancement Network (EEN), and Detection network. We use residual-in-residual dense blocks (RRDB) for both the ESRGAN and EEN, and for the detector network, we use the faster region-based convolutional network (FRCNN) (two-stage detector) and single-shot multi-box detector (SSD) (one stage detector). Extensive experiments on a public (car overhead with context) and a self-assembled (oil and gas storage tank) satellite dataset show superior performance of our method compared to the standalone state-of-the-art object detectors. △ Less

Submitted 28 April, 2020; v1 submitted 19 March, 2020; originally announced March 2020.

Comments: This paper contains 27 pages and accepted for publication in MDPI remote sensing journal. GitHub Repository: https://github.com/Jakaria08/EESRGAN (Implementation)

arXiv:1908.08151 [pdf, other]

Multi-level Graph Drawing using Infomap Clustering

Authors: Seok-Hee Hong, Peter Eades, Marnijati Torkel, Ziyang Wang, David Chae, Sungpack Hong, Daniel Langerenken, Hassan Chafi

Abstract: Infomap clustering finds the community structures that minimize the expected description length of a random walk trajectory; algorithms for infomap clustering run fast in practice for large graphs. In this paper we leverage the effectiveness of Infomap clustering combined with the multi-level graph drawing paradigm. Experiments show that our new Infomap based multi-level algorithm produces good vi… ▽ More Infomap clustering finds the community structures that minimize the expected description length of a random walk trajectory; algorithms for infomap clustering run fast in practice for large graphs. In this paper we leverage the effectiveness of Infomap clustering combined with the multi-level graph drawing paradigm. Experiments show that our new Infomap based multi-level algorithm produces good visualization of large and complex networks, with significant improvement in quality metrics. △ Less

Submitted 21 August, 2019; originally announced August 2019.

Comments: Appears in the Proceedings of the 27th International Symposium on Graph Drawing and Network Visualization (GD 2019)

arXiv:1811.00143 [pdf, other]

Democratizing Production-Scale Distributed Deep Learning

Authors: Minghuang Ma, Hadi Pouransari, Daniel Chao, Saurabh Adya, Santiago Akle Serrano, Yi Qin, Dan Gimnicher, Dominic Walsh

Abstract: The interest and demand for training deep neural networks have been experiencing rapid growth, spanning a wide range of applications in both academia and industry. However, training them distributed and at scale remains difficult due to the complex ecosystem of tools and hardware involved. One consequence is that the responsibility of orchestrating these complex components is often left to one-off… ▽ More The interest and demand for training deep neural networks have been experiencing rapid growth, spanning a wide range of applications in both academia and industry. However, training them distributed and at scale remains difficult due to the complex ecosystem of tools and hardware involved. One consequence is that the responsibility of orchestrating these complex components is often left to one-off scripts and glue code customized for specific problems. To address these restrictions, we introduce \emph{Alchemist} - an internal service built at Apple from the ground up for \emph{easy}, \emph{fast}, and \emph{scalable} distributed training. We discuss its design, implementation, and examples of running different flavors of distributed training. We also present case studies of its internal adoption in the development of autonomous systems, where training times have been reduced by 10x to keep up with the ever-growing data collection. △ Less

Submitted 3 November, 2018; v1 submitted 31 October, 2018; originally announced November 2018.

Showing 1–22 of 22 results for author: Chae, D