-
Weak-to-Strong Compositional Learning from Generative Models for Language-based Object Detection
Authors:
Kwanyong Park,
Kuniaki Saito,
Donghyun Kim
Abstract:
Vision-language (VL) models often exhibit a limited understanding of complex expressions of visual objects (e.g., attributes, shapes, and their relations), given complex and diverse language queries. Traditional approaches attempt to improve VL models using hard negative synthetic text, but their effectiveness is limited. In this paper, we harness the exceptional compositional understanding capabilities of generative foundational models. We introduce a novel method for structured synthetic data generation aimed at enhancing the compositional understanding of VL models in language-based object detection. Our framework generates densely paired positive and negative triplets (image, text descriptions, and bounding boxes) in both image and text domains. By leveraging these synthetic triplets, we transform 'weaker' VL models into 'stronger' models in terms of compositional understanding, a process we call "Weak-to-Strong Compositional Learning" (WSCL). To achieve this, we propose a new compositional contrastive learning formulation that discovers semantics and structures in complex descriptions from synthetic triplets. As a result, VL models trained with our synthetic data generation exhibit a significant performance boost in the Omnilabel benchmark by up to +5AP and the D3 benchmark by +6.9AP upon existing baselines.
Submitted 21 July, 2024;
originally announced July 2024.
-
SpecMaskGIT: Masked Generative Modeling of Audio Spectrograms for Efficient Audio Synthesis and Beyond
Authors:
Marco Comunità,
Zhi Zhong,
Akira Takahashi,
Shiqi Yang,
Mengjie Zhao,
Koichi Saito,
Yukara Ikemiya,
Takashi Shibuya,
Shusuke Takahashi,
Yuki Mitsufuji
Abstract:
Recent advances in generative models that iteratively synthesize audio clips have brought great success to text-to-audio synthesis (TTA), but at the cost of slow synthesis speed and heavy computation. Although there have been attempts to accelerate the iterative procedure, high-quality TTA systems remain inefficient due to the hundreds of iterations required in the inference phase and the large number of model parameters. To address these challenges, we propose SpecMaskGIT, a lightweight, efficient yet effective TTA model based on the masked generative modeling of spectrograms. First, SpecMaskGIT synthesizes a realistic 10s audio clip in fewer than 16 iterations, an order of magnitude fewer than previous iterative TTA methods. As a discrete model, SpecMaskGIT outperforms larger VQ-Diffusion and auto-regressive models in the TTA benchmark, while being real-time with only 4 CPU cores or even 30x faster with a GPU. Next, built upon a latent space of Mel-spectrogram, SpecMaskGIT has a wider range of applications (e.g., the zero-shot bandwidth extension) than similar methods built on the latent wave domain. Moreover, we interpret SpecMaskGIT as a generative extension to previous discriminative audio masked Transformers, and shed light on its audio representation learning potential. We hope our work inspires the exploration of masked audio modeling toward further diverse scenarios.
Submitted 26 June, 2024; v1 submitted 25 June, 2024;
originally announced June 2024.
-
SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation
Authors:
Koichi Saito,
Dongjun Kim,
Takashi Shibuya,
Chieh-Hsin Lai,
Zhi Zhong,
Yuhta Takida,
Yuki Mitsufuji
Abstract:
Sound content is an indispensable element for multimedia works such as video games, music, and films. Recent high-quality diffusion-based sound generation models can serve as valuable tools for the creators. However, despite producing high-quality sounds, these models often suffer from slow inference speeds. This drawback burdens creators, who typically refine their sounds through trial and error to align them with their artistic intentions. To address this issue, we introduce Sound Consistency Trajectory Models (SoundCTM). Our model enables flexible transitioning between high-quality 1-step sound generation and superior sound quality through multi-step generation. This allows creators to initially control sounds with 1-step samples before refining them through multi-step generation. While CTM fundamentally achieves flexible 1-step and multi-step generation, its impressive performance heavily depends on an additional pretrained feature extractor and an adversarial loss, which are expensive to train and not always available in other domains. Thus, we reframe CTM's training framework and introduce a novel feature distance by utilizing the teacher's network for a distillation loss. Additionally, while distilling classifier-free guided trajectories, we train conditional and unconditional student models simultaneously and interpolate between these models during inference. We also propose training-free controllable frameworks for SoundCTM, leveraging its flexible sampling capability. SoundCTM achieves both promising 1-step and multi-step real-time sound generation without using any extra off-the-shelf networks. Furthermore, we demonstrate SoundCTM's capability of controllable sound generation in a training-free manner. Our codes, pretrained models, and audio samples are available at https://github.com/sony/soundctm.
Submitted 10 June, 2024; v1 submitted 28 May, 2024;
originally announced May 2024.
-
Crystalformer: Infinitely Connected Attention for Periodic Structure Encoding
Authors:
Tatsunori Taniai,
Ryo Igarashi,
Yuta Suzuki,
Naoya Chiba,
Kotaro Saito,
Yoshitaka Ushiku,
Kanta Ono
Abstract:
Predicting physical properties of materials from their crystal structures is a fundamental problem in materials science. In peripheral areas such as the prediction of molecular properties, fully connected attention networks have been shown to be successful. However, unlike these finite atom arrangements, crystal structures are infinitely repeating, periodic arrangements of atoms, whose fully connected attention results in infinitely connected attention. In this work, we show that this infinitely connected attention can lead to a computationally tractable formulation, interpreted as neural potential summation, that performs infinite interatomic potential summations in a deeply learned feature space. We then propose a simple yet effective Transformer-based encoder architecture for crystal structures called Crystalformer. Compared to an existing Transformer-based model, the proposed model requires only 29.4% of the number of parameters, with minimal modifications to the original Transformer architecture. Despite the architectural simplicity, the proposed method outperforms state-of-the-art methods for various property regression tasks on the Materials Project and JARVIS-DFT datasets.
Submitted 18 March, 2024;
originally announced March 2024.
-
Where is the answer? Investigating Positional Bias in Language Model Knowledge Extraction
Authors:
Kuniaki Saito,
Kihyuk Sohn,
Chen-Yu Lee,
Yoshitaka Ushiku
Abstract:
Large language models require updates to remain up-to-date or adapt to new domains by fine-tuning them with new documents. One key is memorizing the latest information in a way that the memorized information is extractable with a query prompt. However, LLMs suffer from a phenomenon called perplexity curse; despite minimizing document perplexity during fine-tuning, LLMs struggle to extract information through a prompt sentence. In this new knowledge acquisition and extraction, we find a very intriguing fact that LLMs can accurately answer questions about the first sentence, but they struggle to extract information described in the middle or end of the documents used for fine-tuning. Our study suggests that the auto-regressive training causes this issue; each token is prompted by reliance on all previous tokens, which hinders the model from recalling information from training documents by question prompts. To conduct the in-depth study, we publish both synthetic and real datasets, enabling the evaluation of the QA performance w.r.t. the position of the corresponding answer in a document. Our investigation shows that even a large model suffers from the perplexity curse, but regularization such as denoising auto-regressive loss can enhance the information extraction from diverse positions. These findings will be (i) a key to improving knowledge extraction from LLMs and (ii) new elements to discuss the trade-off between RAG and fine-tuning in adapting LLMs to a new domain.
Submitted 23 May, 2024; v1 submitted 16 February, 2024;
originally announced February 2024.
-
InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions
Authors:
Ryota Tanaka,
Taichi Iki,
Kyosuke Nishida,
Kuniko Saito,
Jun Suzuki
Abstract:
We study the problem of completing various visual document understanding (VDU) tasks, e.g., question answering and information extraction, on real-world documents through human-written instructions. To this end, we propose InstructDoc, the first large-scale collection of 30 publicly available VDU datasets, each with diverse instructions in a unified format, which covers a wide range of 12 tasks and includes open document types/formats. Furthermore, to enhance the generalization performance on VDU tasks, we design a new instruction-based document reading and understanding model, InstructDr, that connects document images, image encoders, and large language models (LLMs) through a trainable bridging module. Experiments demonstrate that InstructDr can effectively adapt to new VDU datasets, tasks, and domains via given instructions and outperforms existing multimodal LLMs and ChatGPT without specific training.
Submitted 24 January, 2024;
originally announced January 2024.
-
Privacy-Preserving Hierarchical Anonymization Framework over Encrypted Data
Authors:
Jing Jia,
Kenta Saito,
Hiroaki Nishi
Abstract:
Smart cities, which can monitor the real world and provide smart services in a variety of fields, have improved people's living standards as urbanization has accelerated. However, there are security and privacy concerns because smart city applications collect large amounts of privacy-sensitive information from people and their social circles. Anonymization, which generalizes data and reduces data uniqueness, is an important step in preserving the privacy of sensitive information. However, anonymization methods frequently require large datasets and rely on untrusted third parties to collect and manage data, particularly in a cloud environment. In this case, private data leakage remains a critical issue, discouraging users from sharing their data and impeding the advancement of smart city services. This problem can be solved if the computational entity can perform the anonymization process without obtaining the original plain text. This study proposes a hierarchical k-anonymization framework using homomorphic encryption and secret sharing, composed of two types of domains. Different computing methods are selected flexibly, and the two domains are connected hierarchically to obtain higher-level anonymization results in an efficient manner. The experimental results show that connecting two domains can accelerate the anonymization process, indicating that the proposed secure hierarchical architecture is practical and efficient.
Submitted 18 October, 2023;
originally announced October 2023.
-
Verbosity Bias in Preference Labeling by Large Language Models
Authors:
Keita Saito,
Akifumi Wachi,
Koki Wataoka,
Youhei Akimoto
Abstract:
In recent years, Large Language Models (LLMs) have witnessed a remarkable surge in prevalence, altering the landscape of natural language processing and machine learning. One key factor in improving the performance of LLMs is alignment with humans achieved with Reinforcement Learning from Human Feedback (RLHF), as for many LLMs such as GPT-4, Bard, etc. In addition, recent studies are investigating the replacement of human feedback with feedback from other LLMs named Reinforcement Learning from AI Feedback (RLAIF). We examine the biases that come along with evaluating LLMs with other LLMs and take a closer look into verbosity bias -- a bias where LLMs sometimes prefer more verbose answers even if they have similar qualities. We see that in our problem setting, GPT-4 prefers longer answers more than humans. We also propose a metric to measure this bias.
Submitted 16 October, 2023;
originally announced October 2023.
-
Is Ethereum Proof of Stake Sustainable? - Considering from the Perspective of Competition Among Smart Contract Platforms -
Authors:
Kenji Saito,
Yutaka Soejima,
Toshihiko Sugiura,
Yukinobu Kitamura,
Mitsuru Iwamura
Abstract:
Since the Merge update upon which Ethereum transitioned to Proof of Stake, it has been touted that it resulted in lower power consumption and increased security. However, even if that is the case, can this state be sustained?
In this paper, we focus on the potential impact of competition with other smart contract platforms on the price of Ethereum's native currency, Ether (ETH), thereby raising questions about the safety and sustainability purportedly brought about by the design of Proof of Stake.
Submitted 20 September, 2023;
originally announced September 2023.
-
VRDMG: Vocal Restoration via Diffusion Posterior Sampling with Multiple Guidance
Authors:
Carlos Hernandez-Olivan,
Koichi Saito,
Naoki Murata,
Chieh-Hsin Lai,
Marco A. Martínez-Ramirez,
Wei-Hsiang Liao,
Yuki Mitsufuji
Abstract:
Restoring degraded music signals is essential to enhance audio quality for downstream music manipulation. Recent diffusion-based music restoration methods have demonstrated impressive performance, and among them, diffusion posterior sampling (DPS) stands out given its intrinsic properties, making it versatile across various restoration tasks. In this paper, we identify potential issues that degrade the performance of current DPS-based methods and introduce ways to mitigate them, inspired by diverse diffusion guidance techniques including the RePaint (RP) strategy and the Pseudoinverse-Guided Diffusion Models (ΠGDM). We demonstrate our methods for the vocal declipping and bandwidth extension tasks under various levels of distortion and cutoff frequency, respectively. In both tasks, our methods outperform the current DPS-based music restoration benchmarks. Examples of the restored audio samples are available at http://carlosholivan.github.io/demos/audio-restoration-2023.html.
Submitted 13 September, 2023;
originally announced September 2023.
-
ERM++: An Improved Baseline for Domain Generalization
Authors:
Piotr Teterwak,
Kuniaki Saito,
Theodoros Tsiligkaridis,
Kate Saenko,
Bryan A. Plummer
Abstract:
Domain Generalization (DG) measures a classifier's ability to generalize to new distributions of data it was not trained on. Recent work has shown that a hyperparameter-tuned Empirical Risk Minimization (ERM) training procedure, that is, simply minimizing the empirical risk on the source domains, can outperform most existing DG methods. ERM has achieved such strong results while only tuning hyperparameters such as learning rate, weight decay, batch size, and dropout. However, there are additional hyperparameters which further limit overfitting and catastrophic forgetting. We therefore focus on tuning previously untuned hyperparameters, including training amount, initialization, and additional regularizers. We call the resulting stronger baseline ERM++. ERM++ improves the performance of DG by over 5% compared to prior ERM baselines on a standard benchmark of 5 datasets with a ResNet-50 and over 15% with a ViT-B/16, and outperforms all SOTA methods on DomainBed with both architectures. We also explore the relationship between DG performance and similarity to pre-training data, and find that similarity to pre-training data distributions is an important driver of performance, but that ERM++ with stronger initializations can deliver strong performance even on dissimilar datasets. Code is released at https://github.com/piotr-teterwak/erm_plusplus.
Submitted 26 March, 2024; v1 submitted 4 April, 2023;
originally announced April 2023.
-
Mind the Backbone: Minimizing Backbone Distortion for Robust Object Detection
Authors:
Kuniaki Saito,
Donghyun Kim,
Piotr Teterwak,
Rogerio Feris,
Kate Saenko
Abstract:
Building object detectors that are robust to domain shifts is critical for real-world applications. Prior approaches fine-tune a pre-trained backbone and risk overfitting it to in-distribution (ID) data and distorting features useful for out-of-distribution (OOD) generalization. We propose to use Relative Gradient Norm (RGN) as a way to measure the vulnerability of a backbone to feature distortion, and show that high RGN is indeed correlated with lower OOD performance. Our analysis of RGN yields interesting findings: some backbones lose OOD robustness during fine-tuning, but others gain robustness because their architecture prevents the parameters from changing too much from the initial model. Given these findings, we present recipes to boost OOD robustness for both types of backbones. Specifically, we investigate regularization and architectural choices for minimizing gradient updates so as to prevent the tuned backbone from losing generalizable features. Our proposed techniques complement each other and show substantial improvements over baselines on diverse architectures and datasets. Code is available at https://github.com/VisionLearningGroup/mind_back.
Submitted 15 May, 2023; v1 submitted 26 March, 2023;
originally announced March 2023.
-
Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval
Authors:
Kuniaki Saito,
Kihyuk Sohn,
Xiang Zhang,
Chun-Liang Li,
Chen-Yu Lee,
Kate Saenko,
Tomas Pfister
Abstract:
In Composed Image Retrieval (CIR), a user combines a query image with text to describe their intended target. Existing methods rely on supervised learning of CIR models using labeled triplets consisting of the query image, text specification, and the target image. Labeling such triplets is expensive and hinders broad applicability of CIR. In this work, we propose to study an important task, Zero-Shot Composed Image Retrieval (ZS-CIR), whose goal is to build a CIR model without requiring labeled triplets for training. To this end, we propose a novel method, called Pic2Word, that requires only weakly labeled image-caption pairs and unlabeled image datasets to train. Unlike existing supervised CIR models, our model trained on weakly labeled or unlabeled datasets shows strong generalization across diverse ZS-CIR tasks, e.g., attribute editing, object composition, and domain conversion. Our approach outperforms several supervised CIR methods on the common CIR benchmark, CIRR and Fashion-IQ. Code will be made publicly available at https://github.com/google-research/composed_image_retrieval.
Submitted 15 May, 2023; v1 submitted 6 February, 2023;
originally announced February 2023.
-
GibbsDDRM: A Partially Collapsed Gibbs Sampler for Solving Blind Inverse Problems with Denoising Diffusion Restoration
Authors:
Naoki Murata,
Koichi Saito,
Chieh-Hsin Lai,
Yuhta Takida,
Toshimitsu Uesaka,
Yuki Mitsufuji,
Stefano Ermon
Abstract:
Pre-trained diffusion models have been successfully used as priors in a variety of linear inverse problems, where the goal is to reconstruct a signal from noisy linear measurements. However, existing approaches require knowledge of the linear operator. In this paper, we propose GibbsDDRM, an extension of Denoising Diffusion Restoration Models (DDRM) to a blind setting in which the linear measurement operator is unknown. GibbsDDRM constructs a joint distribution of the data, measurements, and linear operator by using a pre-trained diffusion model for the data prior, and it solves the problem by posterior sampling with an efficient variant of a Gibbs sampler. The proposed method is problem-agnostic, meaning that a pre-trained diffusion model can be applied to various inverse problems without fine-tuning. In experiments, it achieved high performance on both blind image deblurring and vocal dereverberation tasks, despite the use of simple generic priors for the underlying linear operators.
Submitted 27 June, 2023; v1 submitted 30 January, 2023;
originally announced January 2023.
-
SlideVQA: A Dataset for Document Visual Question Answering on Multiple Images
Authors:
Ryota Tanaka,
Kyosuke Nishida,
Kosuke Nishida,
Taku Hasegawa,
Itsumi Saito,
Kuniko Saito
Abstract:
Visual question answering on document images that contain textual, visual, and layout information, called document VQA, has received much attention recently. Although many datasets have been proposed for developing document VQA systems, most of the existing datasets focus on understanding the content relationships within a single image and not across multiple images. In this study, we propose a new multi-image document VQA dataset, SlideVQA, containing 2.6k+ slide decks composed of 52k+ slide images and 14.5k questions about a slide deck. SlideVQA requires complex reasoning, including single-hop, multi-hop, and numerical reasoning, and also provides annotated arithmetic expressions of numerical answers for enhancing the ability of numerical reasoning. Moreover, we developed a new end-to-end document VQA model that treats evidence selection and question answering in a unified sequence-to-sequence format. Experiments on SlideVQA show that our model outperformed existing state-of-the-art QA models, but that it still has a large gap behind human performance. We believe that our dataset will facilitate research on document VQA.
Submitted 12 January, 2023;
originally announced January 2023.
-
Neural Structure Fields with Application to Crystal Structure Autoencoders
Authors:
Naoya Chiba,
Yuta Suzuki,
Tatsunori Taniai,
Ryo Igarashi,
Yoshitaka Ushiku,
Kotaro Saito,
Kanta Ono
Abstract:
Representing crystal structures of materials to facilitate determining them via neural networks is crucial for enabling machine-learning applications involving crystal structure estimation. Among these applications, the inverse design of materials can contribute to explore materials with desired properties without relying on luck or serendipity. We propose neural structure fields (NeSF) as an accurate and practical approach for representing crystal structures using neural networks. Inspired by the concepts of vector fields in physics and implicit neural representations in computer vision, the proposed NeSF considers a crystal structure as a continuous field rather than as a discrete set of atoms. Unlike existing grid-based discretized spatial representations, the NeSF overcomes the tradeoff between spatial resolution and computational complexity and can represent any crystal structure. We propose an autoencoder of crystal structures that can recover various crystal structures, such as those of perovskite structure materials and cuprate superconductors. Extensive quantitative results demonstrate the superior performance of the NeSF compared with the existing grid-based approach.
Submitted 13 December, 2023; v1 submitted 8 December, 2022;
originally announced December 2022.
-
Unsupervised vocal dereverberation with diffusion-based generative models
Authors:
Koichi Saito,
Naoki Murata,
Toshimitsu Uesaka,
Chieh-Hsin Lai,
Yuhta Takida,
Takao Fukui,
Yuki Mitsufuji
Abstract:
Removing reverb from reverberant music is a necessary technique to clean up audio for downstream music manipulations. Reverberation in music falls into two categories: natural reverb and artificial reverb. Artificial reverb has a wider diversity than natural reverb due to its various parameter setups and reverberation types. However, recent supervised dereverberation methods may fail because they rely on sufficiently diverse and numerous pairs of reverberant observations and retrieved data for training in order to be generalizable to unseen observations during inference. To resolve these problems, we propose an unsupervised method that can remove a general kind of artificial reverb for music without requiring pairs of data for training. The proposed method is based on diffusion models: it initializes the unknown reverberation operator with a conventional signal processing technique and simultaneously refines the estimate with the help of diffusion models. We show through objective and perceptual evaluations that our method outperforms the current leading vocal dereverberation benchmarks.
Submitted 8 November, 2022;
originally announced November 2022.
-
Prefix Conditioning Unifies Language and Label Supervision
Authors:
Kuniaki Saito,
Kihyuk Sohn,
Xiang Zhang,
Chun-Liang Li,
Chen-Yu Lee,
Kate Saenko,
Tomas Pfister
Abstract:
Image-classification datasets have been used to pretrain image recognition models. Recently, web-scale image-caption datasets have emerged as a powerful alternative source of pretraining data. Image-caption datasets are more "open-domain", containing a wider variety of scene types and vocabulary words than traditional classification datasets, and models trained on these datasets have demonstrated strong performance on few- and zero-shot recognition tasks. We show that when image-classification and image-caption datasets are naively unified, dataset biases negatively affect pre-training by reducing the generalizability of learned representations and thus jeopardizing zero-shot performance, since the unification can tailor the model to the classification dataset, making it vulnerable to distribution shift from that dataset. In this work, we address the problem by disentangling the dataset bias using prefix tokens that inform a language encoder of the type of the input dataset (e.g., image-classification or caption) at training time. This approach allows the language encoder to share knowledge from the two datasets and to switch its mode of feature extraction between an image-classification-tailored mode and an image-caption-tailored mode; we use the image-caption mode in zero-shot evaluation. Our method is generic and can be easily integrated into existing VL pre-training objectives such as CLIP or UniCL. In experiments, we show that this simple technique improves zero-shot image recognition accuracy and robustness to image-level distribution shift.
Submitted 15 May, 2023; v1 submitted 2 June, 2022;
originally announced June 2022.
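Note: the prefix-token idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation; the token strings, the whitespace tokenizer, and the function name `add_prefix` are hypothetical and chosen only for illustration.

```python
# Hypothetical prefix tokens: the abstract does not state the actual tokens used.
PREFIX_TOKENS = {"classification": "<cls-data>", "caption": "<cap-data>"}


def add_prefix(text, dataset_type):
    """Prepend a dataset-type prefix token to a whitespace-tokenized text."""
    if dataset_type not in PREFIX_TOKENS:
        raise ValueError(f"unknown dataset type: {dataset_type!r}")
    return [PREFIX_TOKENS[dataset_type]] + text.split()


# Training: each text carries the prefix of the dataset it came from.
train_batch = [
    add_prefix("a photo of a dog", "classification"),
    add_prefix("two dogs playing in the snow", "caption"),
]

# Zero-shot evaluation: class prompts use the caption prefix,
# matching the abstract's statement that caption mode is used at test time.
eval_prompt = add_prefix("a photo of a cat", "caption")
```

In a real model the prefix would be an entry in the language encoder's vocabulary rather than a string, but the switching logic would have the same shape.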
-
Fabchain: Managing Audit-able 3D Print Job over Blockchain
Authors:
Ryosuke Abe,
Shigeya Suzuki,
Kenji Saito,
Hiroya Tanaka,
Osamu Nakamura,
Jun Murai
Abstract:
Improvements in fabrication devices such as 3D printers are making it possible for individuals to freely fabricate any product. To clarify who is liable for a product, the fabricator should keep the fabrication history in an immutable and sustainably accessible manner. In this paper, we propose a new scheme, "Fabchain," that can record the fabrication history in such a manner. By utilizing a scheme that employs a blockchain as an audit-able communication channel, Fabchain manages print jobs for the fabricator's 3D printer over the blockchain while maintaining a history of each print job. We implemented Fabchain on Ethereum and evaluated its performance in recording a print job. Our results demonstrate that Fabchain can complete communication of a print job sequence in less than 1 minute on the Ethereum test network. We conclude that Fabchain can manage a print job in a reasonable duration for 3D printing, while satisfying the requirements for immutability and sustainability.
Submitted 6 March, 2022;
originally announced March 2022.
-
Learning to Detect Every Thing in an Open World
Authors:
Kuniaki Saito,
Ping Hu,
Trevor Darrell,
Kate Saenko
Abstract:
Many open-world applications require the detection of novel objects, yet state-of-the-art object detection and instance segmentation networks do not excel at this task. The key issue lies in their assumption that regions without any annotations should be suppressed as negatives, which teaches the model to treat the unannotated objects as background. To address this issue, we propose a simple yet surprisingly powerful data augmentation and training scheme we call Learning to Detect Every Thing (LDET). To avoid suppressing hidden objects, background objects that are visible but unlabeled, we paste annotated objects on a background image sampled from a small region of the original image. Since training solely on such synthetically-augmented images suffers from domain shift, we decouple the training into two parts: 1) training the region classification and regression head on augmented images, and 2) training the mask heads on original images. In this way, a model does not learn to classify hidden objects as background while generalizing well to real images. LDET leads to significant improvements on many datasets in the open-world instance segmentation task, outperforming baselines on cross-category generalization on COCO, as well as cross-dataset evaluation on UVO and Cityscapes.
Submitted 12 April, 2022; v1 submitted 2 December, 2021;
originally announced December 2021.
-
Construction for both self-dual codes and LCD codes
Authors:
Keita Ishizuka,
Ken Saito
Abstract:
From a given $[n, k]$ code $C$, we give a method for constructing many $[n, k]$ codes $C'$ such that the hull dimensions of $C$ and $C'$ are identical. This method can be applied to constructions of both self-dual codes and linear complementary dual codes (LCD codes for short). Using the method, we construct 661 new inequivalent extremal doubly even $[56, 28, 12]$ codes. Furthermore, constructing LCD codes by the method, we improve some of the previously known lower bounds on the largest minimum weights of binary LCD codes of length $n=26,28 \le n \le 40$.
Submitted 27 August, 2021;
originally announced August 2021.
-
Tune it the Right Way: Unsupervised Validation of Domain Adaptation via Soft Neighborhood Density
Authors:
Kuniaki Saito,
Donghyun Kim,
Piotr Teterwak,
Stan Sclaroff,
Trevor Darrell,
Kate Saenko
Abstract:
Unsupervised domain adaptation (UDA) methods can dramatically improve generalization on unlabeled target domains. However, optimal hyper-parameter selection is critical to achieving high accuracy and avoiding negative transfer. Supervised hyper-parameter validation is not possible without labeled target data, which raises the question: How can we validate unsupervised adaptation techniques in a realistic way? We first empirically analyze existing criteria and demonstrate that they are not very effective for tuning hyper-parameters. Intuitively, a well-trained source classifier should embed target samples of the same class nearby, forming dense neighborhoods in feature space. Based on this assumption, we propose a novel unsupervised validation criterion that measures the density of soft neighborhoods by computing the entropy of the similarity distribution between points. Our criterion is simpler than competing validation methods, yet more effective; it can tune hyper-parameters and the number of training iterations in both image classification and semantic segmentation models. The code used for the paper will be available at https://github.com/VisionLearningGroup/SND.
Submitted 24 August, 2021;
originally announced August 2021.
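Note: the criterion described in the abstract above (entropy of the similarity distribution between points) might be sketched as follows. This is an illustrative reading, not the released code: the use of cosine similarity, the exclusion of self-similarity, the softmax temperature of 0.05, and the function name are all assumptions of this sketch.

```python
import math


def soft_neighborhood_density(features, temperature=0.05):
    """Mean entropy of each point's softmax-normalized similarity to all other points."""
    # L2-normalize so that dot products are cosine similarities (assumed here).
    normed = []
    for f in features:
        norm = math.sqrt(sum(x * x for x in f)) or 1.0
        normed.append([x / norm for x in f])

    entropies = []
    for i, fi in enumerate(normed):
        # Temperature-scaled similarities to every other point (self excluded).
        sims = [
            sum(a * b for a, b in zip(fi, fj)) / temperature
            for j, fj in enumerate(normed)
            if j != i
        ]
        m = max(sims)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - m) for s in sims]
        z = sum(exps)
        probs = [e / z for e in exps]
        entropies.append(-sum(p * math.log(p) for p in probs if p > 0))
    return sum(entropies) / len(entropies)


# Two tight clusters: each point has several equally close neighbors.
clustered = [[1, 0], [1, 0.01], [1, -0.01], [0, 1], [0.01, 1], [-0.01, 1]]
# Points spread unevenly around a circle: each point has one dominant neighbor.
scattered = [
    [math.cos(math.radians(a)), math.sin(math.radians(a))]
    for a in (0, 10, 35, 75, 130, 200)
]
```

Under these assumptions the clustered set scores higher than the scattered one, which matches the intuition in the abstract that dense neighborhoods indicate a better-adapted model.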
-
VisDA-2021 Competition Universal Domain Adaptation to Improve Performance on Out-of-Distribution Data
Authors:
Dina Bashkirova,
Dan Hendrycks,
Donghyun Kim,
Samarth Mishra,
Kate Saenko,
Kuniaki Saito,
Piotr Teterwak,
Ben Usman
Abstract:
Progress in machine learning is typically measured by training and testing a model on the same distribution of data, i.e., the same domain. This over-estimates future accuracy on out-of-distribution data. The Visual Domain Adaptation (VisDA) 2021 competition tests models' ability to adapt to novel test distributions and handle distributional shift. We set up unsupervised domain adaptation challenges for image classifiers and will evaluate adaptation to novel viewpoints, backgrounds, modalities and degradation in quality. Our challenge draws on large-scale publicly available datasets but constructs the evaluation across domains, rather than the traditional in-domain benchmarking. Furthermore, we focus on the difficult "universal" setting where, in addition to input distribution drift, methods may encounter missing and/or novel classes in the target dataset. Performance will be measured using a rigorous protocol, comparing to state-of-the-art domain adaptation methods with the help of established metrics. We believe that the competition will encourage further improvement in machine learning methods' ability to handle realistic data in many deployment scenarios.
Submitted 22 July, 2021;
originally announced July 2021.
-
OpenMatch: Open-set Consistency Regularization for Semi-supervised Learning with Outliers
Authors:
Kuniaki Saito,
Donghyun Kim,
Kate Saenko
Abstract:
Semi-supervised learning (SSL) is an effective means to leverage unlabeled data to improve a model's performance. Typical SSL methods like FixMatch assume that labeled and unlabeled data share the same label space. However, in practice, unlabeled data can contain categories unseen in the labeled set, i.e., outliers, which can significantly harm the performance of SSL algorithms. To address this problem, we propose a novel Open-set Semi-Supervised Learning (OSSL) approach called OpenMatch. Learning representations of inliers while rejecting outliers is essential for the success of OSSL. To this end, OpenMatch unifies FixMatch with novelty detection based on one-vs-all (OVA) classifiers. The OVA-classifier outputs the confidence score of a sample being an inlier, providing a threshold to detect outliers. Another key contribution is an open-set soft-consistency regularization loss, which enhances the smoothness of the OVA-classifier with respect to input transformations and greatly improves outlier detection. OpenMatch achieves state-of-the-art performance on three datasets, and even outperforms a fully supervised model in detecting outliers unseen in unlabeled data on CIFAR10.
Submitted 24 August, 2021; v1 submitted 28 May, 2021;
originally announced May 2021.
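Note: the outlier test implied by the abstract above (a one-vs-all score that a sample is an inlier, compared against a threshold) could look roughly like the sketch below. The rule of checking the one-vs-all score of the closed-set predicted class, the default threshold of 0.5, and the function name are assumptions of this sketch, not details confirmed by the abstract.

```python
def detect_outlier(class_probs, inlier_probs, threshold=0.5):
    """Return (predicted_class, is_outlier) for one sample.

    class_probs[k]:  closed-set probability that the sample belongs to class k.
    inlier_probs[k]: one-vs-all probability that the sample is an inlier of class k.
    """
    if len(class_probs) != len(inlier_probs):
        raise ValueError("both score lists must cover the same classes")
    # Pick the most likely known class, then ask that class's
    # one-vs-all classifier whether the sample really belongs to it.
    pred = max(range(len(class_probs)), key=class_probs.__getitem__)
    return pred, inlier_probs[pred] < threshold


confident = detect_outlier([0.1, 0.7, 0.2], [0.2, 0.9, 0.3])
rejected = detect_outlier([0.4, 0.35, 0.25], [0.3, 0.6, 0.1])
```

Here `confident` keeps its predicted class as an inlier, while `rejected` is flagged as an outlier because the one-vs-all score for its predicted class falls below the threshold.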
-
Training Speech Enhancement Systems with Noisy Speech Datasets
Authors:
Koichi Saito,
Stefan Uhlich,
Giorgio Fabbro,
Yuki Mitsufuji
Abstract:
Recently, deep neural network (DNN)-based speech enhancement (SE) systems have been used with great success. During training, such systems require clean speech data, ideally in large quantity, with a variety of acoustic conditions, many different speaker characteristics, and a given sampling rate (e.g., 48kHz for fullband SE). However, obtaining such clean speech data is not straightforward, especially if only publicly available datasets are considered. At the same time, a lot of material for automatic speech recognition (ASR) with the desired acoustic, speaker, and sampling-rate characteristics is publicly available, except that it is not clean: it also contains background noise, which is often even desired in order to make ASR systems noise-robust. Hence, using such data to train SE systems is not straightforward. In this paper, we propose two improvements for training SE systems on noisy speech data. First, we propose several modifications of the loss functions that make them robust against noisy speech targets. In particular, computing the median over the sample axis before averaging over time-frequency bins makes it possible to use such data. Furthermore, we propose a noise augmentation scheme for mixture-invariant training (MixIT) that allows it to be used in such scenarios as well. For our experiments, we use the Mozilla Common Voice dataset, and we show that using our robust loss function improves PESQ by up to 0.19 compared to a system trained in the traditional way. Similarly, for MixIT we see an improvement of up to 0.27 in PESQ when using our proposed noise augmentation.
Submitted 25 May, 2021;
originally announced May 2021.
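Note: the "median over the sample axis before averaging over time-frequency bins" step from the abstract above can be illustrated with a small sketch. It assumes that the sample axis means the training examples in a mini-batch, that the per-bin error is a squared difference, and that time-frequency bins are flattened to a single axis; none of these choices is confirmed by the abstract, and the function name is hypothetical.

```python
from statistics import median


def median_then_mean_loss(estimates, targets):
    """Squared error per (sample, bin); median over samples, then mean over bins.

    estimates, targets: nested lists of shape [num_samples][num_bins].
    """
    num_bins = len(targets[0])
    per_bin = []
    for b in range(num_bins):
        errors = [(est[b] - tgt[b]) ** 2 for est, tgt in zip(estimates, targets)]
        # The median ignores a minority of samples whose targets are badly off.
        per_bin.append(median(errors))
    return sum(per_bin) / num_bins


estimates = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
# The third target is heavily corrupted, standing in for a noisy speech target.
targets = [[1.0, 2.0], [1.5, 2.5], [9.0, 9.0]]
robust = median_then_mean_loss(estimates, targets)
```

A plain mean over samples would be dominated by the corrupted third target, whereas the median keeps the loss at the level of the typical sample.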
-
Sampling-Frequency-Independent Audio Source Separation Using Convolution Layer Based on Impulse Invariant Method
Authors:
Koichi Saito,
Tomohiko Nakamura,
Kohei Yatabe,
Yuma Koizumi,
Hiroshi Saruwatari
Abstract:
Audio source separation is often used as preprocessing for various applications, and one of its ultimate goals is to construct a single versatile model capable of dealing with the full variety of audio signals. Since sampling frequency, one aspect of that variety, is usually application specific, the preceding audio source separation model should be able to deal with audio signals at all sampling frequencies specified in the target applications. However, conventional models based on deep neural networks (DNNs) are trained only at the sampling frequency specified by the training data, and there is no guarantee that they work with unseen sampling frequencies. In this paper, we propose a convolution layer that enables a single DNN to handle arbitrary sampling frequencies. Through music source separation experiments, we show that introducing the proposed layer enables a conventional audio source separation model to work consistently even with unseen sampling frequencies.
Submitted 9 May, 2021;
originally announced May 2021.
-
On the existence of quaternary Hermitian LCD codes with Hermitian dual distance $1$
Authors:
Keita Ishizuka,
Ken Saito
Abstract:
For $k \ge 2$ and a positive integer $d_0$, we show that if there exists no quaternary Hermitian linear complementary dual $[n,k,d]$ code with $d \ge d_0$ and Hermitian dual distance greater than or equal to $2$, then there exists no quaternary Hermitian linear complementary dual $[n,k,d]$ code with $d \ge d_0$ and Hermitian dual distance $1$. As a consequence, we generalize a result by Araya, Harada and Saito on the nonexistence of some quaternary Hermitian linear complementary dual codes.
Submitted 15 April, 2021;
originally announced April 2021.
-
OVANet: One-vs-All Network for Universal Domain Adaptation
Authors:
Kuniaki Saito,
Kate Saenko
Abstract:
Universal Domain Adaptation (UNDA) aims to handle both domain-shift and category-shift between two datasets, where the main challenge is to transfer knowledge while rejecting unknown classes which are absent in the labeled source data but present in the unlabeled target data. Existing methods manually set a threshold to reject unknown samples based on validation or a pre-defined ratio of unknown samples, but this strategy is not practical. In this paper, we propose a method to learn the threshold using source samples and to adapt it to the target domain. Our idea is that a minimum inter-class distance in the source domain should be a good threshold to decide between known and unknown in the target. To learn the inter- and intra-class distance, we propose to train a one-vs-all classifier for each class using labeled source data. Then, we adapt the open-set classifier to the target domain by minimizing class entropy. The resulting framework is the simplest of all UNDA baselines and is insensitive to the value of a hyper-parameter, yet outperforms the baselines by a large margin.
Submitted 24 August, 2021; v1 submitted 7 April, 2021;
originally announced April 2021.
-
Structured Inverted-File k-Means Clustering for High-Dimensional Sparse Data
Authors:
Kazuo Aoyama,
Kazumi Saito
Abstract:
This paper presents an architecture-friendly k-means clustering algorithm called SIVF for large-scale, high-dimensional sparse data sets. The time efficiency of an algorithm is often measured by the number of costly operations such as similarity calculations. In practice, however, it depends greatly on how well the algorithm adapts to the architecture of the computer system on which it is executed. Our proposed SIVF employs an invariant centroid-pair based filter (ICP) to decrease the number of similarity calculations between a data object and the centroids of all the clusters. To maximize the ICP performance, SIVF uses, for the centroid set, an inverted file that is structured so as to reduce pipeline hazards. We demonstrate in experiments on real large-scale document data sets that SIVF operates at higher speed and with lower memory consumption than existing algorithms. Our performance analysis reveals that SIVF achieves its higher speed by suppressing performance degradation factors, namely the numbers of cache misses and branch mispredictions, rather than by performing fewer similarity calculations.
Submitted 30 March, 2021;
originally announced March 2021.
-
Privacy-Preserving Infection Exposure Notification without Trust in Third Parties
Authors:
Kenji Saito,
Mitsuru Iwamura
Abstract:
In response to the COVID-19 pandemic, Bluetooth-based contact tracing has been deployed in many countries with the help of the developers of smartphone operating systems that provide APIs for privacy-preserving exposure notification. However, it has been assumed by the design that the OS developers, smartphone vendors, or governments will not violate people's privacy. We propose a privacy-preserving exposure notification under situations where none of the middle entities can be trusted. We believe that it can be achieved with small changes to the existing mechanism: random numbers are generated on the application side instead of the OS, and the positive test results are reported to a public ledger (e.g. blockchain) rather than to a government server, with endorsements from the medical institutes with blind signatures. We also discuss how to incentivize the peer-to-peer maintenance of the public ledger if it should be newly built. We show that the level of verifiability is much higher with our proposed design if a consumer group were to verify the privacy protections of the deployed systems. We believe that this will allow for safer contact tracing, and contribute to healthier lifestyles for citizens who may want to or have to go out under pandemic situations.
Submitted 13 March, 2021;
originally announced March 2021.
-
Lightweight Selective Disclosure for Verifiable Documents on Blockchain
Authors:
Kenji Saito,
Satoki Watanabe
Abstract:
To achieve lightweight selective disclosure that protects the privacy of document holders, we propose an XML format for documents that can hide arbitrary elements using a cryptographic hash function and salts, and that allows them to be partially digitally signed and efficiently verified, as well as a JSON format that can be converted to such XML. The documents can be efficiently proven to exist by representing multiple such structures as a Merkle tree and storing its root in blockchain.
We show that our proposal has advantages over known methods that represent the document itself as a Merkle tree and partially hide it.
Submitted 9 October, 2021; v1 submitted 13 March, 2021;
originally announced March 2021.
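Note: the two mechanisms named in the abstract above (hiding elements behind salted hashes, and committing to many documents with one Merkle root) can be sketched as below. This is a simplified illustration rather than the proposed XML/JSON format: the digest layout, the SHA-256 choice, the odd-node duplication rule, and all function names are assumptions of this sketch.

```python
import hashlib


def h(data: bytes) -> bytes:
    """SHA-256 digest (the hash function is an assumption of this sketch)."""
    return hashlib.sha256(data).digest()


def element_digest(value: str, salt: bytes) -> bytes:
    """Salted digest that can stand in for a hidden element."""
    return h(salt + value.encode("utf-8"))


def document_digest(element_digests) -> bytes:
    """Document digest computed only from element digests, never raw values."""
    return h(b"".join(element_digests))


def merkle_root(leaves) -> bytes:
    """Root of a binary Merkle tree; an odd node is paired with itself."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


# Fixed salts keep the example reproducible; real salts should be random.
salts = [b"salt-1", b"salt-2"]
full = [element_digest("Alice", salts[0]), element_digest("1990-01-01", salts[1])]

# The holder discloses "Alice" with its salt but only the digest of the birth date.
disclosed = [element_digest("Alice", salts[0]), full[1]]

# One root commits to several documents; only this root would go on the ledger.
root = merkle_root([document_digest(full), h(b"another document")])
```

Because the document digest depends only on element digests, a verifier can recompute it from the partially disclosed form and check it against the committed root without learning the hidden value.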
-
Requirement Analyses and Evaluations of Blockchain Platforms per Possible Use Cases
Authors:
Kenji Saito,
Akimitsu Shiseki,
Mitsuyasu Takada,
Hiroki Yamamoto,
Masaaki Saitoh,
Hiroaki Ohkawa,
Hirofumi Andou,
Naotake Miyamoto,
Kazuaki Yamakawa,
Kiyoshi Kurakawa,
Tomohiro Yabushita,
Yuji Yamada,
Go Masuda,
Kazuyuki Masuda
Abstract:
It is said that blockchain will contribute to the digital transformation of society in a wide range of ways, from the management of public and private documents to the traceability in various industries, as well as digital currencies. A number of so-called blockchain platforms have been developed, and experiments and applications have been carried out on them. But are these platforms really conducive to practical use of the blockchain concept?
To answer the question, we need to better understand what the technology called blockchain really is. We need to sort out the confusion we see in understanding what blockchain was invented for and what it means. We also need to clarify the structure of its applications.
This document provides a generic model of understanding blockchain and its applications. We introduce design patterns to classify the platforms. We categorize possible use cases by identifying the structure among applications, and organize the functional, performance, operational and legal requirements for each such case.
Based on the categorization and criteria, we evaluated and compared the following platforms: Hyperledger Fabric, Hyperledger Iroha, Hyperledger Indy, Ethereum, Quorum/Hyperledger Besu, Ethereum 2.0, Polkadot, Corda and BBc-1. We have tried to be fair in our evaluations and comparisons, but we also expect to provoke discussion.
The intended reader of this document is anyone involved in the development of application systems who wants to understand blockchain and its platforms, including non-engineers and non-technologists. The assessments in this document will allow readers to understand the technological requirements for the blockchain platforms, to question existing technologies, and to choose the appropriate platforms for the applications they envision. The comparisons hopefully will also be useful as a guide for designing new technologies.
Submitted 4 March, 2021;
originally announced March 2021.
-
Proof of Authenticity of Logistics Information with Passive RFID Tags and Blockchain
Authors:
Hiroshi Watanabe,
Kenji Saito,
Satoshi Miyazaki,
Toshiharu Okada,
Hiroyuki Fukuyama,
Tsuneo Kato,
Katsuo Taniguchi
Abstract:
In tracing the (robotically automated) logistics of large quantities of goods, inexpensive passive RFID tags are preferred for cost reasons. Accordingly, among the many issues of RFID, security between such tags and readers has primarily been studied. However, the authenticity of data cannot be guaranteed if logistics services can give false information. Although the use of blockchain is often discussed, it is simply a recording system, so there is a risk that false records may be written to it.
As a solution, we propose a design in which a digitally signing, location-constrained and tamper-evident reader atomically writes evidence to the blockchain along with its reading from and writing to a tag.
By semi-formal modeling, we confirmed that the confidentiality and integrity of the information can be maintained throughout the system, and digitally signed data can be verified later despite possible compromise of private keys or signature algorithms, or expiration of public key certificates. We also introduce a prototype design to show that our proposal is viable.
This makes it possible to trace authentic logistics information using inexpensive passive RFID tags. Furthermore, by abstracting the reader/writer as a sensor/actuator, this model can be extended to IoT in general.
Submitted 10 November, 2020;
originally announced November 2020.
-
Self-supervised Visual Attribute Learning for Fashion Compatibility
Authors:
Donghyun Kim,
Kuniaki Saito,
Samarth Mishra,
Stan Sclaroff,
Kate Saenko,
Bryan A Plummer
Abstract:
Many self-supervised learning (SSL) methods have been successful in learning semantically meaningful visual representations by solving pretext tasks. However, prior work in SSL focuses on tasks like object recognition or detection, which aim to learn object shapes and assume that the features should be invariant to concepts like colors and textures. Thus, these SSL methods perform poorly on downstream tasks where these concepts provide critical information. In this paper, we present an SSL framework that enables us to learn color and texture-aware features without requiring any labels during training. Our approach consists of three self-supervised tasks designed to capture different concepts that are neglected in prior work that we can select from depending on the needs of our downstream tasks. Our tasks include learning to predict color histograms and discriminate shapeless local patches and textures from each instance. We evaluate our approach on fashion compatibility using Polyvore Outfits and In-Shop Clothing Retrieval using Deepfashion, improving upon prior SSL methods by 9.5-16%, and even outperforming some supervised approaches on Polyvore Outfits despite using no labels. We also show that our approach can be used for transfer learning, demonstrating that we can train on one dataset while achieving high performance on a different dataset.
Submitted 11 August, 2021; v1 submitted 1 August, 2020;
originally announced August 2020.
-
COCO-FUNIT: Few-Shot Unsupervised Image Translation with a Content Conditioned Style Encoder
Authors:
Kuniaki Saito,
Kate Saenko,
Ming-Yu Liu
Abstract:
Unsupervised image-to-image translation intends to learn a mapping of an image in a given domain to an analogous image in a different domain, without explicit supervision of the mapping. Few-shot unsupervised image-to-image translation further attempts to generalize the model to an unseen domain by leveraging example images of the unseen domain provided at inference time. While remarkably successful, existing few-shot image-to-image translation models find it difficult to preserve the structure of the input image while emulating the appearance of the unseen domain, which we refer to as the content loss problem. This is particularly severe when the poses of the objects in the input and example images are very different. To address the issue, we propose a new few-shot image translation model, COCO-FUNIT, which computes the style embedding of the example images conditioned on the input image and a new module called the constant style bias. Through extensive experimental validations with comparison to the state-of-the-art, our model shows effectiveness in addressing the content loss problem. For code and pretrained models, please check out https://nvlabs.github.io/COCO-FUNIT/.
Submitted 28 July, 2020; v1 submitted 14 July, 2020;
originally announced July 2020.
-
Cross-domain Self-supervised Learning for Domain Adaptation with Few Source Labels
Authors:
Donghyun Kim,
Kuniaki Saito,
Tae-Hyun Oh,
Bryan A. Plummer,
Stan Sclaroff,
Kate Saenko
Abstract:
Existing unsupervised domain adaptation methods aim to transfer knowledge from a label-rich source domain to an unlabeled target domain. However, obtaining labels for some source domains may be very expensive, making complete labeling as used in prior work impractical. In this work, we investigate a new domain adaptation scenario with sparsely labeled source data, where only a few examples in the source domain have been labeled, while the target domain is unlabeled. We show that when labeled source examples are limited, existing methods often fail to learn discriminative features applicable to both source and target domains. We propose a novel Cross-Domain Self-supervised (CDS) learning approach for domain adaptation, which learns features that are not only domain-invariant but also class-discriminative. Our self-supervised learning method captures apparent visual similarity with in-domain self-supervision in a domain-adaptive manner and performs cross-domain feature matching with across-domain self-supervision. In extensive experiments with three standard benchmark datasets, our method significantly boosts accuracy in the new target domain with few source labels and is even helpful in classical domain adaptation scenarios.
Submitted 18 March, 2020;
originally announced March 2020.
-
Inverted-File k-Means Clustering: Performance Analysis
Authors:
Kazuo Aoyama,
Kazumi Saito,
Tetsuo Ikeda
Abstract:
This paper presents an inverted-file k-means clustering algorithm (IVF) suitable for a large-scale sparse data set with potentially numerous classes. Given such a data set, IVF runs at high speed with low memory consumption while producing the same solution as the standard Lloyd's algorithm. The high performance arises from two distinct data representations. One is a sparse expression for both the object and mean feature vectors. The other is an inverted-file data structure for the set of mean feature vectors. To confirm the effect of these representations, we design three algorithms using distinct data structures and expressions for comparison. We experimentally demonstrate that IVF achieves better performance than the designed algorithms when they are applied to large-scale real document data sets in a modern computer system equipped with superscalar out-of-order processors and a deep hierarchical memory system. We also introduce a simple yet practical clock-cycles-per-instruction (CPI) model for speed-performance analysis. Analytical results reveal that IVF suppresses three performance degradation factors: the numbers of cache misses, branch mispredictions, and completed instructions.
Submitted 20 February, 2020;
originally announced February 2020.
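The two representations named in the abstract above (sparse vectors plus an inverted file over the mean vectors) can be illustrated with a minimal sketch of the assignment step. This is not the authors' implementation: the function names `build_inverted_file` and `assign` are hypothetical, vectors are stored as plain `{term_id: value}` dictionaries, and similarity is assumed to be the dot product (as in spherical k-means), which may differ from the paper's exact formulation.

```python
from collections import defaultdict


def build_inverted_file(means):
    """Map each term id to the (cluster id, weight) pairs of means that contain it."""
    inverted = defaultdict(list)
    for cluster_id, mean in enumerate(means):
        for term, weight in mean.items():
            if weight != 0.0:
                inverted[term].append((cluster_id, weight))
    return inverted


def assign(doc, inverted, num_clusters):
    """Return the cluster whose mean has the largest dot product with doc.

    Only the nonzero terms of doc are visited, and for each term only the
    means that actually contain it, so zero entries cost nothing.
    """
    scores = [0.0] * num_clusters
    for term, value in doc.items():
        for cluster_id, weight in inverted.get(term, ()):
            scores[cluster_id] += value * weight
    return max(range(num_clusters), key=scores.__getitem__)


# Hypothetical toy data: two sparse means over a three-term vocabulary.
means = [{0: 1.0}, {1: 0.8, 2: 0.6}]
inverted = build_inverted_file(means)
print(assign({1: 2.0}, inverted, len(means)))
```

The work per document is proportional to the number of (term, mean) pairs that share a nonzero entry, rather than to the full vocabulary size times the number of clusters.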
-
Universal Domain Adaptation through Self Supervision
Authors:
Kuniaki Saito,
Donghyun Kim,
Stan Sclaroff,
Kate Saenko
Abstract:
Unsupervised domain adaptation methods traditionally assume that all source categories are present in the target domain. In practice, little may be known about the category overlap between the two domains. While some methods address target settings with either partial or open-set categories, they assume that the particular setting is known a priori. We propose a more universally applicable domain adaptation framework that can handle arbitrary category shift, called Domain Adaptative Neighborhood Clustering via Entropy optimization (DANCE). DANCE combines two novel ideas: First, as we cannot fully rely on source categories to learn features discriminative for the target, we propose a novel neighborhood clustering technique to learn the structure of the target domain in a self-supervised way. Second, we use entropy-based feature alignment and rejection to align target features with the source, or reject them as unknown categories based on their entropy. We show through extensive experiments that DANCE outperforms baselines across open-set, open-partial and partial domain adaptation settings. Implementation is available at https://github.com/VisionLearningGroup/DANCE.
Submitted 5 October, 2020; v1 submitted 18 February, 2020;
originally announced February 2020.
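The entropy-based rejection idea in the abstract above can be sketched as follows. This is a minimal illustration, not the DANCE implementation: the function name `classify_or_reject` is hypothetical, and the default threshold of half the maximum entropy, log(K)/2 for K known classes, is an assumption made for this sketch.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def classify_or_reject(logits, threshold=None):
    """Return the predicted class index, or None if the sample is rejected as unknown."""
    probs = softmax(logits)
    if threshold is None:
        # Assumed default: half of the maximum possible entropy log(K).
        threshold = math.log(len(probs)) / 2.0
    if entropy(probs) > threshold:
        return None
    return max(range(len(probs)), key=probs.__getitem__)


print(classify_or_reject([10.0, 0.0, 0.0]))  # confident prediction
print(classify_or_reject([0.0, 0.0, 0.0]))   # uniform prediction
```

A confident prediction has low entropy and keeps its class label, while a near-uniform prediction has high entropy and is treated as an unknown category.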
-
MULE: Multimodal Universal Language Embedding
Authors:
Donghyun Kim,
Kuniaki Saito,
Kate Saenko,
Stan Sclaroff,
Bryan A. Plummer
Abstract:
Existing vision-language methods typically support at most two languages at a time. In this paper, we present a modular approach that can easily be incorporated into existing vision-language methods in order to support many languages. We accomplish this by learning a single shared Multimodal Universal Language Embedding (MULE) that has been visually-semantically aligned across all languages. Then we learn to relate MULE to visual data as if it were a single language. Our method is not architecture-specific, unlike prior work that typically learned separate branches for each language, enabling our approach to be easily adapted to many vision-language methods and tasks. Since MULE learns a single language branch in the multimodal model, we can also scale to support many languages, and languages with fewer annotations can take advantage of the good representation learned from other (more abundant) language data. We demonstrate the effectiveness of MULE on the bidirectional image-sentence retrieval task, supporting up to four languages in a single model. In addition, we show that Machine Translation can be used for data augmentation in multilingual learning, which, combined with MULE, improves mean recall by up to 21.9% on a single language compared to prior work, with the most significant gains seen on languages with relatively few annotations. Our code is publicly available.
Submitted 28 December, 2019; v1 submitted 8 September, 2019;
originally announced September 2019.
-
Remark on subcodes of linear complementary dual codes
Authors:
Masaaki Harada,
Ken Saito
Abstract:
We show that any ternary Euclidean (resp.\ quaternary Hermitian) linear complementary dual $[n,k]$ code contains a Euclidean (resp.\ Hermitian) linear complementary dual $[n,k-1]$ subcode for $2 \le k \le n$. As a consequence, we derive a bound on the largest minimum weights among ternary Euclidean linear complementary dual codes and quaternary Hermitian linear complementary dual codes.
Submitted 23 August, 2019;
originally announced August 2019.
-
On the minimum weights of binary LCD codes and ternary LCD codes
Authors:
Makoto Araya,
Masaaki Harada,
Ken Saito
Abstract:
Linear complementary dual (LCD) codes are linear codes that intersect with their dual codes trivially. We study the largest minimum weight $d_2(n,k)$ among all binary LCD $[n,k]$ codes and the largest minimum weight $d_3(n,k)$ among all ternary LCD $[n,k]$ codes. The largest minimum weights $d_2(n,5)$ and $d_3(n,4)$ are partially determined. We also determine the largest minimum weights $d_2(n,n-5)$, $d_3(n,n-i)$ for $i \in \{2,3,4\}$, and $d_3(n,k)$ for $n \in \{11,12,\ldots,19\}$.
Submitted 18 November, 2020; v1 submitted 23 August, 2019;
originally announced August 2019.
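The defining property in the abstract above (a code meeting its dual trivially) can be tested with Massey's criterion: a linear code over F_q with a full-row-rank generator matrix G is Euclidean LCD exactly when G G^T is nonsingular. The sketch below checks this over a prime field F_p; the function names are hypothetical, p is assumed prime (which covers the binary and ternary cases), and G is assumed to have full row rank.

```python
def gram_matrix(G, p):
    """Compute G * G^T with entries reduced modulo p."""
    k = len(G)
    return [
        [sum(a * b for a, b in zip(G[i], G[j])) % p for j in range(k)]
        for i in range(k)
    ]


def is_nonsingular_mod_p(M, p):
    """Gaussian elimination over F_p (p prime); True if M is invertible."""
    M = [row[:] for row in M]  # work on a copy
    n = len(M)
    for col in range(n):
        pivot = next((r for r in range(col, n) if M[r][col] % p != 0), None)
        if pivot is None:
            return False  # no pivot in this column, so the determinant is 0
        M[col], M[pivot] = M[pivot], M[col]
        inv = pow(M[col][col], p - 2, p)  # inverse via Fermat's little theorem
        for r in range(col + 1, n):
            factor = (M[r][col] * inv) % p
            M[r] = [(x - factor * y) % p for x, y in zip(M[r], M[col])]
    return True


def is_euclidean_lcd(G, p):
    """Massey's criterion, assuming G is a full-row-rank generator matrix over F_p."""
    return is_nonsingular_mod_p(gram_matrix(G, p), p)


# The length-2 repetition code is self-dual over F_2 but LCD over F_3.
print(is_euclidean_lcd([[1, 1]], 2))
print(is_euclidean_lcd([[1, 1]], 3))
```

Over F_2 the inner product of (1,1) with itself is 0, so the code lies inside its own dual; over F_3 it is 2, so the intersection is trivial.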
-
Characterization and classification of optimal LCD codes
Authors:
Makoto Araya,
Masaaki Harada,
Ken Saito
Abstract:
Linear complementary dual (LCD) codes are linear codes that intersect with their dual trivially. We give a characterization of LCD codes over $\mathbb{F}_q$ having large minimum weights for $q \in \{2,3\}$. Using the characterization, we determine the largest minimum weights among LCD $[n,k]$ codes over $\mathbb{F}_q$ for $(q,k) \in \{(2,4), (3,2),(3,3)\}$. Moreover, we give a complete classification of optimal LCD $[n,k]$ codes over $\mathbb{F}_q$ for $(q,k) \in \{(2,3), (2,4), (3,2),(3,3)\}$.
Submitted 4 January, 2021; v1 submitted 8 August, 2019;
originally announced August 2019.
-
Quaternary Hermitian linear complementary dual codes
Authors:
Makoto Araya,
Masaaki Harada,
Ken Saito
Abstract:
The largest minimum weights among quaternary Hermitian linear complementary dual codes are known for dimension $2$. In this paper, we give some conditions for the nonexistence of quaternary Hermitian linear complementary dual codes with large minimum weights. As an application, we completely determine the largest minimum weights for dimension $3$, by using a classification of some quaternary codes. In addition, for a positive integer $s$, a maximal-entanglement entanglement-assisted quantum $[[21s+5,3,16s+3;21s+2]]$ code is constructed for the first time from a quaternary Hermitian linear complementary dual $[26,3,19]$ code.
Submitted 27 December, 2019; v1 submitted 16 April, 2019;
originally announced April 2019.
-
Semi-supervised Domain Adaptation via Minimax Entropy
Authors:
Kuniaki Saito,
Donghyun Kim,
Stan Sclaroff,
Trevor Darrell,
Kate Saenko
Abstract:
Contemporary domain adaptation methods are very effective at aligning feature distributions of source and target domains without any target supervision. However, we show that these techniques perform poorly when even a few labeled examples are available in the target. To address this semi-supervised domain adaptation (SSDA) setting, we propose a novel Minimax Entropy (MME) approach that adversarially optimizes an adaptive few-shot model. Our base model consists of a feature encoding network, followed by a classification layer that computes the features' similarity to estimated prototypes (representatives of each class). Adaptation is achieved by alternately maximizing the conditional entropy of unlabeled target data with respect to the classifier and minimizing it with respect to the feature encoder. We empirically demonstrate the superiority of our method over many baselines, including conventional feature alignment and few-shot methods, setting a new state of the art for SSDA.
Submitted 14 September, 2019; v1 submitted 13 April, 2019;
originally announced April 2019.
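The quantity that the abstract above describes as alternately maximized and minimized, the conditional entropy of a prototype-based classifier on unlabeled target features, can be sketched as below. This is an illustrative computation only, with no training loop or gradients: the function names are hypothetical, similarity is assumed to be cosine similarity scaled by a temperature, and the default temperature of 0.05 is an assumption made for this sketch.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def prototype_probs(feature, prototypes, temperature=0.05):
    """Softmax over temperature-scaled cosine similarities to each class prototype."""
    f = normalize(feature)
    logits = [
        sum(a * b for a, b in zip(f, normalize(w))) / temperature
        for w in prototypes
    ]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def conditional_entropy(features, prototypes, temperature=0.05):
    """Mean prediction entropy (in nats) over a batch of unlabeled features."""
    total = 0.0
    for feature in features:
        probs = prototype_probs(feature, prototypes, temperature)
        total -= sum(p * math.log(p) for p in probs if p > 0.0)
    return total / len(features)


# Hypothetical 2-D prototypes for two classes.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
print(conditional_entropy([[1.0, 1.0]], prototypes))  # equidistant feature
print(conditional_entropy([[1.0, 0.0]], prototypes))  # feature on a prototype
```

In the scheme the abstract describes, the classifier (the prototypes) would be updated to increase this value and the feature encoder to decrease it; the sketch only evaluates the objective for fixed inputs.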
-
TWINs: Two Weighted Inconsistency-reduced Networks for Partial Domain Adaptation
Authors:
Toshihiko Matsuura,
Kuniaki Saito,
Tatsuya Harada
Abstract:
The task of unsupervised domain adaptation is to transfer the knowledge of a label-rich domain (source domain) to a label-scarce domain (target domain). Matching feature distributions between different domains is a widely applied method for this task. However, the method does not perform well when classes in the two domains are not identical. Specifically, when the classes of the target correspond to a subset of those of the source, target samples can be incorrectly aligned with the classes that exist only in the source. This problem setting is termed partial domain adaptation (PDA). In this study, we propose a novel method called Two Weighted Inconsistency-reduced Networks (TWINs) for PDA. We utilize two classification networks to estimate the ratio of target samples in each class; this ratio weights the classification loss so that adaptation focuses on the classes present in the target domain. Furthermore, to extract discriminative features for the target, we propose to minimize the divergence between domains measured by the classifiers' inconsistency on target samples. We empirically demonstrate that reducing the inconsistency between two networks is effective for PDA and that our method outperforms other existing methods by a large margin on several datasets.
Submitted 18 December, 2018;
originally announced December 2018.
-
Strong-Weak Distribution Alignment for Adaptive Object Detection
Authors:
Kuniaki Saito,
Yoshitaka Ushiku,
Tatsuya Harada,
Kate Saenko
Abstract:
We propose an approach for unsupervised adaptation of object detectors from label-rich to label-poor domains which can significantly reduce annotation costs associated with detection. Recently, approaches that align distributions of source and target images using an adversarial loss have been proven effective for adapting object classifiers. However, for object detection, fully matching the entire distributions of source and target images to each other at the global image level may fail, as domains could have distinct scene layouts and different combinations of objects. On the other hand, strong matching of local features such as texture and color makes sense, as it does not change category level semantics. This motivates us to propose a novel method for detector adaptation based on strong local alignment and weak global alignment. Our key contribution is the weak alignment model, which focuses the adversarial alignment loss on images that are globally similar and puts less emphasis on aligning images that are globally dissimilar. Additionally, we design the strong domain alignment model to only look at local receptive fields of the feature map. We empirically verify the effectiveness of our method on four datasets comprising both large and small domain shifts. Our code is available at https://github.com/VisionLearningGroup/DA_Detection
Submitted 5 April, 2019; v1 submitted 11 December, 2018;
originally announced December 2018.
-
Multichannel Semantic Segmentation with Unsupervised Domain Adaptation
Authors:
Kohei Watanabe,
Kuniaki Saito,
Yoshitaka Ushiku,
Tatsuya Harada
Abstract:
Most contemporary robots have depth sensors, and research on semantic segmentation with RGBD images has shown that depth images boost the accuracy of segmentation. Since it is time-consuming to annotate images with semantic labels per pixel, it would be ideal if we could avoid this laborious work by utilizing an existing dataset or a synthetic dataset which we can generate on our own. Robot motions are often tested in a synthetic environment, where multichannel (e.g., RGB + depth + instance boundary) images plus their pixel-level semantic labels are available. However, models trained simply on synthetic images tend to demonstrate poor performance on real images. In order to address this, we propose two approaches that can efficiently exploit multichannel inputs combined with an unsupervised domain adaptation (UDA) algorithm. One is a fusion-based approach that uses depth images as inputs. The other is a multitask learning approach that uses depth images as outputs. We demonstrate that the multitask learning approach with post-processing improves the segmentation results, and we create a benchmark for this task.
Submitted 11 December, 2018;
originally announced December 2018.
-
Dynamic Block Matching to assess the longitudinal component of the dense motion field of the carotid artery wall in B-mode ultrasound sequences -- Association with coronary artery disease
Authors:
Guillaume Zahnd,
Kozue Saito,
Kazuyuki Nagatsuka,
Yoshito Otake,
Yoshinobu Sato
Abstract:
Purpose: The motion of the common carotid artery tissue layers along the vessel axis during the cardiac cycle, observed in ultrasound imaging, is associated with the presence of established cardiovascular risk factors. However, the vast majority of the methods are based on the tracking of a single point, thus failing to capture the overall motion of the entire arterial wall. The aim of this work is to introduce a motion tracking framework able to simultaneously extract the trajectory of a large collection of points spanning the entire exploitable width of the image.
Method: The longitudinal motion, which is the main focus of the present work, is determined in two steps. First, a series of independent block matching operations are carried out for all the tracked points. Then, an original dynamic-programming approach is exploited to regularize the collection of similarity maps and estimate the globally optimal motion over the entire vessel wall. Sixty-two atherosclerotic participants at high cardiovascular risk were involved in this study.
Results: A dense displacement field, describing the longitudinal motion of the carotid far wall over time, was extracted. For each cine-loop, the method was evaluated against manual reference tracings performed on three local points, with an average absolute error of 150 ± 163 µm. A strong correlation was found between motion inhomogeneity and the presence of coronary artery disease (beta-coefficient = 0.586, p = 0.003).
Conclusions: To the best of our knowledge, this is the first method specifically proposed to assess the dense motion field of the carotid far wall. This approach has the potential to evaluate the (in)homogeneity of the wall dynamics. The proposed method shows promise for improving the analysis of arterial longitudinal motion and the understanding of the underlying patho-physiological parameters.
Submitted 18 May, 2020; v1 submitted 6 September, 2018;
originally announced September 2018.
-
Syn2Real: A New Benchmark for Synthetic-to-Real Visual Domain Adaptation
Authors:
Xingchao Peng,
Ben Usman,
Kuniaki Saito,
Neela Kaushik,
Judy Hoffman,
Kate Saenko
Abstract:
Unsupervised transfer of object recognition models from synthetic to real data is an important problem with many potential applications. The challenge is how to "adapt" a model trained on simulated images so that it performs well on real-world data without any additional supervision. Unfortunately, current benchmarks for this problem are limited in size and task diversity. In this paper, we present a new large-scale benchmark called Syn2Real, which consists of a synthetic domain rendered from 3D object models and two real-image domains containing the same object categories. We define three related tasks on this benchmark: closed-set object classification, open-set object classification, and object detection. Our evaluation of multiple state-of-the-art methods reveals a large gap in adaptation performance between the easier closed-set classification task and the more difficult open-set and detection tasks. We conclude that developing adaptation methods that work well across all three tasks presents a significant future challenge for syn2real domain transfer.
Submitted 25 June, 2018;
originally announced June 2018.
-
Open Set Domain Adaptation by Backpropagation
Authors:
Kuniaki Saito,
Shohei Yamamoto,
Yoshitaka Ushiku,
Tatsuya Harada
Abstract:
Numerous algorithms have been proposed for transferring knowledge from a label-rich domain (source) to a label-scarce domain (target). Almost all of them are proposed for a closed-set scenario, where the source and the target domain completely share the class of their samples. We call the shared class the "known class." However, in practice, when samples in the target domain are not labeled, we cannot know whether the domains share the class. A target domain can contain samples of classes that are not shared by the source domain. We call such classes the "unknown class," and algorithms that work well in the open-set situation are very practical. However, most existing distribution matching methods for domain adaptation do not work well in this setting because unknown target samples should not be aligned with the source.
In this paper, we propose a method for an open-set domain adaptation scenario that utilizes adversarial training. A classifier is trained to make a boundary between the source and the target samples, whereas a generator is trained to make target samples far from the boundary. Thus, we assign two options to the feature generator: aligning target samples with source known samples or rejecting them as unknown target samples. This approach allows extracting features that separate unknown target samples from known target samples. Our method was extensively evaluated in domain adaptation settings and outperformed other methods by a large margin in most settings.
Submitted 6 July, 2018; v1 submitted 27 April, 2018;
originally announced April 2018.