-
LLM-jp: A Cross-organizational Project for the Research and Development of Fully Open Japanese LLMs
Authors:
LLM-jp,
:,
Akiko Aizawa,
Eiji Aramaki,
Bowen Chen,
Fei Cheng,
Hiroyuki Deguchi,
Rintaro Enomoto,
Kazuki Fujii,
Kensuke Fukumoto,
Takuya Fukushima,
Namgi Han,
Yuto Harada,
Chikara Hashimoto,
Tatsuya Hiraoka,
Shohei Hisada,
Sosuke Hosokawa,
Lu Jie,
Keisuke Kamata,
Teruhito Kanazawa,
Hiroki Kanezashi,
Hiroshi Kataoka,
Satoru Katsumata,
Daisuke Kawahara,
Seiya Kawano
, et al. (57 additional authors not shown)
Abstract:
This paper introduces LLM-jp, a cross-organizational project for the research and development of Japanese large language models (LLMs). LLM-jp aims to develop open-source and strong Japanese LLMs, and as of this writing, more than 1,500 participants from academia and industry are working together for this purpose. This paper presents the background of the establishment of LLM-jp, summaries of its…
▽ More
This paper introduces LLM-jp, a cross-organizational project for the research and development of Japanese large language models (LLMs). LLM-jp aims to develop open-source and strong Japanese LLMs, and as of this writing, more than 1,500 participants from academia and industry are working together for this purpose. This paper presents the background of the establishment of LLM-jp, summaries of its activities, and technical reports on the LLMs developed by LLM-jp. For the latest activities, visit https://llm-jp.nii.ac.jp/en/.
△ Less
Submitted 4 July, 2024;
originally announced July 2024.
-
MELD-ST: An Emotion-aware Speech Translation Dataset
Authors:
Sirou Chen,
Sakiko Yahata,
Shuichiro Shimizu,
Zhengdong Yang,
Yihang Li,
Chenhui Chu,
Sadao Kurohashi
Abstract:
Emotion plays a crucial role in human conversation. This paper underscores the significance of considering emotion in speech translation. We present the MELD-ST dataset for the emotion-aware speech translation task, comprising English-to-Japanese and English-to-German language pairs. Each language pair includes about 10,000 utterances annotated with emotion labels from the MELD dataset. Baseline e…
▽ More
Emotion plays a crucial role in human conversation. This paper underscores the significance of considering emotion in speech translation. We present the MELD-ST dataset for the emotion-aware speech translation task, comprising English-to-Japanese and English-to-German language pairs. Each language pair includes about 10,000 utterances annotated with emotion labels from the MELD dataset. Baseline experiments using the SeamlessM4T model on the dataset indicate that fine-tuning with emotion labels can enhance translation performance in some settings, highlighting the need for further research in emotion-aware speech translation systems.
△ Less
Submitted 21 May, 2024;
originally announced May 2024.
-
J-CRe3: A Japanese Conversation Dataset for Real-world Reference Resolution
Authors:
Nobuhiro Ueda,
Hideko Habe,
Yoko Matsui,
Akishige Yuguchi,
Seiya Kawano,
Yasutomo Kawanishi,
Sadao Kurohashi,
Koichiro Yoshino
Abstract:
Understanding expressions that refer to the physical world is crucial for such human-assisting systems in the real world, as robots that must perform actions that are expected by users. In real-world reference resolution, a system must ground the verbal information that appears in user interactions to the visual information observed in egocentric views. To this end, we propose a multimodal referen…
▽ More
Understanding expressions that refer to the physical world is crucial for such human-assisting systems in the real world, as robots that must perform actions that are expected by users. In real-world reference resolution, a system must ground the verbal information that appears in user interactions to the visual information observed in egocentric views. To this end, we propose a multimodal reference resolution task and construct a Japanese Conversation dataset for Real-world Reference Resolution (J-CRe3). Our dataset contains egocentric video and dialogue audio of real-world conversations between two people acting as a master and an assistant robot at home. The dataset is annotated with crossmodal tags between phrases in the utterances and the object bounding boxes in the video frames. These tags include indirect reference relations, such as predicate-argument structures and bridging references as well as direct reference relations. We also constructed an experimental model and clarified the challenges in multimodal reference resolution tasks.
△ Less
Submitted 28 March, 2024;
originally announced March 2024.
-
AcTED: Automatic Acquisition of Typical Event Duration for Semi-supervised Temporal Commonsense QA
Authors:
Felix Virgo,
Fei Cheng,
Lis Kanashiro Pereira,
Masayuki Asahara,
Ichiro Kobayashi,
Sadao Kurohashi
Abstract:
We propose a voting-driven semi-supervised approach to automatically acquire the typical duration of an event and use it as pseudo-labeled data. The human evaluation demonstrates that our pseudo labels exhibit surprisingly high accuracy and balanced coverage. In the temporal commonsense QA task, experimental results show that using only pseudo examples of 400 events, we achieve performance compara…
▽ More
We propose a voting-driven semi-supervised approach to automatically acquire the typical duration of an event and use it as pseudo-labeled data. The human evaluation demonstrates that our pseudo labels exhibit surprisingly high accuracy and balanced coverage. In the temporal commonsense QA task, experimental results show that using only pseudo examples of 400 events, we achieve performance comparable to the existing BERT-based weakly supervised approaches that require a significant amount of training examples. When compared to the RoBERTa baselines, our best approach establishes state-of-the-art performance with a 7% improvement in Exact Match.
△ Less
Submitted 27 March, 2024;
originally announced March 2024.
-
Rapidly Developing High-quality Instruction Data and Evaluation Benchmark for Large Language Models with Minimal Human Effort: A Case Study on Japanese
Authors:
Yikun Sun,
Zhen Wan,
Nobuhiro Ueda,
Sakiko Yahata,
Fei Cheng,
Chenhui Chu,
Sadao Kurohashi
Abstract:
The creation of instruction data and evaluation benchmarks for serving Large language models often involves enormous human annotation. This issue becomes particularly pronounced when rapidly developing such resources for a non-English language like Japanese. Instead of following the popular practice of directly translating existing English resources into Japanese (e.g., Japanese-Alpaca), we propos…
▽ More
The creation of instruction data and evaluation benchmarks for serving Large language models often involves enormous human annotation. This issue becomes particularly pronounced when rapidly developing such resources for a non-English language like Japanese. Instead of following the popular practice of directly translating existing English resources into Japanese (e.g., Japanese-Alpaca), we propose an efficient self-instruct method based on GPT-4. We first translate a small amount of English instructions into Japanese and post-edit them to obtain native-level quality. GPT-4 then utilizes them as demonstrations to automatically generate Japanese instruction data. We also construct an evaluation benchmark containing 80 questions across 8 categories, using GPT-4 to automatically assess the response quality of LLMs without human references. The empirical results suggest that the models fine-tuned on our GPT-4 self-instruct data significantly outperformed the Japanese-Alpaca across all three base pre-trained models. Our GPT-4 self-instruct data allowed the LLaMA 13B model to defeat GPT-3.5 (Davinci-003) with a 54.37\% win-rate. The human evaluation exhibits the consistency between GPT-4's assessments and human preference. Our high-quality instruction data and evaluation benchmark have been released here.
△ Less
Submitted 6 March, 2024;
originally announced March 2024.
-
RecMind: Japanese Movie Recommendation Dialogue with Seeker's Internal State
Authors:
Takashi Kodama,
Hirokazu Kiyomaru,
Yin Jou Huang,
Sadao Kurohashi
Abstract:
Humans pay careful attention to the interlocutor's internal state in dialogues. For example, in recommendation dialogues, we make recommendations while estimating the seeker's internal state, such as his/her level of knowledge and interest. Since there are no existing annotated resources for the analysis, we constructed RecMind, a Japanese movie recommendation dialogue dataset with annotations of…
▽ More
Humans pay careful attention to the interlocutor's internal state in dialogues. For example, in recommendation dialogues, we make recommendations while estimating the seeker's internal state, such as his/her level of knowledge and interest. Since there are no existing annotated resources for the analysis, we constructed RecMind, a Japanese movie recommendation dialogue dataset with annotations of the seeker's internal state at the entity level. Each entity has a subjective label annotated by the seeker and an objective label annotated by the recommender. RecMind also features engaging dialogues with long seeker's utterances, enabling a detailed analysis of the seeker's internal state. Our analysis based on RecMind reveals that entities that the seeker has no knowledge about but has an interest in contribute to recommendation success. We also propose a response generation framework that explicitly considers the seeker's internal state, utilizing the chain-of-thought prompting. The human evaluation results show that our proposed method outperforms the baseline method in both consistency and the success of recommendations.
△ Less
Submitted 20 February, 2024;
originally announced February 2024.
-
Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine Translation of Lecture Transcripts
Authors:
Haiyue Song,
Raj Dabre,
Chenhui Chu,
Atsushi Fujita,
Sadao Kurohashi
Abstract:
Lecture transcript translation helps learners understand online courses, however, building a high-quality lecture machine translation system lacks publicly available parallel corpora. To address this, we examine a framework for parallel corpus mining, which provides a quick and effective way to mine a parallel corpus from publicly available lectures on Coursera. To create the parallel corpora, we…
▽ More
Lecture transcript translation helps learners understand online courses, however, building a high-quality lecture machine translation system lacks publicly available parallel corpora. To address this, we examine a framework for parallel corpus mining, which provides a quick and effective way to mine a parallel corpus from publicly available lectures on Coursera. To create the parallel corpora, we propose a dynamic programming based sentence alignment algorithm which leverages the cosine similarity of machine-translated sentences. The sentence alignment F1 score reaches 96%, which is higher than using the BERTScore, LASER, or sentBERT methods. For both English--Japanese and English--Chinese lecture translations, we extracted parallel corpora of approximately 50,000 lines and created development and test sets through manual filtering for benchmarking translation performance. Through machine translation experiments, we show that the mined corpora enhance the quality of lecture transcript translation when used in conjunction with out-of-domain parallel corpora via multistage fine-tuning. Furthermore, this study also suggests guidelines for gathering and cleaning corpora, mining parallel sentences, cleaning noise in the mined data, and creating high-quality evaluation splits. For the sake of reproducibility, we have released the corpora as well as the code to create them. The dataset is available at https://github.com/shyyhs/CourseraParallelCorpusMining.
△ Less
Submitted 6 November, 2023;
originally announced November 2023.
-
Dynamically Updating Event Representations for Temporal Relation Classification with Multi-category Learning
Authors:
Fei Cheng,
Masayuki Asahara,
Ichiro Kobayashi,
Sadao Kurohashi
Abstract:
Temporal relation classification is a pair-wise task for identifying the relation of a temporal link (TLINK) between two mentions, i.e. event, time, and document creation time (DCT). It leads to two crucial limits: 1) Two TLINKs involving a common mention do not share information. 2) Existing models with independent classifiers for each TLINK category (E2E, E2T, and E2D) hinder from using the whol…
▽ More
Temporal relation classification is a pair-wise task for identifying the relation of a temporal link (TLINK) between two mentions, i.e. event, time, and document creation time (DCT). It leads to two crucial limits: 1) Two TLINKs involving a common mention do not share information. 2) Existing models with independent classifiers for each TLINK category (E2E, E2T, and E2D) hinder from using the whole data. This paper presents an event centric model that allows to manage dynamic event representations across multiple TLINKs. Our model deals with three TLINK categories with multi-task learning to leverage the full size of data. The experimental results show that our proposal outperforms state-of-the-art models and two transfer learning baselines on both the English and Japanese data.
△ Less
Submitted 31 October, 2023;
originally announced October 2023.
-
Video-Helpful Multimodal Machine Translation
Authors:
Yihang Li,
Shuichiro Shimizu,
Chenhui Chu,
Sadao Kurohashi,
Wei Li
Abstract:
Existing multimodal machine translation (MMT) datasets consist of images and video captions or instructional video subtitles, which rarely contain linguistic ambiguity, making visual information ineffective in generating appropriate translations. Recent work has constructed an ambiguous subtitles dataset to alleviate this problem but is still limited to the problem that videos do not necessarily c…
▽ More
Existing multimodal machine translation (MMT) datasets consist of images and video captions or instructional video subtitles, which rarely contain linguistic ambiguity, making visual information ineffective in generating appropriate translations. Recent work has constructed an ambiguous subtitles dataset to alleviate this problem but is still limited to the problem that videos do not necessarily contribute to disambiguation. We introduce EVA (Extensive training set and Video-helpful evaluation set for Ambiguous subtitles translation), an MMT dataset containing 852k Japanese-English (Ja-En) parallel subtitle pairs, 520k Chinese-English (Zh-En) parallel subtitle pairs, and corresponding video clips collected from movies and TV episodes. In addition to the extensive training set, EVA contains a video-helpful evaluation set in which subtitles are ambiguous, and videos are guaranteed helpful for disambiguation. Furthermore, we propose SAFA, an MMT model based on the Selective Attention model with two novel methods: Frame attention loss and Ambiguity augmentation, aiming to use videos in EVA for disambiguation fully. Experiments on EVA show that visual information and the proposed methods can boost translation performance, and our model performs significantly better than existing MMT models. The EVA dataset and the SAFA model are available at: https://github.com/ku-nlp/video-helpful-MMT.git.
△ Less
Submitted 31 October, 2023;
originally announced October 2023.
-
Reformulating Domain Adaptation of Large Language Models as Adapt-Retrieve-Revise
Authors:
Zhen wan,
Yating Zhang,
Yexiang Wang,
Fei Cheng,
Sadao Kurohashi
Abstract:
While large language models (LLMs) like GPT-4 have recently demonstrated astonishing zero-shot capabilities in general domain tasks, they often generate content with hallucinations in specific domains such as Chinese law, hindering their application in these areas. This is typically due to the absence of training data that encompasses such a specific domain, preventing GPT-4 from acquiring in-doma…
▽ More
While large language models (LLMs) like GPT-4 have recently demonstrated astonishing zero-shot capabilities in general domain tasks, they often generate content with hallucinations in specific domains such as Chinese law, hindering their application in these areas. This is typically due to the absence of training data that encompasses such a specific domain, preventing GPT-4 from acquiring in-domain knowledge. A pressing challenge is that it's not plausible to continue training LLMs of such scale on in-domain data.
This paper introduces a simple and effective domain adaptation framework for GPT-4 by reformulating generation as an \textbf{adapt-retrieve-revise} process. The initial step is to \textbf{adapt} an affordable 7B LLM to the target domain by continuing learning on in-domain data. When solving a task, we leverage the adapted LLM to generate a draft answer given a task query. Then, the draft answer will be used to \textbf{retrieve} supporting evidence candidates from an external in-domain knowledge base. Finally, the draft answer and retrieved evidence are concatenated into a whole prompt to let GPT-4 assess the evidence and \textbf{revise} the draft answer to generate the final answer.
Our proposal combines the advantages of the efficiency of adapting a smaller 7B model with the evidence-assessing capability of GPT-4 and effectively prevents GPT-4 from generating hallucinatory content. In the zero-shot setting of four Chinese legal tasks, our method improves accuracy by 33.3\% compared to the direct generation by GPT-4. When compared to two stronger retrieval-based baselines, our method outperforms them by 15.4\% and 23.9\%. Our code will be released
△ Less
Submitted 12 October, 2023; v1 submitted 5 October, 2023;
originally announced October 2023.
-
SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation
Authors:
Haiyue Song,
Raj Dabre,
Chenhui Chu,
Sadao Kurohashi,
Eiichiro Sumita
Abstract:
Sub-word segmentation is an essential pre-processing step for Neural Machine Translation (NMT). Existing work has shown that neural sub-word segmenters are better than Byte-Pair Encoding (BPE), however, they are inefficient as they require parallel corpora, days to train and hours to decode. This paper introduces SelfSeg, a self-supervised neural sub-word segmentation method that is much faster to…
▽ More
Sub-word segmentation is an essential pre-processing step for Neural Machine Translation (NMT). Existing work has shown that neural sub-word segmenters are better than Byte-Pair Encoding (BPE), however, they are inefficient as they require parallel corpora, days to train and hours to decode. This paper introduces SelfSeg, a self-supervised neural sub-word segmentation method that is much faster to train/decode and requires only monolingual dictionaries instead of parallel corpora. SelfSeg takes as input a word in the form of a partially masked character sequence, optimizes the word generation probability and generates the segmentation with the maximum posterior probability, which is calculated using a dynamic programming algorithm. The training time of SelfSeg depends on word frequencies, and we explore several word frequency normalization strategies to accelerate the training phase. Additionally, we propose a regularization mechanism that allows the segmenter to generate various segmentations for one word. To show the effectiveness of our approach, we conduct MT experiments in low-, middle- and high-resource scenarios, where we compare the performance of using different segmentation methods. The experimental results demonstrate that on the low-resource ALT dataset, our method achieves more than 1.2 BLEU score improvement compared with BPE and SentencePiece, and a 1.1 score improvement over Dynamic Programming Encoding (DPE) and Vocabulary Learning via Optimal Transport (VOLT) on average. The regularization method achieves approximately a 4.3 BLEU score improvement over BPE and a 1.2 BLEU score improvement over BPE-dropout, the regularized version of BPE. We also observed significant improvements on IWSLT15 Vi->En, WMT16 Ro->En and WMT15 Fi->En datasets, and competitive results on the WMT14 De->En and WMT14 Fr->En datasets.
△ Less
Submitted 31 July, 2023;
originally announced July 2023.
-
MultiTool-CoT: GPT-3 Can Use Multiple External Tools with Chain of Thought Prompting
Authors:
Tatsuro Inaba,
Hirokazu Kiyomaru,
Fei Cheng,
Sadao Kurohashi
Abstract:
Large language models (LLMs) have achieved impressive performance on various reasoning tasks. To further improve the performance, we propose MultiTool-CoT, a novel framework that leverages chain-of-thought (CoT) prompting to incorporate multiple external tools, such as a calculator and a knowledge retriever, during the reasoning process. We apply MultiTool-CoT to the Task 2 dataset of NumGLUE, whi…
▽ More
Large language models (LLMs) have achieved impressive performance on various reasoning tasks. To further improve the performance, we propose MultiTool-CoT, a novel framework that leverages chain-of-thought (CoT) prompting to incorporate multiple external tools, such as a calculator and a knowledge retriever, during the reasoning process. We apply MultiTool-CoT to the Task 2 dataset of NumGLUE, which requires both numerical reasoning and domain-specific knowledge. The experiments show that our method significantly outperforms strong baselines and achieves state-of-the-art performance.
△ Less
Submitted 26 May, 2023;
originally announced May 2023.
-
Variable-length Neural Interlingua Representations for Zero-shot Neural Machine Translation
Authors:
Zhuoyuan Mao,
Haiyue Song,
Raj Dabre,
Chenhui Chu,
Sadao Kurohashi
Abstract:
The language-independency of encoded representations within multilingual neural machine translation (MNMT) models is crucial for their generalization ability on zero-shot translation. Neural interlingua representations have been shown as an effective method for achieving this. However, fixed-length neural interlingua representations introduced in previous work can limit its flexibility and represe…
▽ More
The language-independency of encoded representations within multilingual neural machine translation (MNMT) models is crucial for their generalization ability on zero-shot translation. Neural interlingua representations have been shown as an effective method for achieving this. However, fixed-length neural interlingua representations introduced in previous work can limit its flexibility and representation ability. In this study, we introduce a novel method to enhance neural interlingua representations by making their length variable, thereby overcoming the constraint of fixed-length neural interlingua representations. Our empirical results on zero-shot translation on OPUS, IWSLT, and Europarl datasets demonstrate stable model convergence and superior zero-shot translation results compared to fixed-length neural interlingua representations. However, our analysis reveals the suboptimal efficacy of our approach in translating from certain source languages, wherein we pinpoint the defective model component in our proposed method.
△ Less
Submitted 17 May, 2023;
originally announced May 2023.
-
Exploring the Impact of Layer Normalization for Zero-shot Neural Machine Translation
Authors:
Zhuoyuan Mao,
Raj Dabre,
Qianying Liu,
Haiyue Song,
Chenhui Chu,
Sadao Kurohashi
Abstract:
This paper studies the impact of layer normalization (LayerNorm) on zero-shot translation (ZST). Recent efforts for ZST often utilize the Transformer architecture as the backbone, with LayerNorm at the input of layers (PreNorm) set as the default. However, Xu et al. (2019) has revealed that PreNorm carries the risk of overfitting the training data. Based on this, we hypothesize that PreNorm may ov…
▽ More
This paper studies the impact of layer normalization (LayerNorm) on zero-shot translation (ZST). Recent efforts for ZST often utilize the Transformer architecture as the backbone, with LayerNorm at the input of layers (PreNorm) set as the default. However, Xu et al. (2019) has revealed that PreNorm carries the risk of overfitting the training data. Based on this, we hypothesize that PreNorm may overfit supervised directions and thus have low generalizability for ZST. Through experiments on OPUS, IWSLT, and Europarl datasets for 54 ZST directions, we demonstrate that the original Transformer setting of LayerNorm after residual connections (PostNorm) consistently outperforms PreNorm by up to 12.3 BLEU points. We then study the performance disparities by analyzing the differences in off-target rates and structural variations between PreNorm and PostNorm. This study highlights the need for careful consideration of the LayerNorm setting for ZST.
△ Less
Submitted 16 May, 2023;
originally announced May 2023.
-
Towards Speech Dialogue Translation Mediating Speakers of Different Languages
Authors:
Shuichiro Shimizu,
Chenhui Chu,
Sheng Li,
Sadao Kurohashi
Abstract:
We present a new task, speech dialogue translation mediating speakers of different languages. We construct the SpeechBSD dataset for the task and conduct baseline experiments. Furthermore, we consider context to be an important aspect that needs to be addressed in this task and propose two ways of utilizing context, namely monolingual context and bilingual context. We conduct cascaded speech trans…
▽ More
We present a new task, speech dialogue translation mediating speakers of different languages. We construct the SpeechBSD dataset for the task and conduct baseline experiments. Furthermore, we consider context to be an important aspect that needs to be addressed in this task and propose two ways of utilizing context, namely monolingual context and bilingual context. We conduct cascaded speech translation experiments using Whisper and mBART, and show that bilingual context performs better in our settings.
△ Less
Submitted 22 May, 2023; v1 submitted 16 May, 2023;
originally announced May 2023.
-
SuperDialseg: A Large-scale Dataset for Supervised Dialogue Segmentation
Authors:
Junfeng Jiang,
Chengzhang Dong,
Sadao Kurohashi,
Akiko Aizawa
Abstract:
Dialogue segmentation is a crucial task for dialogue systems allowing a better understanding of conversational texts. Despite recent progress in unsupervised dialogue segmentation methods, their performances are limited by the lack of explicit supervised signals for training. Furthermore, the precise definition of segmentation points in conversations still remains as a challenging problem, increas…
▽ More
Dialogue segmentation is a crucial task for dialogue systems allowing a better understanding of conversational texts. Despite recent progress in unsupervised dialogue segmentation methods, their performances are limited by the lack of explicit supervised signals for training. Furthermore, the precise definition of segmentation points in conversations still remains as a challenging problem, increasing the difficulty of collecting manual annotations. In this paper, we provide a feasible definition of dialogue segmentation points with the help of document-grounded dialogues and release a large-scale supervised dataset called SuperDialseg, containing 9,478 dialogues based on two prevalent document-grounded dialogue corpora, and also inherit their useful dialogue-related annotations. Moreover, we provide a benchmark including 18 models across five categories for the dialogue segmentation task with several proper evaluation metrics. Empirical studies show that supervised learning is extremely effective in in-domain datasets and models trained on SuperDialseg can achieve good generalization ability on out-of-domain data. Additionally, we also conducted human verification on the test set and the Kappa score confirmed the quality of our automatically constructed dataset. We believe our work is an important step forward in the field of dialogue segmentation. Our codes and data can be found from: https://github.com/Coldog2333/SuperDialseg.
△ Less
Submitted 15 October, 2023; v1 submitted 15 May, 2023;
originally announced May 2023.
-
Comprehensive Solution Program Centric Pretraining for Table-and-Text Hybrid Numerical Reasoning
Authors:
Qianying Liu,
Dongsheng Yang,
Wenjie Zhong,
Fei Cheng,
Sadao Kurohashi
Abstract:
Numerical reasoning over table-and-text hybrid passages, such as financial reports, poses significant challenges and has numerous potential applications. Noise and irrelevant variables in the model input have been a hindrance to its performance. Additionally, coarse-grained supervision of the whole solution program has impeded the model's ability to learn the underlying numerical reasoning process…
▽ More
Numerical reasoning over table-and-text hybrid passages, such as financial reports, poses significant challenges and has numerous potential applications. Noise and irrelevant variables in the model input have been a hindrance to its performance. Additionally, coarse-grained supervision of the whole solution program has impeded the model's ability to learn the underlying numerical reasoning process. In this paper, we propose three pretraining tasks that operate at both the whole program and sub-program level: Variable Integrity Ranking, which guides the model to focus on useful variables; Variable Operator Prediction, which decomposes the supervision into fine-grained single operator prediction; and Variable Keyphrase Masking, which encourages the model to identify key evidence that sub-programs are derived from. Experimental results demonstrate the effectiveness of our proposed methods, surpassing transformer-based model baselines.
△ Less
Submitted 12 May, 2023;
originally announced May 2023.
-
GPT-RE: In-context Learning for Relation Extraction using Large Language Models
Authors:
Zhen Wan,
Fei Cheng,
Zhuoyuan Mao,
Qianying Liu,
Haiyue Song,
Jiwei Li,
Sadao Kurohashi
Abstract:
In spite of the potential for ground-breaking achievements offered by large language models (LLMs) (e.g., GPT-3), they still lag significantly behind fully-supervised baselines (e.g., fine-tuned BERT) in relation extraction (RE). This is due to the two major shortcomings of LLMs in RE: (1) low relevance regarding entity and relation in retrieved demonstrations for in-context learning; and (2) the…
▽ More
In spite of the potential for ground-breaking achievements offered by large language models (LLMs) (e.g., GPT-3), they still lag significantly behind fully-supervised baselines (e.g., fine-tuned BERT) in relation extraction (RE). This is due to the two major shortcomings of LLMs in RE: (1) low relevance regarding entity and relation in retrieved demonstrations for in-context learning; and (2) the strong inclination to wrongly classify NULL examples into other pre-defined labels.
In this paper, we propose GPT-RE to bridge the gap between LLMs and fully-supervised baselines. GPT-RE successfully addresses the aforementioned issues by (1) incorporating task-specific entity representations in demonstration retrieval; and (2) enriching the demonstrations with gold label-induced reasoning logic. We evaluate GPT-RE on four widely-used RE datasets, and observe that GPT-RE achieves improvements over not only existing GPT-3 baselines, but also fully-supervised baselines. Specifically, GPT-RE achieves SOTA performances on the Semeval and SciERC datasets, and competitive performances on the TACRED and ACE05 datasets.
△ Less
Submitted 8 December, 2023; v1 submitted 3 May, 2023;
originally announced May 2023.
-
Textual Enhanced Contrastive Learning for Solving Math Word Problems
Authors:
Yibin Shen,
Qianying Liu,
Zhuoyuan Mao,
Fei Cheng,
Sadao Kurohashi
Abstract:
Solving math word problems is the task that analyses the relation of quantities and requires an accurate understanding of contextual natural language information. Recent studies show that current models rely on shallow heuristics to predict solutions and could be easily misled by small textual perturbations. To address this problem, we propose a Textual Enhanced Contrastive Learning framework, whi…
▽ More
Solving math word problems is the task that analyses the relation of quantities and requires an accurate understanding of contextual natural language information. Recent studies show that current models rely on shallow heuristics to predict solutions and could be easily misled by small textual perturbations. To address this problem, we propose a Textual Enhanced Contrastive Learning framework, which enforces the models to distinguish semantically similar examples while holding different mathematical logic. We adopt a self-supervised manner strategy to enrich examples with subtle textual variance by textual reordering or problem re-construction. We then retrieve the hardest to differentiate samples from both equation and textual perspectives and guide the model to learn their representations. Experimental results show that our method achieves state-of-the-art on both widely used benchmark datasets and also exquisitely designed challenge datasets in English and Chinese. \footnote{Our code and data is available at \url{https://github.com/yiyunya/Textual_CL_MWP}
△ Less
Submitted 29 November, 2022;
originally announced November 2022.
-
Rescue Implicit and Long-tail Cases: Nearest Neighbor Relation Extraction
Authors:
Zhen Wan,
Qianying Liu,
Zhuoyuan Mao,
Fei Cheng,
Sadao Kurohashi,
Jiwei Li
Abstract:
Relation extraction (RE) has achieved remarkable progress with the help of pre-trained language models. However, existing RE models are usually incapable of handling two situations: implicit expressions and long-tail relation types, caused by language complexity and data sparsity. In this paper, we introduce a simple enhancement of RE using $k$ nearest neighbors ($k$NN-RE). $k$NN-RE allows the mod…
▽ More
Relation extraction (RE) has achieved remarkable progress with the help of pre-trained language models. However, existing RE models are usually incapable of handling two situations: implicit expressions and long-tail relation types, caused by language complexity and data sparsity. In this paper, we introduce a simple enhancement of RE using $k$ nearest neighbors ($k$NN-RE). $k$NN-RE allows the model to consult training relations at test time through a nearest-neighbor search and provides a simple yet effective means to tackle the two issues above. Additionally, we observe that $k$NN-RE serves as an effective way to leverage distant supervision (DS) data for RE. Experimental results show that the proposed $k$NN-RE achieves state-of-the-art performances on a variety of supervised RE datasets, i.e., ACE05, SciERC, and Wiki80, along with outperforming the best model to date on the i2b2 and Wiki80 datasets in the setting of allowing using DS. Our code and models are available at: https://github.com/YukinoWan/kNN-RE.
△ Less
Submitted 30 January, 2023; v1 submitted 21 October, 2022;
originally announced October 2022.
-
ComSearch: Equation Searching with Combinatorial Strategy for Solving Math Word Problems with Weak Supervision
Authors:
Qianying Liu,
Wenyu Guan,
Jianhao Shen,
Fei Cheng,
Sadao Kurohashi
Abstract:
Previous studies have introduced a weakly-supervised paradigm for solving math word problems requiring only the answer value annotation. While these methods search for correct value equation candidates as pseudo labels, they search among a narrow sub-space of the enormous equation space. To address this problem, we propose a novel search algorithm with combinatorial strategy \textbf{ComSearch}, wh…
▽ More
Previous studies have introduced a weakly-supervised paradigm for solving math word problems requiring only the answer value annotation. While these methods search for correct value equation candidates as pseudo labels, they search among a narrow sub-space of the enormous equation space. To address this problem, we propose a novel search algorithm with combinatorial strategy \textbf{ComSearch}, which can compress the search space by excluding mathematically equivalent equations. The compression allows the searching algorithm to enumerate all possible equations and obtain high-quality data. We investigate the noise in the pseudo labels that hold wrong mathematical logic, which we refer to as the \textit{false-matching} problem, and propose a ranking model to denoise the pseudo labels. Our approach holds a flexible framework to utilize two existing supervised math word problem solvers to train pseudo labels, and both achieve state-of-the-art performance in the weak supervision task.
△ Less
Submitted 7 March, 2023; v1 submitted 13 October, 2022;
originally announced October 2022.
-
Seeking Diverse Reasoning Logic: Controlled Equation Expression Generation for Solving Math Word Problems
Authors:
Yibin Shen,
Qianying Liu,
Zhuoyuan Mao,
Zhen Wan,
Fei Cheng,
Sadao Kurohashi
Abstract:
To solve Math Word Problems, human students leverage diverse reasoning logic that reaches different possible equation solutions. However, the mainstream sequence-to-sequence approach of automatic solvers aims to decode a fixed solution equation supervised by human annotation. In this paper, we propose a controlled equation generation solver by leveraging a set of control codes to guide the model t…
▽ More
To solve Math Word Problems, human students leverage diverse reasoning logic that reaches different possible equation solutions. However, the mainstream sequence-to-sequence approach of automatic solvers aims to decode a fixed solution equation supervised by human annotation. In this paper, we propose a controlled equation generation solver by leveraging a set of control codes to guide the model to consider certain reasoning logic and decode the corresponding equations expressions transformed from the human reference. The empirical results suggest that our method universally improves the performance on single-unknown (Math23K) and multiple-unknown (DRAW1K, HMWP) benchmarks, with substantial improvements up to 13.2% accuracy on the challenging multiple-unknown datasets.
△ Less
Submitted 29 November, 2022; v1 submitted 21 September, 2022;
originally announced September 2022.
-
EMS: Efficient and Effective Massively Multilingual Sentence Embedding Learning
Authors:
Zhuoyuan Mao,
Chenhui Chu,
Sadao Kurohashi
Abstract:
Massively multilingual sentence representation models, e.g., LASER, SBERT-distill, and LaBSE, help significantly improve cross-lingual downstream tasks. However, the use of a large amount of data or inefficient model architectures results in heavy computation to train a new model according to our preferred languages and domains. To resolve this issue, we introduce efficient and effective massively…
▽ More
Massively multilingual sentence representation models, e.g., LASER, SBERT-distill, and LaBSE, help significantly improve cross-lingual downstream tasks. However, the use of a large amount of data or inefficient model architectures results in heavy computation to train a new model according to our preferred languages and domains. To resolve this issue, we introduce efficient and effective massively multilingual sentence embedding (EMS), using cross-lingual token-level reconstruction (XTR) and sentence-level contrastive learning as training objectives. Compared with related studies, the proposed model can be efficiently trained using significantly fewer parallel sentences and GPU computation resources. Empirical results showed that the proposed model significantly yields better or comparable results with regard to cross-lingual sentence retrieval, zero-shot cross-lingual genre classification, and sentiment classification. Ablative analyses demonstrated the efficiency and effectiveness of each component of the proposed model. We release the codes for model training and the EMS pre-trained sentence embedding model, which supports 62 languages ( https://github.com/Mao-KU/EMS ).
△ Less
Submitted 30 May, 2024; v1 submitted 31 May, 2022;
originally announced May 2022.
-
Relation Extraction with Weighted Contrastive Pre-training on Distant Supervision
Authors:
Zhen Wan,
Fei Cheng,
Qianying Liu,
Zhuoyuan Mao,
Haiyue Song,
Sadao Kurohashi
Abstract:
Contrastive pre-training on distant supervision has shown remarkable effectiveness in improving supervised relation extraction tasks. However, the existing methods ignore the intrinsic noise of distant supervision during the pre-training stage. In this paper, we propose a weighted contrastive learning method by leveraging the supervised data to estimate the reliability of pre-training instances an…
▽ More
Contrastive pre-training on distant supervision has shown remarkable effectiveness in improving supervised relation extraction tasks. However, the existing methods ignore the intrinsic noise of distant supervision during the pre-training stage. In this paper, we propose a weighted contrastive learning method by leveraging the supervised data to estimate the reliability of pre-training instances and explicitly reduce the effect of noise. Experimental results on three supervised datasets demonstrate the advantages of our proposed weighted contrastive learning approach compared to two state-of-the-art non-weighted baselines.Our code and models are available at: https://github.com/YukinoWan/WCL
△ Less
Submitted 10 February, 2023; v1 submitted 18 May, 2022;
originally announced May 2022.
-
When do Contrastive Word Alignments Improve Many-to-many Neural Machine Translation?
Authors:
Zhuoyuan Mao,
Chenhui Chu,
Raj Dabre,
Haiyue Song,
Zhen Wan,
Sadao Kurohashi
Abstract:
Word alignment has proven to benefit many-to-many neural machine translation (NMT). However, high-quality ground-truth bilingual dictionaries were used for pre-editing in previous methods, which are unavailable for most language pairs. Meanwhile, the contrastive objective can implicitly utilize automatically learned word alignment, which has not been explored in many-to-many NMT. This work propose…
▽ More
Word alignment has proven to benefit many-to-many neural machine translation (NMT). However, high-quality ground-truth bilingual dictionaries were used for pre-editing in previous methods, which are unavailable for most language pairs. Meanwhile, the contrastive objective can implicitly utilize automatically learned word alignment, which has not been explored in many-to-many NMT. This work proposes a word-level contrastive objective to leverage word alignments for many-to-many NMT. Empirical results show that this leads to 0.8 BLEU gains for several language pairs. Analyses reveal that in many-to-many NMT, the encoder's sentence retrieval performance highly correlates with the translation quality, which explains when the proposed method impacts translation. This motivates future exploration for many-to-many NMT to improve the encoder's sentence retrieval performance.
△ Less
Submitted 26 April, 2022;
originally announced April 2022.
-
Hierarchical Softmax for End-to-End Low-resource Multilingual Speech Recognition
Authors:
Qianying Liu,
Zhuo Gong,
Zhengdong Yang,
Yuhang Yang,
Sheng Li,
Chenchen Ding,
Nobuaki Minematsu,
Hao Huang,
Fei Cheng,
Chenhui Chu,
Sadao Kurohashi
Abstract:
Low-resource speech recognition has been long-suffering from insufficient training data. In this paper, we propose an approach that leverages neighboring languages to improve low-resource scenario performance, founded on the hypothesis that similar linguistic units in neighboring languages exhibit comparable term frequency distributions, which enables us to construct a Huffman tree for performing…
▽ More
Low-resource speech recognition has been long-suffering from insufficient training data. In this paper, we propose an approach that leverages neighboring languages to improve low-resource scenario performance, founded on the hypothesis that similar linguistic units in neighboring languages exhibit comparable term frequency distributions, which enables us to construct a Huffman tree for performing multilingual hierarchical Softmax decoding. This hierarchical structure enables cross-lingual knowledge sharing among similar tokens, thereby enhancing low-resource training outcomes. Empirical analyses demonstrate that our method is effective in improving the accuracy and efficiency of low-resource speech recognition.
△ Less
Submitted 30 April, 2023; v1 submitted 8 April, 2022;
originally announced April 2022.
-
Linguistically-driven Multi-task Pre-training for Low-resource Neural Machine Translation
Authors:
Zhuoyuan Mao,
Chenhui Chu,
Sadao Kurohashi
Abstract:
In the present study, we propose novel sequence-to-sequence pre-training objectives for low-resource machine translation (NMT): Japanese-specific sequence to sequence (JASS) for language pairs involving Japanese as the source or target language, and English-specific sequence to sequence (ENSS) for language pairs involving English. JASS focuses on masking and reordering Japanese linguistic units kn…
▽ More
In the present study, we propose novel sequence-to-sequence pre-training objectives for low-resource machine translation (NMT): Japanese-specific sequence to sequence (JASS) for language pairs involving Japanese as the source or target language, and English-specific sequence to sequence (ENSS) for language pairs involving English. JASS focuses on masking and reordering Japanese linguistic units known as bunsetsu, whereas ENSS is proposed based on phrase structure masking and reordering tasks. Experiments on ASPEC Japanese--English & Japanese--Chinese, Wikipedia Japanese--Chinese, News English--Korean corpora demonstrate that JASS and ENSS outperform MASS and other existing language-agnostic pre-training methods by up to +2.9 BLEU points for the Japanese--English tasks, up to +7.0 BLEU points for the Japanese--Chinese tasks and up to +1.3 BLEU points for English--Korean tasks. Empirical analysis, which focuses on the relationship between individual parts in JASS and ENSS, reveals the complementary nature of the subtasks of JASS and ENSS. Adequacy evaluation using LASER, human evaluation, and case studies reveals that our proposed methods significantly outperform pre-training methods without injected linguistic knowledge and they have a larger positive impact on the adequacy as compared to the fluency. We release codes here: https://github.com/Mao-KU/JASS/tree/master/linguistically-driven-pretraining.
△ Less
Submitted 20 January, 2022;
originally announced January 2022.
-
VISA: An Ambiguous Subtitles Dataset for Visual Scene-Aware Machine Translation
Authors:
Yihang Li,
Shuichiro Shimizu,
Weiqi Gu,
Chenhui Chu,
Sadao Kurohashi
Abstract:
Existing multimodal machine translation (MMT) datasets consist of images and video captions or general subtitles, which rarely contain linguistic ambiguity, making visual information not so effective to generate appropriate translations. We introduce VISA, a new dataset that consists of 40k Japanese-English parallel sentence pairs and corresponding video clips with the following key features: (1)…
▽ More
Existing multimodal machine translation (MMT) datasets consist of images and video captions or general subtitles, which rarely contain linguistic ambiguity, making visual information not so effective to generate appropriate translations. We introduce VISA, a new dataset that consists of 40k Japanese-English parallel sentence pairs and corresponding video clips with the following key features: (1) the parallel sentences are subtitles from movies and TV episodes; (2) the source subtitles are ambiguous, which means they have multiple possible translations with different meanings; (3) we divide the dataset into Polysemy and Omission according to the cause of ambiguity. We show that VISA is challenging for the latest MMT system, and we hope that the dataset can facilitate MMT research. The VISA dataset is available at: https://github.com/ku-nlp/VISA.
△ Less
Submitted 26 May, 2022; v1 submitted 20 January, 2022;
originally announced January 2022.
-
Cross-lingual Adaption Model-Agnostic Meta-Learning for Natural Language Understanding
Authors:
Qianying Liu,
Fei Cheng,
Sadao Kurohashi
Abstract:
Meta learning with auxiliary languages has demonstrated promising improvements for cross-lingual natural language processing. However, previous studies sample the meta-training and meta-testing data from the same language, which limits the ability of the model for cross-lingual transfer. In this paper, we propose XLA-MAML, which performs direct cross-lingual adaption in the meta-learning stage. We…
▽ More
Meta learning with auxiliary languages has demonstrated promising improvements for cross-lingual natural language processing. However, previous studies sample the meta-training and meta-testing data from the same language, which limits the ability of the model for cross-lingual transfer. In this paper, we propose XLA-MAML, which performs direct cross-lingual adaption in the meta-learning stage. We conduct zero-shot and few-shot experiments on Natural Language Inference and Question Answering. The experimental results demonstrate the effectiveness of our method across different languages, tasks, and pretrained models. We also give analysis on various cross-lingual specific settings for meta-learning including sampling strategy and parallelism.
△ Less
Submitted 10 November, 2021;
originally announced November 2021.
-
JaMIE: A Pipeline Japanese Medical Information Extraction System
Authors:
Fei Cheng,
Shuntaro Yada,
Ribeka Tanaka,
Eiji Aramaki,
Sadao Kurohashi
Abstract:
We present an open-access natural language processing toolkit for Japanese medical information extraction. We first propose a novel relation annotation schema for investigating the medical and temporal relations between medical entities in Japanese medical reports. We experiment with the practical annotation scenarios by separately annotating two different types of reports. We design a pipeline sy…
▽ More
We present an open-access natural language processing toolkit for Japanese medical information extraction. We first propose a novel relation annotation schema for investigating the medical and temporal relations between medical entities in Japanese medical reports. We experiment with the practical annotation scenarios by separately annotating two different types of reports. We design a pipeline system with three components for recognizing medical entities, classifying entity modalities, and extracting relations. The empirical results show accurate analyzing performance and suggest the satisfactory annotation quality, the effective annotation strategy for targeting report types, and the superiority of the latest contextual embedding models.
△ Less
Submitted 7 November, 2021;
originally announced November 2021.
-
Lightweight Cross-Lingual Sentence Representation Learning
Authors:
Zhuoyuan Mao,
Prakhar Gupta,
Pei Wang,
Chenhui Chu,
Martin Jaggi,
Sadao Kurohashi
Abstract:
Large-scale models for learning fixed-dimensional cross-lingual sentence representations like LASER (Artetxe and Schwenk, 2019b) lead to significant improvement in performance on downstream tasks. However, further increases and modifications based on such large-scale models are usually impractical due to memory limitations. In this work, we introduce a lightweight dual-transformer architecture wit…
▽ More
Large-scale models for learning fixed-dimensional cross-lingual sentence representations like LASER (Artetxe and Schwenk, 2019b) lead to significant improvement in performance on downstream tasks. However, further increases and modifications based on such large-scale models are usually impractical due to memory limitations. In this work, we introduce a lightweight dual-transformer architecture with just 2 layers for generating memory-efficient cross-lingual sentence representations. We explore different training tasks and observe that current cross-lingual training tasks leave a lot to be desired for this shallow architecture. To ameliorate this, we propose a novel cross-lingual language model, which combines the existing single-word masked language model with the newly proposed cross-lingual token-level reconstruction task. We further augment the training task by the introduction of two computationally-lite sentence-level contrastive learning tasks to enhance the alignment of cross-lingual sentence representation space, which compensates for the learning bottleneck of the lightweight transformer for generative tasks. Our comparisons with competing models on cross-lingual sentence retrieval and multilingual document classification confirm the effectiveness of the newly proposed training tasks for a shallow model.
△ Less
Submitted 27 May, 2022; v1 submitted 28 May, 2021;
originally announced May 2021.
-
Frustratingly Easy Edit-based Linguistic Steganography with a Masked Language Model
Authors:
Honai Ueoka,
Yugo Murawaki,
Sadao Kurohashi
Abstract:
With advances in neural language models, the focus of linguistic steganography has shifted from edit-based approaches to generation-based ones. While the latter's payload capacity is impressive, generating genuine-looking texts remains challenging. In this paper, we revisit edit-based linguistic steganography, with the idea that a masked language model offers an off-the-shelf solution. The propose…
▽ More
With advances in neural language models, the focus of linguistic steganography has shifted from edit-based approaches to generation-based ones. While the latter's payload capacity is impressive, generating genuine-looking texts remains challenging. In this paper, we revisit edit-based linguistic steganography, with the idea that a masked language model offers an off-the-shelf solution. The proposed method eliminates painstaking rule construction and has a high payload capacity for an edit-based model. It is also shown to be more secure against automatic detection than a generation-based method while offering better control of the security/payload capacity trade-off.
△ Less
Submitted 20 April, 2021;
originally announced April 2021.
-
Modeling and Utilizing User's Internal State in Movie Recommendation Dialogue
Authors:
Takashi Kodama,
Ribeka Tanaka,
Sadao Kurohashi
Abstract:
Intelligent dialogue systems are expected as a new interface between humans and machines. Such an intelligent dialogue system should estimate the user's internal state (UIS) in dialogues and change its response appropriately according to the estimation result. In this paper, we model the UIS in dialogues, taking movie recommendation dialogues as examples, and construct a dialogue system that chang…
▽ More
Intelligent dialogue systems are expected as a new interface between humans and machines. Such an intelligent dialogue system should estimate the user's internal state (UIS) in dialogues and change its response appropriately according to the estimation result. In this paper, we model the UIS in dialogues, taking movie recommendation dialogues as examples, and construct a dialogue system that changes its response based on the UIS. Based on the dialogue data analysis, we model the UIS as three elements: knowledge, interest, and engagement. We train the UIS estimators on a dialogue corpus with the modeled UIS's annotations. The estimators achieved high estimation accuracy. We also design response change rules that change the system's responses according to each UIS. We confirmed that response changes using the result of the UIS estimators improved the system utterances' naturalness in both dialogue-wise evaluation and utterance-wise evaluation.
△ Less
Submitted 5 December, 2020;
originally announced December 2020.
-
Reverse Operation based Data Augmentation for Solving Math Word Problems
Authors:
Qianying Liu,
Wenyu Guan,
Sujian Li,
Fei Cheng,
Daisuke Kawahara,
Sadao Kurohashi
Abstract:
Automatically solving math word problems is a critical task in the field of natural language processing. Recent models have reached their performance bottleneck and require more high-quality data for training. We propose a novel data augmentation method that reverses the mathematical logic of math word problems to produce new high-quality math problems and introduce new knowledge points that can b…
▽ More
Automatically solving math word problems is a critical task in the field of natural language processing. Recent models have reached their performance bottleneck and require more high-quality data for training. We propose a novel data augmentation method that reverses the mathematical logic of math word problems to produce new high-quality math problems and introduce new knowledge points that can benefit learning the mathematical reasoning logic. We apply the augmented data on two SOTA math word problem solving models and compare our results with a strong data augmentation baseline. Experimental results show the effectiveness of our approach. We release our code and data at https://github.com/yiyunya/RODA.
△ Less
Submitted 10 November, 2021; v1 submitted 4 October, 2020;
originally announced October 2020.
-
Minimize Exposure Bias of Seq2Seq Models in Joint Entity and Relation Extraction
Authors:
Ranran Haoran Zhang,
Qianying Liu,
Aysa Xuemo Fan,
Heng Ji,
Daojian Zeng,
Fei Cheng,
Daisuke Kawahara,
Sadao Kurohashi
Abstract:
Joint entity and relation extraction aims to extract relation triplets from plain text directly. Prior work leverages Sequence-to-Sequence (Seq2Seq) models for triplet sequence generation. However, Seq2Seq enforces an unnecessary order on the unordered triplets and involves a large decoding length associated with error accumulation. These introduce exposure bias, which may cause the models overfit…
▽ More
Joint entity and relation extraction aims to extract relation triplets from plain text directly. Prior work leverages Sequence-to-Sequence (Seq2Seq) models for triplet sequence generation. However, Seq2Seq enforces an unnecessary order on the unordered triplets and involves a large decoding length associated with error accumulation. These introduce exposure bias, which may cause the models overfit to the frequent label combination, thus deteriorating the generalization. We propose a novel Sequence-to-Unordered-Multi-Tree (Seq2UMTree) model to minimize the effects of exposure bias by limiting the decoding length to three within a triplet and removing the order among triplets. We evaluate our model on two datasets, DuIE and NYT, and systematically study how exposure bias alters the performance of Seq2Seq models. Experiments show that the state-of-the-art Seq2Seq model overfits to both datasets while Seq2UMTree shows significantly better generalization. Our code is available at https://github.com/WindChimeRan/OpenJERE .
△ Less
Submitted 6 October, 2020; v1 submitted 16 September, 2020;
originally announced September 2020.
-
A System for Worldwide COVID-19 Information Aggregation
Authors:
Akiko Aizawa,
Frederic Bergeron,
Junjie Chen,
Fei Cheng,
Katsuhiko Hayashi,
Kentaro Inui,
Hiroyoshi Ito,
Daisuke Kawahara,
Masaru Kitsuregawa,
Hirokazu Kiyomaru,
Masaki Kobayashi,
Takashi Kodama,
Sadao Kurohashi,
Qianying Liu,
Masaki Matsubara,
Yusuke Miyao,
Atsuyuki Morishima,
Yugo Murawaki,
Kazumasa Omura,
Haiyue Song,
Eiichiro Sumita,
Shinji Suzuki,
Ribeka Tanaka,
Yu Tanaka,
Masashi Toyoda
, et al. (4 additional authors not shown)
Abstract:
The global pandemic of COVID-19 has made the public pay close attention to related news, covering various domains, such as sanitation, treatment, and effects on education. Meanwhile, the COVID-19 condition is very different among the countries (e.g., policies and development of the epidemic), and thus citizens would be interested in news in foreign countries. We build a system for worldwide COVID-…
▽ More
The global pandemic of COVID-19 has made the public pay close attention to related news, covering various domains, such as sanitation, treatment, and effects on education. Meanwhile, the COVID-19 condition is very different among the countries (e.g., policies and development of the epidemic), and thus citizens would be interested in news in foreign countries. We build a system for worldwide COVID-19 information aggregation containing reliable articles from 10 regions in 7 languages sorted by topics. Our reliable COVID-19 related website dataset collected through crowdsourcing ensures the quality of the articles. A neural machine translation module translates articles in other languages into Japanese and English. A BERT-based topic-classifier trained on our article-topic pair dataset helps users find their interested information efficiently by putting articles into different categories.
△ Less
Submitted 11 October, 2020; v1 submitted 27 July, 2020;
originally announced August 2020.
-
JASS: Japanese-specific Sequence to Sequence Pre-training for Neural Machine Translation
Authors:
Zhuoyuan Mao,
Fabien Cromieres,
Raj Dabre,
Haiyue Song,
Sadao Kurohashi
Abstract:
Neural machine translation (NMT) needs large parallel corpora for state-of-the-art translation quality. Low-resource NMT is typically addressed by transfer learning which leverages large monolingual or parallel corpora for pre-training. Monolingual pre-training approaches such as MASS (MAsked Sequence to Sequence) are extremely effective in boosting NMT quality for languages with small parallel co…
▽ More
Neural machine translation (NMT) needs large parallel corpora for state-of-the-art translation quality. Low-resource NMT is typically addressed by transfer learning which leverages large monolingual or parallel corpora for pre-training. Monolingual pre-training approaches such as MASS (MAsked Sequence to Sequence) are extremely effective in boosting NMT quality for languages with small parallel corpora. However, they do not account for linguistic information obtained using syntactic analyzers which is known to be invaluable for several Natural Language Processing (NLP) tasks. To this end, we propose JASS, Japanese-specific Sequence to Sequence, as a novel pre-training alternative to MASS for NMT involving Japanese as the source or target language. JASS is joint BMASS (Bunsetsu MASS) and BRSS (Bunsetsu Reordering Sequence to Sequence) pre-training which focuses on Japanese linguistic units called bunsetsus. In our experiments on ASPEC Japanese--English and News Commentary Japanese--Russian translation we show that JASS can give results that are competitive with if not better than those given by MASS. Furthermore, we show for the first time that joint MASS and JASS pre-training gives results that significantly surpass the individual methods indicating their complementary nature. We will release our code, pre-trained models and bunsetsu annotated data as resources for researchers to use in their own NLP tasks.
△ Less
Submitted 7 May, 2020;
originally announced May 2020.
-
Pre-training via Leveraging Assisting Languages and Data Selection for Neural Machine Translation
Authors:
Haiyue Song,
Raj Dabre,
Zhuoyuan Mao,
Fei Cheng,
Sadao Kurohashi,
Eiichiro Sumita
Abstract:
Sequence-to-sequence (S2S) pre-training using large monolingual data is known to improve performance for various S2S NLP tasks in low-resource settings. However, large monolingual corpora might not always be available for the languages of interest (LOI). To this end, we propose to exploit monolingual corpora of other languages to complement the scarcity of monolingual corpora for the LOI. A case s…
▽ More
Sequence-to-sequence (S2S) pre-training using large monolingual data is known to improve performance for various S2S NLP tasks in low-resource settings. However, large monolingual corpora might not always be available for the languages of interest (LOI). To this end, we propose to exploit monolingual corpora of other languages to complement the scarcity of monolingual corpora for the LOI. A case study of low-resource Japanese-English neural machine translation (NMT) reveals that leveraging large Chinese and French monolingual corpora can help overcome the shortage of Japanese and English monolingual corpora, respectively, for S2S pre-training. We further show how to utilize script mapping (Chinese to Japanese) to increase the similarity between the two monolingual corpora leading to further improvements in translation quality. Additionally, we propose simple data-selection techniques to be used prior to pre-training that significantly impact the quality of S2S pre-training. An empirical comparison of our proposed methods reveals that leveraging assisting language monolingual corpora, data selection and script mapping are extremely important for NMT pre-training in low-resource scenarios.
△ Less
Submitted 22 January, 2020;
originally announced January 2020.
-
Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation
Authors:
Haiyue Song,
Raj Dabre,
Atsushi Fujita,
Sadao Kurohashi
Abstract:
Lectures translation is a case of spoken language translation and there is a lack of publicly available parallel corpora for this purpose. To address this, we examine a language independent framework for parallel corpus mining which is a quick and effective way to mine a parallel corpus from publicly available lectures at Coursera. Our approach determines sentence alignments, relying on machine tr…
▽ More
Lectures translation is a case of spoken language translation and there is a lack of publicly available parallel corpora for this purpose. To address this, we examine a language independent framework for parallel corpus mining which is a quick and effective way to mine a parallel corpus from publicly available lectures at Coursera. Our approach determines sentence alignments, relying on machine translation and cosine similarity over continuous-space sentence representations. We also show how to use the resulting corpora in a multistage fine-tuning based domain adaptation for high-quality lectures translation. For Japanese--English lectures translation, we extracted parallel data of approximately 40,000 lines and created development and test sets through manual filtering for benchmarking translation performance. We demonstrate that the mined corpus greatly enhances the quality of translation when used in conjunction with out-of-domain parallel corpora via multistage training. This paper also suggests some guidelines to gather and clean corpora, mine parallel sentences, address noise in the mined data, and create high-quality evaluation splits. For the sake of reproducibility, we will release our code for parallel data creation.
△ Less
Submitted 13 January, 2020; v1 submitted 25 December, 2019;
originally announced December 2019.
-
Emotion helps Sentiment: A Multi-task Model for Sentiment and Emotion Analysis
Authors:
Abhishek Kumar,
Asif Ekbal,
Daisuke Kawahra,
Sadao Kurohashi
Abstract:
In this paper, we propose a two-layered multi-task attention based neural network that performs sentiment analysis through emotion analysis. The proposed approach is based on Bidirectional Long Short-Term Memory and uses Distributional Thesaurus as a source of external knowledge to improve the sentiment and emotion prediction. The proposed system has two levels of attention to hierarchically build…
▽ More
In this paper, we propose a two-layered multi-task attention based neural network that performs sentiment analysis through emotion analysis. The proposed approach is based on Bidirectional Long Short-Term Memory and uses Distributional Thesaurus as a source of external knowledge to improve the sentiment and emotion prediction. The proposed system has two levels of attention to hierarchically build a meaningful representation. We evaluate our system on the benchmark dataset of SemEval 2016 Task 6 and also compare it with the state-of-the-art systems on Stance Sentiment Emotion Corpus. Experimental results show that the proposed system improves the performance of sentiment analysis by 3.2 F-score points on SemEval 2016 Task 6 dataset. Our network also boosts the performance of emotion analysis by 5 F-score points on Stance Sentiment Emotion Corpus.
△ Less
Submitted 28 November, 2019;
originally announced November 2019.
-
Automatically Neutralizing Subjective Bias in Text
Authors:
Reid Pryzant,
Richard Diehl Martinez,
Nathan Dass,
Sadao Kurohashi,
Dan Jurafsky,
Diyi Yang
Abstract:
Texts like news, encyclopedias, and some social media strive for objectivity. Yet bias in the form of inappropriate subjectivity - introducing attitudes via framing, presupposing truth, and casting doubt - remains ubiquitous. This kind of bias erodes our collective trust and fuels social conflict. To address this issue, we introduce a novel testbed for natural language generation: automatically br…
▽ More
Texts like news, encyclopedias, and some social media strive for objectivity. Yet bias in the form of inappropriate subjectivity - introducing attitudes via framing, presupposing truth, and casting doubt - remains ubiquitous. This kind of bias erodes our collective trust and fuels social conflict. To address this issue, we introduce a novel testbed for natural language generation: automatically bringing inappropriately subjective text into a neutral point of view ("neutralizing" biased text). We also offer the first parallel corpus of biased language. The corpus contains 180,000 sentence pairs and originates from Wikipedia edits that removed various framings, presuppositions, and attitudes from biased sentences. Last, we propose two strong encoder-decoder baselines for the task. A straightforward yet opaque CONCURRENT system uses a BERT encoder to identify subjective words as part of the generation process. An interpretable and controllable MODULAR algorithm separates these steps, using (1) a BERT-based classifier to identify problematic words and (2) a novel join embedding through which the classifier can edit the hidden states of the encoder. Large-scale human evaluation across four domains (encyclopedias, news headlines, books, and political speeches) suggests that these algorithms are a first step towards the automatic identification and reduction of bias.
△ Less
Submitted 12 December, 2019; v1 submitted 21 November, 2019;
originally announced November 2019.
-
Minimally Supervised Learning of Affective Events Using Discourse Relations
Authors:
Jun Saito,
Yugo Murawaki,
Sadao Kurohashi
Abstract:
Recognizing affective events that trigger positive or negative sentiment has a wide range of natural language processing applications but remains a challenging problem mainly because the polarity of an event is not necessarily predictable from its constituent words. In this paper, we propose to propagate affective polarity using discourse relations. Our method is simple and only requires a very sm…
▽ More
Recognizing affective events that trigger positive or negative sentiment has a wide range of natural language processing applications but remains a challenging problem mainly because the polarity of an event is not necessarily predictable from its constituent words. In this paper, we propose to propagate affective polarity using discourse relations. Our method is simple and only requires a very small seed lexicon and a large raw corpus. Our experiments using Japanese data show that our method learns affective events effectively without manually labeled data. It also improves supervised learning results when labeled data are small.
△ Less
Submitted 28 December, 2019; v1 submitted 2 September, 2019;
originally announced September 2019.
-
FAQ Retrieval using Query-Question Similarity and BERT-Based Query-Answer Relevance
Authors:
Wataru Sakata,
Tomohide Shibata,
Ribeka Tanaka,
Sadao Kurohashi
Abstract:
Frequently Asked Question (FAQ) retrieval is an important task where the objective is to retrieve an appropriate Question-Answer (QA) pair from a database based on a user's query. We propose a FAQ retrieval system that considers the similarity between a user's query and a question as well as the relevance between the query and an answer. Although a common approach to FAQ retrieval is to construct…
▽ More
Frequently Asked Question (FAQ) retrieval is an important task where the objective is to retrieve an appropriate Question-Answer (QA) pair from a database based on a user's query. We propose a FAQ retrieval system that considers the similarity between a user's query and a question as well as the relevance between the query and an answer. Although a common approach to FAQ retrieval is to construct labeled data for training, it takes annotation costs. Therefore, we use a traditional unsupervised information retrieval system to calculate the similarity between the query and question. On the other hand, the relevance between the query and answer can be learned by using QA pairs in a FAQ database. The recently-proposed BERT model is used for the relevance calculation. Since the number of QA pairs in FAQ page is not enough to train a model, we cope with this issue by leveraging FAQ sets that are similar to the one in question. We evaluate our approach on two datasets. The first one is localgovFAQ, a dataset we construct in a Japanese administrative municipality domain. The second is StackExchange dataset, which is the public dataset in English. We demonstrate that our proposed method outperforms baseline methods on these datasets.
△ Less
Submitted 23 May, 2019; v1 submitted 7 May, 2019;
originally announced May 2019.
-
A Multi-task Ensemble Framework for Emotion, Sentiment and Intensity Prediction
Authors:
Md Shad Akhtar,
Deepanway Ghosal,
Asif Ekbal,
Pushpak Bhattacharyya,
Sadao Kurohashi
Abstract:
In this paper, through multi-task ensemble framework we address three problems of emotion and sentiment analysis i.e. "emotion classification & intensity", "valence, arousal & dominance for emotion" and "valence & arousal} for sentiment". The underlying problems cover two granularities (i.e. coarse-grained and fine-grained) and a diverse range of domains (i.e. tweets, Facebook posts, news headline…
▽ More
In this paper, through multi-task ensemble framework we address three problems of emotion and sentiment analysis i.e. "emotion classification & intensity", "valence, arousal & dominance for emotion" and "valence & arousal} for sentiment". The underlying problems cover two granularities (i.e. coarse-grained and fine-grained) and a diverse range of domains (i.e. tweets, Facebook posts, news headlines, blogs, letters etc.). The ensemble model aims to leverage the learned representations of three deep learning models (i.e. CNN, LSTM and GRU) and a hand-crafted feature representation for the predictions. Experimental results on the benchmark datasets show the efficacy of our proposed multi-task ensemble frameworks. We obtain the performance improvement of 2-3 points on an average over single-task systems for most of the problems and domains.
△ Less
Submitted 15 October, 2018; v1 submitted 3 August, 2018;
originally announced August 2018.
-
Neural Adversarial Training for Semi-supervised Japanese Predicate-argument Structure Analysis
Authors:
Shuhei Kurita,
Daisuke Kawahara,
Sadao Kurohashi
Abstract:
Japanese predicate-argument structure (PAS) analysis involves zero anaphora resolution, which is notoriously difficult. To improve the performance of Japanese PAS analysis, it is straightforward to increase the size of corpora annotated with PAS. However, since it is prohibitively expensive, it is promising to take advantage of a large amount of raw corpora. In this paper, we propose a novel Japan…
▽ More
Japanese predicate-argument structure (PAS) analysis involves zero anaphora resolution, which is notoriously difficult. To improve the performance of Japanese PAS analysis, it is straightforward to increase the size of corpora annotated with PAS. However, since it is prohibitively expensive, it is promising to take advantage of a large amount of raw corpora. In this paper, we propose a novel Japanese PAS analysis model based on semi-supervised adversarial training with a raw corpus. In our experiments, our model outperforms existing state-of-the-art models for Japanese PAS analysis.
△ Less
Submitted 4 June, 2018; v1 submitted 4 June, 2018;
originally announced June 2018.
-
Knowledge-enriched Two-layered Attention Network for Sentiment Analysis
Authors:
Abhishek Kumar,
Daisuke Kawahara,
Sadao Kurohashi
Abstract:
We propose a novel two-layered attention network based on Bidirectional Long Short-Term Memory for sentiment analysis. The novel two-layered attention network takes advantage of the external knowledge bases to improve the sentiment prediction. It uses the Knowledge Graph Embedding generated using the WordNet. We build our model by combining the two-layered attention network with the supervised mod…
▽ More
We propose a novel two-layered attention network based on Bidirectional Long Short-Term Memory for sentiment analysis. The novel two-layered attention network takes advantage of the external knowledge bases to improve the sentiment prediction. It uses the Knowledge Graph Embedding generated using the WordNet. We build our model by combining the two-layered attention network with the supervised model based on Support Vector Regression using a Multilayer Perceptron network for sentiment analysis. We evaluate our model on the benchmark dataset of SemEval 2017 Task 5. Experimental results show that the proposed model surpasses the top system of SemEval 2017 Task 5. The model performs significantly better by improving the state-of-the-art system at SemEval 2017 Task 5 by 1.7 and 3.7 points for sub-tracks 1 and 2 respectively.
△ Less
Submitted 15 June, 2018; v1 submitted 20 May, 2018;
originally announced May 2018.
-
MMCR4NLP: Multilingual Multiway Corpora Repository for Natural Language Processing
Authors:
Raj Dabre,
Sadao Kurohashi
Abstract:
Multilinguality is gradually becoming ubiquitous in the sense that more and more researchers have successfully shown that using additional languages help improve the results in many Natural Language Processing tasks. Multilingual Multiway Corpora (MMC) contain the same sentence in multiple languages. Such corpora have been primarily used for Multi-Source and Pivot Language Machine Translation but…
▽ More
Multilinguality is gradually becoming ubiquitous in the sense that more and more researchers have successfully shown that using additional languages help improve the results in many Natural Language Processing tasks. Multilingual Multiway Corpora (MMC) contain the same sentence in multiple languages. Such corpora have been primarily used for Multi-Source and Pivot Language Machine Translation but are also useful for developing multilingual sequence taggers by transfer learning. While these corpora are available, they are not organized for multilingual experiments and researchers need to write boilerplate code every time they want to use said corpora. Moreover, because there is no official MMC collection it becomes difficult to compare against existing approaches. As such we present our work on creating a unified and systematically organized repository of MMC spanning a large number of languages. We also provide training, development and test splits for corpora where official splits are unavailable. We hope that this will help speed up the pace of multilingual NLP research and ensure that NLP researchers obtain results that are more trustable since they can be compared easily. We indicate corpora sources, extraction procedures if any and relevant statistics. We also make our collection public for research purposes.
△ Less
Submitted 14 February, 2019; v1 submitted 3 October, 2017;
originally announced October 2017.
-
Enabling Multi-Source Neural Machine Translation By Concatenating Source Sentences In Multiple Languages
Authors:
Raj Dabre,
Fabien Cromieres,
Sadao Kurohashi
Abstract:
In this paper, we explore a simple solution to "Multi-Source Neural Machine Translation" (MSNMT) which only relies on preprocessing a N-way multilingual corpus without modifying the Neural Machine Translation (NMT) architecture or training procedure. We simply concatenate the source sentences to form a single long multi-source input sentence while keeping the target side sentence as it is and trai…
▽ More
In this paper, we explore a simple solution to "Multi-Source Neural Machine Translation" (MSNMT) which only relies on preprocessing a N-way multilingual corpus without modifying the Neural Machine Translation (NMT) architecture or training procedure. We simply concatenate the source sentences to form a single long multi-source input sentence while keeping the target side sentence as it is and train an NMT system using this preprocessed corpus. We evaluate our method in resource poor as well as resource rich settings and show its effectiveness (up to 4 BLEU using 2 source languages and up to 6 BLEU using 5 source languages). We also compare against existing methods for MSNMT and show that our solution gives competitive results despite its simplicity. We also provide some insights on how the NMT system leverages multilingual information in such a scenario by visualizing attention.
△ Less
Submitted 3 March, 2019; v1 submitted 20 February, 2017;
originally announced February 2017.
-
An Empirical Comparison of Simple Domain Adaptation Methods for Neural Machine Translation
Authors:
Chenhui Chu,
Raj Dabre,
Sadao Kurohashi
Abstract:
In this paper, we propose a novel domain adaptation method named "mixed fine tuning" for neural machine translation (NMT). We combine two existing approaches namely fine tuning and multi domain NMT. We first train an NMT model on an out-of-domain parallel corpus, and then fine tune it on a parallel corpus which is a mix of the in-domain and out-of-domain corpora. All corpora are augmented with art…
▽ More
In this paper, we propose a novel domain adaptation method named "mixed fine tuning" for neural machine translation (NMT). We combine two existing approaches namely fine tuning and multi domain NMT. We first train an NMT model on an out-of-domain parallel corpus, and then fine tune it on a parallel corpus which is a mix of the in-domain and out-of-domain corpora. All corpora are augmented with artificial tags to indicate specific domains. We empirically compare our proposed method against fine tuning and multi domain methods and discuss its benefits and shortcomings.
△ Less
Submitted 3 February, 2017; v1 submitted 11 January, 2017;
originally announced January 2017.
-
Reading Comprehension using Entity-based Memory Network
Authors:
Xun Wang,
Katsuhito Sudoh,
Masaaki Nagata,
Tomohide Shibata,
Daisuke Kawahara,
Sadao Kurohashi
Abstract:
This paper introduces a novel neural network model for question answering, the \emph{entity-based memory network}. It enhances neural networks' ability of representing and calculating information over a long period by keeping records of entities contained in text. The core component is a memory pool which comprises entities' states. These entities' states are continuously updated according to the…
▽ More
This paper introduces a novel neural network model for question answering, the \emph{entity-based memory network}. It enhances neural networks' ability of representing and calculating information over a long period by keeping records of entities contained in text. The core component is a memory pool which comprises entities' states. These entities' states are continuously updated according to the input text. Questions with regard to the input text are used to search the memory pool for related entities and answers are further predicted based on the states of retrieved entities. Compared with previous memory network models, the proposed model is capable of handling fine-grained information and more sophisticated relations based on entities. We formulated several different tasks as question answering problems and tested the proposed model. Experiments reported satisfying results.
△ Less
Submitted 1 February, 2017; v1 submitted 12 December, 2016;
originally announced December 2016.