Mini-Omni2: Towards Open-Source GPT-4o With Vision, Speech and Duplex Capabilities

Mini-Omni2: Towards Open-source GPT-4o with
Vision, Speech and Duplex Capabilities
Zhifei Xie♠♣ Changqiao Wu♠

xzf24@mails.tsinghua.edu.cn wuchangqiao@inspirai.com
arXiv:2410.11190v2 [eess.AS] 16 Oct 2024
♠ ♣
Inspirai Tsinghua University
https://github.com/gpt-omni/mini-omni2
Abstract
GPT-4o, an all-encompassing model, represents a milestone in the development of
large multi-modal language models. It can understand visual, auditory, and textual
modalities, directly output audio, and support flexible duplex interaction. Models
from the open-source community often achieve some functionalities of GPT-4o,
such as visual understanding and voice chat. Nevertheless, training a unified model
that incorporates all modalities is challenging due to the complexities of multi-
modal data, intricate model architectures, and training processes. In this paper, we
introduce Mini-Omni2, a visual-audio assistant capable of providing real-time,
end-to-end voice responses to visoin and audio queries. By integrating pretrained
visual and auditory encoders, Mini-Omni2 maintains performance in individual
modalities. We propose a three-stage training process to align modalities, allowing
the language model to handle multi-modal inputs and outputs after training on
a limited dataset. For interaction, we introduce a command-based interruption
mechanism, enabling more flexible interaction with users. To the best of our
knowledge, Mini-Omni2 is one of the closest reproductions of GPT-4o, which
have similar form of functionality, and we hope it can offer valuable insights for
subsequent research.
“ Are you at seaside ? ”
Streaming 0 1 2 ...
output
Text tokens Streaming Audio Decoding
Audio tokens Output stream control
Status tokens “Stop Omni ”

Parrallel
···
generation
Mini-Omni Language Modeling
Audio Adapter Embedding Vision Adapter Embedding
Audio Encoder❄ Vision Encoder ❄ Omni
wav text image listen
Figure 1: The Mini-Omni2 model architecture.
Technical report.
1 Introduction
GPT-4o[Openai, 2024a] represents a milestone in the development of multi-modal large language
models, particularly evident in three aspects: (1) its powerful capabilities in multi-modal question an-
swering; (2) its ability to transcend traditional text-based input and output, enabling the understanding
and generation of multi-modal content; and (3) its flexible interaction mode with interruption mecha-
nisms, which facilitates a more natural and fluid human-computer interaction. However, the GPT-4o
model is not open-sourced to the public, and its technical specifications remain undisclosed. To date,
mainstream methods predominantly involve employing various pre-trained encoders to obtain textual
outputs for specific modalities, such as visual and audio inputs, and utilizing cascading techniques to
integrate a text-to-speech (TTS) module that replicates GPT-4o’s speech output capabilities, thereby
simulating its multi-modal functionalities. Achieving end-to-end multi-modal understanding and
output remains a challenging task.
Recently, as the capabilities of language models such as Llama3.2[meta, 2024] continue to expand,
researchers have begun to explore multi-modal approaches to achieve the performance of GPT-
4o. However, these research outcomes often focus only on specific functionalities of GPT-4o,
such as vision-text understanding (LLava[Liu et al., 2024], Flamingo[Alayrac et al., 2022]), audio
comprehension (Qwen2-audio[Chu et al., 2024]), multi-modal understanding (VITA[Fu et al., 2024]),
and speech-to-speech dialogue (Mini-Omni[Xie and Wu, 2024], Llama-Omni[Fang et al., 2024],
Moshi[Défossez et al., 2024]). However, integrating text, vision, and speech modalities remain
challenging.
In our view, the current challenges in achieving interaction across three modalities involve the
following aspects: (1) Model capability — GPT-4o requires a unified model that comprehensively
understands all modalities while maintaining robust performance across wide range of tasks; (2)
direct inference output capabilities in multi-modal contexts — our recent work Mini-Omni[Xie
and Wu, 2024] has addressed how to enhance the model’s streaming output abilities in audio, laying
the groundwork for Mini-Omni2’s voice interaction capabilities; (3) substantial data requirements
— training for GPT-4o necessitates the integration of data across visual, audio, and textual modalities,
with quantities increasing exponentially compared to previous efforts; (4) the design of flexible
interaction methods — GPT-4o’s full-duplex capability is also a notable feature.
In this paper, we introduce Mini-Omni2 as a continuation of Mini-Omni, employing a single model
to end-to-end simulate the visual, speech, and textual capabilities of GPT-4o, enhanced by a unique
command-based interruption mechanism. Consistent with Mini-Omni, we retain Qwen2[Yang et al.,
2024] as the foundational model, leveraging this compact architecture to achieve comprehensive
multi-modal understanding and real-time streaming speech inference across the three modalities.
Furthermore, we enable the model to receive external audio inputs in real time, simulating its
"auditory" perception and controlling the speech output stream based on content semantics. The
model architecture of Mini-Omni2 is illustrated in Figure 1. As an end-to-end model, we enhance
data utilization efficiency and demonstrate the generalizability of the Mini-Omni2 algorithm by
directly employing the classic pre-trained visual encoder CLIP[Radford et al., 2021] and the encoder
component of the speech recognition model Whisper[Radford et al., 2023] as feature extractors for
visual and audio inputs. The features from the pre-trained encoders and the text embedding are
concatenated to form the model’s input. Due to challenges related to understanding capabilities, we
did not adopt a token-in-token-out paradigm. Moreover, utilizing a delayed parallel output approach
for text and audio, the model can response instantly with audio like GPT-4o.
In Mini-Omni2, we propose an efficient training approach based on a limited amount of data, aiming
to enable the model’s training methods to assist other multi-modal models in modality expansion.
Thus, we avoided blindly expanding the dataset exponentially and instead sought to develop a multi-
modal extension method using minimal new data. We employed a three-phase training process
for modality expansion, alignment, and joint training. Initially, the Mini-Omni2 model underwent
adapter training using speech recognition and image caption datasets, thereby broadening the scope
of multi-modal understanding. Next, Mini-Omni2 was trained for text output in question-answering
tasks across modalities, allowing the adapter-based output features to align with text embedding for
effective question answering. In the third phase, we focused on multi-modal output capability by
incorporating audio output and training for auditory capabilities like interruption.
With respect to the model’s capabilities in voice interaction, Mini-Omni2 continues to utilize the
SNAC tokenizer[Siuzdak, 2024] to ensure high-quality speech output. However, based on our
2
Figure 2: Mini-Omni2 now supports streaming speech responses for image, audio and text inputs.
observations, we believe that the current full-duplex training is still not sufficiently stable. Therefore,
we contend that interruptions based on input semantic information are essential for achieving stable
and flexible human-machine interaction. We enable the model to perform real-time encoding of
its received "auditory" waveforms using SNAC, generating tokens that allow it to control its own
output during each generation. As a demonstration, we construct data using the phrase "stop omni,"
employing frame-level irq and n-irq special tokens to control the generating process.
To evaluate the multi-modal interaction capabilities of Mini-Omni2, we first empirically tested its
performance on traditional visual and auditory tasks, verifying that the model maintains consistency
with the original model in basic tasks such as image caption and speech recognition. Next, we
conducted a series of additional experiments to test the model’s response speed and perform some
case studies.
In summary, we make the following contributions:
• We introduce Mini-Omni2, the first open-source multi-modal language model with capabilities in
vision, speech, text and an auditory interruption mechanism. To the best of our knowledge, it is one
of the most similar end-to-end models to the GPT-4o’s functionalities. Figure 2 shows the demo of
the model as a visual voice assistant.
• We propose a novel training pipeline based on the modal expansion method from the previous
Mini-Omni. This pipeline encompasses three training phases, allowing the text model to first align
responses to multi-modal inputs, and ultimately extend outputs to the speech modality in the final
phase, employing a delayed parallel generation algorithm for real-time speech output.
• We explored a command-based interruption method, utilizing streaming tokens as input and
constructing training data to enable the model to control its audio output stream based on external
semantic cues. And all the synthetic data will be open-sourced.
2 Related Work
Large Vision Language Models Recent vision-language models are developing rapidly and were
among the first modalities to combine with large language models. The foundational work began
with CLIP[Radford et al., 2021], which is also used as vision encoder in our work. Subsequent works
typically employ a vision encoder, an adapter as an intermediate layer, and a large language model
as the architecture to enable the LLM to understand and reason about visual inputs. Classic works
include BLIP[Li et al., 2022], BLIP2[Li et al., 2022], Llava[Liu et al., 2024], Qwen-VL[Bai et al.,
2023], Qwen2-VL[Wang et al., 2024], InstructBLIP[Dai et al., 2023], MiniGPT-4[Zhu et al., 2023],
GPT-4V[Openai, 2024b] from OpenAI, Gemini[Google, 2024] from Google, and Llama-3.2[meta,
2024] from Meta. Researchers are also exploring other directions, such as higher-resolution vision
3
encoders like InternLM-XComposer2-4KHD[Dong et al., 2024] and using MOE architectures, as in
works like CogVLM[Wang et al., 2023b]. The method used in this paper is the most classical, which
is similar to Llava[Liu et al., 2024].
Audio Language Modeling With the further development of large multi-modal models, speech
signals have also been discretized into tokens, enabling understanding and reasoning in a manner
similar to text models. Important works include speech synthesis models like VALL-E[Wang et al.,
2023a], music generation models like MusicGen[Copet et al., 2024], as well as voice interaction works
like AudioPaLM[Rubenstein et al., 2023] and LauraGPT[Chen et al., 2023]. Just recently, researchers
have explored methods for speech-to-speech interaction, with works such as Mini-Omni[Xie and Wu,
2024], Llama-Omni[Fang et al., 2024], and Moshi[Défossez et al., 2024]. Speech tokenization is
also an important direction for generating stable and information-rich tokens, with recent works like
Speechtokenizer[Zhang et al., 2023b], Google USM[Zhang et al., 2023c], and EnCodec[Défossez
et al., 2022].
Multi-modal Interaction Model With the emergence of GPT-4o, researchers have begun working
on end-to-end multi-modal models for voice chat. Early works include Spectron[Nachmani et al.,
2023] and SpeechGPT[Zhang et al., 2023a], which use the A-T-T-A method to achieve speech-in
and speech-out in an end-to-end manner. Mini-Omni[Xie and Wu, 2024] introduced a method for
parallel generation of text and audio, enabling the model to directly start reasoning in audio. Both
Moshi[Défossez et al., 2024] and Llama-Omni[Fang et al., 2024] used similar approaches. LSLM[Ma
et al., 2024] and Moshi explored the full duplex interaction capability by combining the speaking
and listening signals as input. VITA[Fu et al., 2024] can understand all modalities but only outputs
text. The AnyGPT[Zhan et al., 2024] project aims to achieve full multi-modal understanding and
generation. This work is a continuation of Mini-Omni, aiming to realize multi-modal input and
low-latency parallel speech-text output with duplex capability.
3 Mini-Omni2
The model architecture of Mini-Omni2 is illustrated in Figure 1. In addition to the text embedding
module, Mini-Omni2 employs the visual component of CLIP and Whisper-small as encoders for
visual and auditory modalities, resulting in highly efficient data utilization during training and
minimizing extensive pre-training efforts. Additionally, Mini-Omni2 features real-time duplex
capability, providing greater flexibility in model interactions. This section includes 3.1, which
discusses the model architecture; 3.2, which presents the modeling methods for input and output
streams; and sections 3.3 and 3.4, which detail training methods and interuption manner, respectively.
3.1 Architecture
Visual Encoder - We utilize the visual component of CLIP, specifically the ViT-B/32 model, as
the visual encoder, which converts incoming images to a feature sequence of length 49 for the
image patches and a global semantic feature. Mini-Omni2 concatenates these to form a raw feature
sequence of length 50, employing a single-layer LlamaMLP[Touvron et al., 2023] as the vision
adapter.
Audio Encoder - In the encoder section, we continue our previous work by using the Whisper-small
model as the audio encoder. We opted not to adopt a token-in-token-out modeling approach for audio
input and output for two reasons. (i) Strong semantic alignment in speech recognition. The Whisper
model, proposed by OpenAI, is trained on thousands of hours of datasets, demonstrating exceptional
robustness. Furthermore, we unexpectedly found that Mini-Omni exhibits an understanding of
Chinese data, despite not being trained on any Chinese datasets. We believe this is due to the Whisper
model’s capability to automatically align audio from different languages, tones, and noise levels that
convey the same meaning, thereby enabling the model to focus on the user’s intention. (ii) Unstable
open-source audio tokens. We observed a phenomenon where a) the audio loss of Mini-Omni2
remains high during training, and b) the tokens for a segment of audio can vary significantly based
on the content at both ends. We argue that tokens are insufficient for reliably conveying the content
of speech input, as evidenced by the poor performance of ASR comparing to semantic features like
Whisper.
Language Model - Mini-Omni2 uses the Qwen2-0.5B base version as its foundational language
model. We have ported the Llama-based Qwen2 model using the LitGPT[AI, 2023] training frame-
work, employing the configuration of the 0.5B model as the base language model. For the parallel
4
Mini-Omni2
Omni LM head
(a Visual-QA[Audio]
Mini-Omni2
Omni LM head
(b Visual-QA[Text]
Mini-Omni2
Omni LM head
(c QA[Audio]
Mini-Omni2
Omni LM head
(d Audio input text response

Mini-Omni2
Omni LM head
(e QA[Text]
Mini-Omni2
Omni LM head
(f Streaming control
audio feature visual feature text token <spt> | status <pad>
Figure 3: Schematic diagram of multi-layer tokens for input and output of the main task model of
Mini-Omni2.
5
generation of the multi-layer codebook shown in Figure 3, we expanded the vocabulary of the Qwen2
model by adding 7 × 4160 sub-LM-heads, as illustrated in Figure 4, resulting in a vocabulary size of
181,120.
Figure 4: The special multi-layer vocabulary construction of Mini-Omni2.
3.2 Multimodal Languague Modeling

Multimodal Modeling - Consider Y = (yi ∈ Vtxt | i = 1, . . . , ttxt ) as a text utterance from a
Qttxt
vocabulary Vtxt with length ttxt . The probability of Y can be expressed as p(Y ) = i=1 p(yi |
y1 , . . . , yi−1 ). Now, when dealing with a continuous speech signal, we can convert it into discrete
speech tokens (dst), represented as D = (di ∈ Vdst |i = 1, · · · , tdst ) using a audio tokenizer. In this
context Vdst is the vocabulary of discrete speech tokens. These discrete speech tokens can be treated
as spoken language within Vdst and modeled in a manner similar to text. We combine text and speech
in a new vocabulary Vvoxt by Vvoxt = Vtxt ∪ Vdst . Additionally, we introduce visual features V ∈ Fvis ,
where Fvis represents the continuous features extracted from the image. Therefore, we can model the
probability of both speech, text, where Z = (zi ∈ V|i = 1, · · · , t). This probability is expressed as
Qt
p(Z) = i=1 p(zi | z1 , · · · , zi−1 , V ), where Z represents discrete speech tokens D(V = Vdst ), text
tokens Y (V = Vtxt ), and continuous video features V (F = Fvis ), or various combinations of Y , D,
and V . For the audio and text tokens generated simultaneously, the negative log-likelihood loss can
be formulated as in Equation (1).
nj
m X
X
L(T, A, V |C) = log P (Ti,j , Ai,j |T<i,j , A<i,j , Vj ; Xj ) (1)
j=1 i=1
where T , A are the text-audio output pairs in the training corpus C, and m is the number of training
examples. Xj and Vj is the input condition of the j-th example, nj is the max number of tokens of
sample Tj and Aj , and Ti,j and Ai,j represent the i-th text token and audio token of the j-th sample.
Multi-modal token-Mixed Input - The modeling of input and output tokens for some of the model’s
main tasks is illustrated in Figure 3. In this section, we will discuss the model’s inputs and outputs.
Since the model incorporates multiple LM-heads, it generates multiple sequences in an auto-regressive
manner. As a result, the model also takes multiple sequences as inputs. The input sequences can
include a mixed input from a minimum of one modality to a maximum of three modalities. In this
subsection, we will discuss the methods for modality mixing.
• Visual-[Audio|Text] Input Our experiments indicate that the Transformer architecture is easier to
train and generates more natural responses when auto-regressive tasks are connected with semantic
information. Therefore, as shown in Figure 3 (a), we first place the visual features processed by the
vision adapter, followed by the Whisper features processed by the audio adapter. Finally, at the
position where a response needs to be generated auto-regressively, we place a special token for the
response. The total length is approximately 50(CLIP feature length) + La (Whisper feature length).
• Single Modality Input Single-modal inputs may consist of visual, speech, or text inputs. We
place the features of both visual and audio modalities across layers 1 to 7. These features will be
replicated to enhance their prominence when averaged across all layer features. Notably, when
only a single modality’s features are input without the control of a special token, the default tasks
are image caption, speech-to-text question answering, and text-to-text question answering.
Text-Audio Parrallel Decoding In Mini-Omni2, we essentially retain the output strategy of Mini-
Omni, employing the Text-Instruct Delay Parallel Decoding algorithm to enhance audio generation.
This approach utilizes text-audio parallel decoding to simultaneously generate audio and text tokens,
leveraging text-to-speech synthesis for real-time output. We continue the parallel generation method
introduced by MusicGen[Copet et al., 2024], utilizing SNAC as the audio encoder, which comprises
seven complementary token layers. In a single step, we generate eight tokens, including text, while
maintaining a one-step delay between layers. Furthermore, we incorporate a Batch approach that
6
Mini-Omni Mini-Omni Mini-Omni Mini-Omni
❄ ❄ � �
Audio Adapter � Vision Adapter � Audio Adapter ❄ Vision Adapter❄ Embedding Audio Adapter❄
Audio Encoder❄ Vision Encoder ❄ Audio Encoder❄ Vision Encoder ❄ Audio Encoder❄
wav image wav image text wav

stage 1 Multimodal Encoder Adaptation stage 2 Modality Alignment with Languague Model
Mini-Omni Mini-Omni Mini-Omni

� � �
Vision Adapter� Embedding Audio Adapter� Audio Adapter � Embedding Embedding

Audio Adapter
Vision Encoder ❄ Audio Encoder ❄ Audio Encoder❄
image text wav wav text wav

stage 3 Post-Training of the Model's Multimodal Output Capabilities / Semantic Interruption
Figure 5: Mini-Omni2’s three-stage training phases
involves two samples: one requiring both text and audio responses and the other necessitating a
text-only response. By discarding the text token from the first sample and embedding the output from
the second sample into the first, we effectively transfer the model’s text-based capabilities to audio
tasks, significantly enhancing reasoning abilities with minimal resource overhead. We have provided
detailed explanations of the specific technical details in the Mini-Omni[Xie and Wu, 2024].
Overall, we have introduced our modeling approach for three-modal inputs and two-modal outputs
within a single model. Through these methods, the model can accomplish eight reasonable multi-
modal tasks, with some of the primary tasks illustrated in Figure 3, showcasing all the multilayer
tokens generated during a single inference.
3.3 Training Strategies

In this section, we will introduce the training phase of the Mini-Omni2 model. The overall training
process of Mini-Omni2 is illustrated in Figure 5. The training process is divided into three stages,
with multitask training employed in each stage. In the figure, except for Stage 1, a foundational
text-to-text task is additionally incorporated but not explicitly depicted. We categorize the entire
training process into three stages:
• Multimodal Encoder Adaptation In the first stage, we employ a rapid, small-scale training focused
solely on the weights of the linear layer connecting the language model and the encoder. The
objective of Stage 1 is to ensure that the multi-modal features received by the model closely
resemble the characteristics of text tokens as represented in the model’s embedding layer. We
believe this approach offers two primary advantages: 1. It allows the model to concentrate on logical
reasoning in modality-specific question answering during subsequent training. 2. It minimizes the
parameter changes in the language model’s core that would otherwise result from adapting to other
modalities.
• Modality Alignment In stage 2, the primary task of the model training is to transfer the question-
answering ability based on text input to question-answering abilities based on images and audio. In
this step, the adapters trained in stage 1 are temporarily frozen, and the weights of the language
model are involved in the training. At this stage, all tasks do not involve audio responses. For tasks
like image-based and audio-based QA, only text-based responses are generated to establish the
7
model’s foundational logical capabilities. The speech output is simply an extension of this logical
ability into different modalities.
• Post training In Stage 3, the task of the model is to extend the output modality to include audio
response generation. As shown in Figure 5, the model will be trained on all tasks from Stage 1 and
Stage 2, with audio token outputs for all question-answering tasks. Additionally, the model will
learn interruption mechanism, an algorithm introduced in the next section.
3.4 Duplex Interaction
A real-time conversation model needs to have duplex capability in order to enable more flexible
interactions. However, this interruption mechanism should not be a simple VAD (Voice Activity
Detection)-based one, but rather a system that can determine whether the user intends to interrupt the
model. Additionally, the model’s ability should be highly robust, capable of handling various external
situations (e.g., noise, other conversations, and unrelated sounds). We explore this functionality with
command-based task, where the model will stop talking immediately when user speaks "Stop Omni".
Furthermore, this approach can be naturally extended to incorporate more sophisticated semantic
interruption mechanisms through the development of more contextually appropriate interruption
datasets.
Background Noise Selection: (1) We randomly utilized a variety of speech recognition samples
from the Libri-tts dataset as the original human noise data samples. (2) We employed samples from
the MUSAN[Snyder et al., 2015] dataset, which includes music, human voices, white noise, and
urban noise.
Semantic Interruption Construction: We synthesized "Stop Omni" phrases with random voice
timbres, which were subsequently mixed with noise. The specific data construction methods are
introduced in the next section.
Combining the aforementioned data, the model will receive long sequences of data containing "Stop
Omni" phrases amidst various noises. The model will generate two types of state tokens in real time:
irq and n-irq, representing the intention of the user to interrupt and not to interrupt, respectively.
During inference, when the model output irq token, it will stop the generating process and start
to listen the new question. For this task, we use tokens as input to enhance the model’s real-time
processing capabilities.
4 Data and Evaluation

In this section, we introduce the data used for training Mini-Omni2 and present some initial evaluation
results. We will provide a more detailed explanation of the composition and construction process of
the data for each modality. In the experimental results section, we only showcase a few application
cases and basic capability assessments. More comprehensive experiments related text and vision
tasks will be updated shortly.
4.1 Datasets
The training data for the Mini-Omni2 model is primarily sourced from five components, as shown in
Table 1. (1) Textual Question-Answering Data: Throughout all training stages, whenever the language
model weights were unfrozen for training, textual question-answering data was included to maintain
the model’s reasoning ability. We used the first 1.5 million question-answer pairs from the Open-Orca
dataset. (2) Speech Recognition Data: Speech recognition data was used to continuously maintain
the model’s semantic understanding of external spoken input. We primarily utilized the LibriTTS,
VCTK, and Multilingual LibriSpeech datasets. (3) Spoken Question-Answering Data: We did not
use a standalone spoken dataset; instead, synthetic data was employed for training. The spoken
question-answering data was derived from the Moss-002-sft dataset. (4) Image Question-Answering
Data: We used 400,000 samples (caption and instruction) from the ALLaVA-4V dataset. (5) Voice
Assistant Data: To make the model’s responses more aligned with the style of voice assistants, we
continuously used the VoiceAssistant-400K dataset introduced in Mini-Omni.
4.2 Training Parameters
The Mini-Omni2 model completed all training steps on eight A100 GPUs. During the adapter
training stage, learning rates ranged from 2e-5 to 1e-3, while training the language model used
learning rates between 2e-6 and 2e-4. The final fine-tuning was conducted with learning rates ranging
8
Task Stages Dataset Modality items
Libritts [Zen et al., 2019] A1|T1 586 h
ASR 1,2,3 VCTK [datashare, 2024] A1|T1 44 h
Multilingual LibriSpeech [Pratap et al., 2020] A1|T1 8000h
Text QA 2,3 Open-Orca [OpenOrca] T1|T2 2000K
Audio QA 2,3 Moss-002-sft-data [Sun et al., 2024] A1|T1|A2|T2 1500K
Visual QA 2,3 ALLaVA-4V [Sun et al., 2024] V|A1|T1|A2|T2 800K
Alpaca-GPT4 [vicgalle, 2024] A1|T1|A2|T2 55k
Identity finetune [sayan1101, 2024] A1|T1|A2|T2 2k
QAassistant [Mihaiii, 2024a] A1|T1|A2|T2 27k
voice QA final Rlhf [Anthropic, 2024] A1|T1|A2|T2 367k
Trivia-singlechoice [Mihaiii, 2024c] A1|T1|A2|T2 17k
Trivia-Multichoice [Mihaiii, 2024b] A1|T1|A2|T2 20k
OpenAssistant [OpenAssistan, 2024] A1|T1|A2|T2 2k
Table 1: The datasets and their usage for training Mini-Omni2.
from 2e-6 to 2e-5. A cosine scheduler was employed, with 1,500 warm-up steps and a global batch
size of 192. Each stage was trained for one epoch using the full dataset. The scale of the vision and
audio encoders was described earlier, and the language model used was the Qwen2-0.5B base model.
All model adapters used Llama-MLP with an intermediate size of 4,864.
4.3 Data Construction

Spoken Dialogue Data: We used our speech recognition dataset as a random voice timbre library.
To ensure robust training, a random sample from this dataset was selected as a voice prompt for the
input of all spoken dialogue data, and CosyVoice[Du et al., 2024] was employed for zero-shot speech
synthesis. For the output of all question-answering data, the same voice timbre was used from an
internal TTS system.
Interruption Data: First, the noise data is stream-encoded and decoded to simulate real-time
streaming input to the model. Then, a random segment of the noise data is extracted. At the end of
this segment, a "Stop Omni" phrase is inserted, generated with a random voice timbre in the same
manner as the dialogue data. Finally, an additional "tail" of 0-10 seconds is appended to the end of
this segment. In terms of labeling, all data before the tail is labeled as "n-irq", while the tail segment
is labeled as "irq", indicating that the model should be interrupted.
4.4 Experimental Results

Currently, we provide the accuracy of Mini-Omni2 in speech recognition to evaluate the model’s
speech understanding ability, and we present some practical cases. For model experience and more
cases, please follow our github repositories.
Method test-clean test-other dev-clean dev-other

Wav2vec2-base [Baevski et al., 2020] 6.0 13.4 - -
VITA [Fu et al., 2024] 8.14 18.41 7.57 16.57
Whisper-small* 4.4 10.1 4.6 10.3
Mini-Omni 4.5 9.7 4.6 9.2
Mini-Omni2 4.8 9.8 4.7 9.4
Table 2: Comparison of the model’s ASR with the base model used. (* our reproduced evaluation
result.)
According to the speech recognition results in Table 2, it can be observed that the accuracy of
Mini-Omni2 shows a slight decline after adding the visual modality compared to Mini-Omni. This
phenomenon may be attributed to the relative reduction in the proportion of data. Moreover, in
comparison with the decoder of the whisper module employed by the model, the Mini-Omni2 model
9
outperforms Whisper on the librispeech-other dataset. This demonstrates that our training process
has enhanced the robustness of the model in speech recognition.
4.5 Case Study

Here we present some use cases from Mini-Omni2.
Sample 1 Sample 3
Q: what can you see? (Audio)

A: Based solely on the image, the focus is on the two golden retriever dogs, which are the central subjects. The
dog on the left is slightly out of focus, suggesting it is the main subject of the photograph. The dog on the right is in
sharp focus, with its mouth open, tongue visible, and the tongue extending out, which could indicate that it is panting
or smiling. The background is blurred, but it appears to be an outdoor setting with greenery, possibly a garden or a
park, which is typical for a dog's habitat. The lighting is bright and natural, suggesting it might be daytime.
Sample 2
Q: 'What does the sign say? and what should I do?' (Audio)
Q: Omni, What plant is this? (Audio)
A: The sign states "STOP" in bold, capital letters. To address the situation, you should stop at the intersection of
the two directions, which is indicated by the red arrow pointing to the right. If you are facing the intersection, you A: This is a cactus, a common desert plant. However, without more information or a closer view,
should pull over to the right and proceed with your exit. If you are facing the opposite direction, you should pull over I'm afraid I can't determine the exact species of the cactus.
to the left and proceed with your exit. It is important to note that the sign is a universal warning sign, so it is not
applicable to all situations.
Figure 6: Use cases of Mini-Omni2
5 Limitations
We believe the following aspects are worth exploring and improving: 1. Scaling of model and data
size. Mini-Omni2 aims to train small models with limited resources, and we believe that more data
and compute can greatly enhance its capabilities. 2. Improve style control and diversity of audio
output (emotion, naturalness, timbre, accent, and singing). 3. A richer mechanism for semantic
interruptions.
6 Conclusion
In this paper, we present Mini-Omni2, a unified multi-modal language model with capabilities in text,
speech, vision, end-to-end streaming audio output, and duplex interaction. Our goal is to reproduce an
open-source GPT-4o model, and to our best knowledge, our work is also one of the closest in terms of
functionality. We use multiple pretrained encoders as the vision and speech encoders and align them
with the language model to extend the modalities. Furthermore, we propose a three-phase modality
alignment and expansion training process to achieve the desired capabilities of the model. We also
explore a robust method for duplex interaction modeling and introduce our data construction and
interruption mechanism. All models and datasets will be open-sourced, and we hope Mini-Omni2
can serve as a reference for future research.
10
References
Lightning AI. Litgpt. https://github.com/Lightning-AI/litgpt, 2023.
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel
Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language
model for few-shot learning. Advances in neural information processing systems, 35:23716–23736,
2022.
Anthropic. https://huggingface.co/datasets/anthropic/hh-rlhf, 2024.
Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework
for self-supervised learning of speech representations. Advances in neural information processing
systems, 33:12449–12460, 2020.
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang
Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities.
arXiv preprint arXiv:2308.12966, 2023.
Qian Chen, Yunfei Chu, Zhifu Gao, Zerui Li, Kai Hu, Xiaohuan Zhou, Jin Xu, Ziyang Ma, Wen
Wang, Siqi Zheng, et al. Lauragpt: Listen, attend, understand, and regenerate audio with gpt. arXiv
preprint arXiv:2310.04673, 2023.
Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv,
Jinzheng He, Junyang Lin, et al. Qwen2-audio technical report. arXiv preprint arXiv:2407.10759,
2024.
Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, and
Alexandre Défossez. Simple and controllable music generation. Advances in Neural Information
Processing Systems, 36, 2024.
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang,
Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language
models with instruction tuning, 2023. URL https://arxiv.org/abs/2305.06500.
datashare. https://datashare.ed.ac.uk/handle/10283/2651, 2024.
Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. High fidelity neural audio
compression. arXiv preprint arXiv:2210.13438, 2022.
Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou,
Edouard Grave, and Neil Zeghidour. Moshi: a speech-text foundation model for real-time dialogue.
Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Songyang Zhang,
Haodong Duan, Wenwei Zhang, Yining Li, et al. Internlm-xcomposer2-4khd: A pioneering
large vision-language model handling resolutions from 336 pixels to 4k hd. arXiv preprint
arXiv:2404.06512, 2024.
Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng,
Yue Gu, Ziyang Ma, et al. Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer
based on supervised semantic tokens. arXiv preprint arXiv:2407.05407, 2024.
Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, and Yang Feng. Llama-omni:
Seamless speech interaction with large language models. arXiv preprint arXiv:2409.06666, 2024.
Chaoyou Fu, Haojia Lin, Zuwei Long, Yunhang Shen, Meng Zhao, Yifan Zhang, Xiong Wang, Di Yin,
Long Ma, Xiawu Zheng, et al. Vita: Towards open-source interactive omni multimodal llm. arXiv
Google. https://deepmind.google/technologies/gemini/, 2024.
Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-
training for unified vision-language understanding and generation. In International conference on
machine learning, pages 12888–12900. PMLR, 2022.
11
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction
tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pages 26296–26306, 2024.
Ziyang Ma, Yakun Song, Chenpeng Du, Jian Cong, Zhuo Chen, Yuping Wang, Yuxuan Wang, and
Xie Chen. Language model can listen while speaking. arXiv preprint arXiv:2408.02622, 2024.
meta. llama3.1, 2024. URL https://llama.meta.com/.
Mihaiii. https://huggingface.co/datasets/mihaiii/qa-assistant-2, 2024a.
Mihaiii. https://huggingface.co/datasets/mihaiii/triviamultichoice, 2024b.
Mihaiii. https://huggingface.co/datasets/mihaiii/triviasinglechoice, 2024c.
Eliya Nachmani, Alon Levkovitch, Roy Hirsch, Julian Salazar, Chulayuth Asawaroengchai, Soroosh
Mariooryad, Ehud Rivlin, RJ Skerry-Ryan, and Michelle Tadmor Ramanovich. Spoken question an-
swering and speech continuation using spectrogram-powered llm. arXiv preprint arXiv:2305.15255,
2023.
Openai. https://openai.com/index/hello-gpt-4o/, 2024a.
Openai. https://openai.com/index/gpt-4v-system-card/, 2024b.
OpenAssistan. https://huggingface.co/datasets/openassistant/oasst1, 2024.
OpenOrca. https://huggingface.co/datasets/open-orca/openorca/.
Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, and Ronan Collobert. Mls: A
large-scale multilingual dataset for speech research. arXiv preprint arXiv:2012.03411, 2020.
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal,
Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual
models from natural language supervision. In International conference on machine learning, pages
8748–8763. PMLR, 2021.
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever.
Robust speech recognition via large-scale weak supervision. In International conference on
machine learning, pages 28492–28518. PMLR, 2023.
Paul K Rubenstein, Chulayuth Asawaroengchai, Duc Dung Nguyen, Ankur Bapna, Zalán Borsos,
Félix de Chaumont Quitry, Peter Chen, Dalia El Badawy, Wei Han, Eugene Kharitonov, et al.
Audiopalm: A large language model that can speak and listen. arXiv preprint arXiv:2306.12925,
2023.
sayan1101. https://huggingface.co/datasets/sayan1101/identity-finetune-data, 2024.
Hubert Siuzdak. https://github.com/hubertsiuzdak/snac/, 2024.
David Snyder, Guoguo Chen, and Daniel Povey. Musan: A music, speech, and noise corpus. arXiv
Tianxiang Sun, Xiaotian Zhang, Zhengfu He, Peng Li, Qinyuan Cheng, Xiangyang Liu, Hang Yan,
Yunfan Shao, Qiong Tang, Shiduo Zhang, et al. Moss: An open conversational large language
model. Machine Intelligence Research, pages 1–18, 2024.
H Touvron, T Lavril, G Izacard, X Martinet, MA Lachaux, T Lacroix, B Rozière, N Goyal, E Hambro,
F Azhar, et al. Open and efficient foundation language models. Preprint at arXiv. https://doi.
org/10.48550/arXiv, 2302, 2023.
vicgalle. https://huggingface.co/datasets/vicgalle/alpaca-gpt4, 2024.
Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing
Liu, Huaming Wang, Jinyu Li, et al. Neural codec language models are zero-shot text to speech
synthesizers. arXiv preprint arXiv:2301.02111, 2023a.
12
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu,
Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the
world at any resolution. arXiv preprint arXiv:2409.12191, 2024.
Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang,
Lei Zhao, Xixuan Song, et al. Cogvlm: Visual expert for pretrained language models. arXiv
preprint arXiv:2311.03079, 2023b.
Zhifei Xie and Changqiao Wu. Mini-omni: Language models can hear, talk while thinking in
streaming. arXiv preprint arXiv:2408.16725, 2024.
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li,
Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv preprint
arXiv:2407.10671, 2024.
Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J Weiss, Ye Jia, Zhifeng Chen, and Yonghui Wu.
Libritts: A corpus derived from librispeech for text-to-speech. arXiv preprint arXiv:1904.02882,
2019.
Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan,
Ge Zhang, Linyang Li, et al. Anygpt: Unified multimodal llm with discrete sequence modeling.
Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu.
Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities.
arXiv preprint arXiv:2305.11000, 2023a.
Xin Zhang, Dong Zhang, Shimin Li, Yaqian Zhou, and Xipeng Qiu. Speechtokenizer: Unified speech
tokenizer for speech large language models. arXiv preprint arXiv:2308.16692, 2023b.
Yu Zhang, Wei Han, James Qin, Yongqiang Wang, Ankur Bapna, Zhehuai Chen, Nanxin Chen, Bo Li,
Vera Axelrod, Gary Wang, et al. Google usm: Scaling automatic speech recognition beyond 100
languages. arXiv preprint arXiv:2303.01037, 2023c.
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: En-
hancing vision-language understanding with advanced large language models. arXiv preprint
arXiv:2304.10592, 2023.
13

Mini-Omni2: Towards Open-Source GPT-4o With Vision, Speech and Duplex Capabilities

Uploaded by

Copyright:

Available Formats

Mini-Omni2: Towards Open-Source GPT-4o With Vision, Speech and Duplex Capabilities

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Mini-Omni2: Towards Open-Source GPT-4o With Vision, Speech and Duplex Capabilities

Uploaded by

Copyright:

Available Formats

Mini-Omni2: Towards Open-source GPT-4o with

Vision, Speech and Duplex Capabilities

Zhifei Xie♠♣ Changqiao Wu♠

“ Are you at seaside ? ”

Status tokens “Stop Omni ”

Mini-Omni Language Modeling

Audio Adapter Embedding Vision Adapter Embedding

Audio Encoder❄ Vision Encoder ❄ Omni

wav text image listen

Figure 1: The Mini-Omni2 model architecture.

(d Audio input text response

Figure 4: The special multi-layer vocabulary construction of Mini-Omni2.

3.2 Multimodal Languague Modeling

wav image wav image text wav

Mini-Omni Mini-Omni Mini-Omni

Vision Adapter� Embedding Audio Adapter� Audio Adapter � Embedding Embedding

Vision Encoder ❄ Audio Encoder ❄ Audio Encoder❄

image text wav wav text wav

Figure 5: Mini-Omni2’s three-stage training phases

3.3 Training Strategies

4 Data and Evaluation

4.3 Data Construction

4.4 Experimental Results

Method test-clean test-other dev-clean dev-other

4.5 Case Study

Q: what can you see? (Audio)

Figure 6: Use cases of Mini-Omni2

You might also like