
A Review of Common Online Speaker Diarization Methods

1st Roman Aperdannier Faculty of Business
University of Applied Science
Ansbach, Germany
aperdannier19472@hs-ansbach.de
   2nd Sigurd Schacht Faculty of Business
University of Applied Science
Ansbach, Germany
sigurd.schacht@hs-ansbach.de
   3rd Alexander Piazza Faculty of Business
University of Applied Science
Ansbach, Germany
alexander.piazza@hs-ansbach.de
Abstract

Speaker diarization provides the answer to the question "who spoke when?" for an audio file. This information can be used to complete audio transcripts for further processing steps. Most speaker diarization systems assume that the audio file is available as a whole. However, there are scenarios in which the speaker labels are needed immediately after the arrival of an audio segment. Speaker diarization with a correspondingly low latency is referred to as online speaker diarization. This paper provides an overview of the field. First, the history of online speaker diarization is briefly presented. Next, a taxonomy and datasets for training and evaluation are given. In the sections that follow, online diarization methods and systems are discussed in detail. The paper concludes with the challenges that still need to be solved by future research in the field of online speaker diarization.

Index Terms:
online speaker diarization, GMM, i-vector, uis-rnn, self-attention

I Introduction

Speaker diarization is a machine learning task in which a model assigns audio sequences to the corresponding speakers. Speaker diarization thus answers the question "who spoke when". In the process, an audio file is divided into individual audio sequences that are separated by a speaker change or by the transition from non-speech to speech. This information is necessary for a fully-fledged transcription of audio files. Speaker diarization in combination with automatic speech recognition (ASR) is therefore used in many transcription scenarios. These scenarios include online meetings, conversations at conferences, earnings reports of public corporations, court proceedings, interviews, social media audios/videos, etc. [1]. In some of these scenarios, it is important that the speaker diarization results are available with low latency. On the one hand, the transcriber can then make direct adjustments to the transcription. On the other hand, the transcription results can be used directly for further analyses, which can in turn influence certain actions, for example whether a company stock should be sold based on the statements made in an earnings call [2]. This type of speaker diarization is known as online speaker diarization.

I-A Motivation and Background

There are already two papers [1], [3] that provide a review of speaker diarization systems. These papers discuss the historical development, evaluation metrics, different diarization methods, common datasets and current use cases of speaker diarization. However, these papers do not explicitly address online speaker diarization methods and systems. This paper aims to fill this gap by conducting a review of common online speaker diarization methods.
To give a good overview of online diarization methods, the rest of this paper is structured as follows. The next section will give a general introduction to online diarization. For this purpose, the historical development is presented, the taxonomy for the online diarization systems of this paper is introduced and common evaluation metrics are presented. Next various online diarization systems are briefly summarized. Finally, challenges of online speaker diarization will be presented and an outlook will be given.

II Online Speaker Diarization in General

Online speaker diarization systems generally work in the same way as offline speaker diarization systems and can be divided into the following sub-tasks:

  • Speech Activity Detection (SAD): This task enables the system to recognize whether an audio segment contains speech or not.

  • Segmentation: This task attempts to cut audio segments so that they only contain one speaker.

  • Clustering: In this step, the audio segments are assigned to the corresponding speakers.

This pipeline can also be seen in figure 1.

Figure 1: Diarization Pipeline

However, online diarization systems assume that the input arrives as an audio stream. This means that the entire audio file is not available for speaker diarization. Only the audio segments that have already arrived and been annotated can be included in the diarization process of the current audio segment.

II-A Historical Development

The first preliminary work on online speaker diarization was published in 1999. In their work, Daben Liu et al. [4] present an algorithm for recognizing speaker changes in real time. Their system is based on Hidden Markov Models (HMM) in combination with Gaussian Mixture Models (GMM) to define audio classes. A maximum likelihood distance is calculated to recognize a speaker change.

A few years later in 2003, Daben Liu et al. [5] published a paper on online speaker clustering algorithms that have comparable performance to offline speaker clustering algorithms. Two of the newly developed algorithms even outperform the offline hierarchical clustering chosen as a baseline.

In the following years, the first online speaker diarization systems were introduced. These systems are generally structured as follows, with minor deviations. For SAD, these systems use energy-based SAD systems or consider non-speech as a separate class. For segmentation, the audio is divided into segments of constant length, which are represented by Mel Frequency Cepstral Coefficients (MFCC). GMMs are used as audio and speaker representations. Depending on the system, the GMMs are combined with a Universal Background Model (UBM) to also represent the speaker-independent part of the acoustic features. Agglomerative Hierarchical Clustering (AHC) is the most commonly used clustering algorithm [6][7][8][9][10]. Some other systems also include the speaker's physical location in the diarization process to further improve the results [11][12].

With the introduction of i-vectors [13] and d-vectors [14], audio and speaker representations in the form of GMMs were replaced in online speaker diarization [15]. The use of neural network-based d-vectors in particular led to a leap in performance in speaker diarization [16] [17].

End-to-end systems are the latest development in the context of speaker diarization. Here, a single neural network takes over the individual sub-tasks of speaker diarization. These end-to-end systems are also used for online speaker diarization. Especially in combination with self-attention, these systems lead to an improvement in performance [18][19][20][21][22].

II-B Taxonomy of this Work

In this paper, online diarization systems are divided into two categories. All systems with a modular structure are assigned to the first category. This includes all systems that process at least one sub-task separately. All pure end-to-end systems are assigned to the second category. These include systems that process all sub-tasks with a single machine learning model. Table I provides an overview of the systems. Further systems are not considered in more detail in this paper, but are included in the assigned category.

TABLE I: Overview of online diarization systems
Modular Systems                                           End-to-End Systems
GMM based [7]                                             Frame-wise streaming [20]
I-vector [15]                                             Minivox [25]
UIS-RNN [16] [17]                                         Further EEND systems [19] [21] [22]
Turn to diarize [23]
Further modular systems [6] [8] [11] [12] [9] [10] [24]

II-C Metrics

The metrics described here are not purely online speaker diarization metrics. They are also used in offline speaker diarization.

II-C1 DER

The Diarization Error Rate (DER) [26] is the most common metric to evaluate online diarization systems. The DER is made up of three different errors. These include:

  • False alarm (FA): When speech is recognized even though there is no speech in the segment

  • Missed Speech (MS): If speech is not recognized although there is speech in the segment

  • Speaker Confusion (SC): If the wrong speaker is assigned to a segment.

The DER is then calculated from the sum of the errors divided by the duration of the whole audio file as can be seen in Equation (1).

$$DER = \frac{FA + MS + SC}{\text{Total Duration of Time}} \tag{1}$$

Normally only the DER is specified in the evaluation of diarization systems. But in some cases, the individual error components are also reported.
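As a worked example of Equation (1), the following minimal sketch computes the DER from the three error durations. The function name and the numbers are hypothetical and serve only as an illustration; all values are durations in seconds.

```python
def diarization_error_rate(false_alarm, missed_speech, speaker_confusion, total_duration):
    """Return the DER of Equation (1); all arguments are durations in seconds."""
    if total_duration <= 0:
        raise ValueError("total_duration must be positive")
    return (false_alarm + missed_speech + speaker_confusion) / total_duration


# Hypothetical example: 600 s of audio with 12 s FA, 18 s MS and 30 s SC.
der = diarization_error_rate(12.0, 18.0, 30.0, 600.0)
print(f"DER = {der:.1%}")  # prints "DER = 10.0%"
```

Reporting the three arguments separately corresponds to the case in which the individual error components are listed next to the DER.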

II-C2 JER

Another metric used in some studies is the Jaccard Error Rate (JER) [27]. The aim of this metric is to give each speaker the same weight, regardless of the speaker's speaking time. To do this, the error of each speaker is first calculated and divided by that speaker's speaking time. The resulting ratios are then averaged over all speakers. The error per speaker is the sum of FA and MS, as can be seen in Equation (2).

$$JER = \frac{1}{N} \sum_{i}^{N_{ref}} \frac{FA_i + MS_i}{TOTAL_i} \tag{2}$$
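Equation (2) can be illustrated with a short sketch. It assumes that N equals the number of reference speakers and that the per-speaker values FA, MS and TOTAL have already been determined; the function name and the numbers are hypothetical.

```python
def jaccard_error_rate(per_speaker):
    """per_speaker: list of (fa, ms, total) tuples, one tuple per reference speaker."""
    if not per_speaker:
        raise ValueError("need at least one reference speaker")
    # Error ratio of each speaker, independent of how long the speaker talked.
    ratios = [(fa + ms) / total for fa, ms, total in per_speaker]
    # Every speaker contributes with the same weight.
    return sum(ratios) / len(ratios)


# Hypothetical example: one talkative speaker (200 s) and one quiet speaker (20 s).
speakers = [(5.0, 15.0, 200.0), (2.0, 4.0, 20.0)]
jer = jaccard_error_rate(speakers)
print(f"JER = {jer:.1%}")  # prints "JER = 20.0%"
```

In this example the quiet speaker has the higher error ratio (30% versus 10%) and influences the result as much as the talkative speaker, which is the intended effect of the metric.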

II-D Datasets

The datasets for online speaker diarization are the same as for offline diarization systems. However, the datasets are handed to the online speaker diarization systems as a stream. The following datasets are used most frequently:

  • CALLHOME: This dataset contains 500 multilingual telephone sessions with 2 to 7 speakers [28].

  • 2003 NIST Rich Transcription: This dataset is a collection of different datasets. It contains a total of 13 hours of annotated multilingual speech with several speakers [29].

  • DIHARD (I/II/III): These datasets build on each other. The DIHARD datasets are multilingual and contain 1-8 speakers [30][27][31].

  • VoxConverse: This dataset contains over 50 hours of multilingual speech extracted from YouTube videos [32].

III Modular Online Speaker Diarization Systems

The following sections take a closer look at some of the modular online speaker diarization systems. The systems are arranged in the historical order in which they were published.

III-A Online Speaker Diarization with GMMs

As already described in Section II-A, the first online diarization systems were primarily built with GMMs. Fitting a GMM is essentially an iterative soft clustering. First, a Gaussian distribution is randomly initialized for each cluster. Next, the probability of belonging to each Gaussian distribution is calculated for every data point. Data points that belong to one of the Gaussian distributions with a very high probability (above a certain threshold) are assigned to this Gaussian distribution or cluster. All other data points are assigned to several clusters. New Gaussian distributions are then calculated from the assigned data points. This process is repeated until the Gaussian distributions no longer change substantially. In speaker diarization, MFCCs are taken as data points and a speaker is represented by the mean vectors and the covariance matrices of a GMM. Due to their iterative nature, GMMs are also well suited to online speaker diarization systems [33] [34].
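The iterative fitting procedure can be sketched with the standard expectation-maximization (EM) algorithm. The sketch below is a simplification under several assumptions: it uses one-dimensional data instead of MFCC vectors, a fixed instead of a random initialization (to keep the result reproducible), and soft responsibilities for every data point rather than a threshold. All names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(data, init_means, max_iter=100, tol=1e-6, var_floor=1e-3):
    """Fit a 1-D GMM with EM; assumes no component ever ends up without data."""
    k = len(init_means)
    means = list(init_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(max_iter):
        # E-step: probability of each data point belonging to each Gaussian.
        resp = []
        for x in data:
            likes = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(likes)
            resp.append([like / total for like in likes])
        # M-step: re-estimate each Gaussian from its softly assigned points.
        old_means = list(means)
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_floor, spread)
            weights[j] = nj / len(data)
        # Stop once the Gaussians no longer change noticeably.
        if max(abs(m - o) for m, o in zip(means, old_means)) < tol:
            break
    return weights, means, variances


# Hypothetical 1-D features of two well separated "speakers".
data = [-2.1, -1.9, -2.0, -2.2, -1.8, 3.0, 3.2, 2.8, 3.1, 2.9]
weights, means, variances = fit_gmm_1d(data, init_means=[-1.0, 1.0])
print([round(m, 2) for m in means])
```

For real MFCC features, an established implementation with full or diagonal covariance matrices would normally be used instead of this one-dimensional illustration.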

As an example of an online diarization system based on GMMs, the work of Markov et al. [7] will now be described in more detail. Markov et al. have developed a system that decides for each new speech segment whether it matches a known speaker GMM. If no matching speaker GMM is found, a new speaker GMM is created. Speaker GMMs are deleted if no new speech segment has been assigned to the GMM for a long time.

The system uses three different GMMs as SAD component. The first represents non-speech. The other GMMs represent voice characteristics of the two biological sexes.

Segmentation is performed using a logic that includes parameters such as minimum segment length (MSL), maximum pause in segment (MPS) and maximum speech in pause (MSP).

In the subsequent clustering component, the system decides whether a segment belongs to a new speaker or a known speaker. This decision is made using a likelihood ratio. If the segment belongs to a known speaker, the GMM of this speaker is updated with the new data. If it belongs to a new speaker, a new speaker GMM is created from the associated gender GMM. This allows the system to integrate new speakers online. By deleting speaker GMMs that are no longer used, the system is able to process audio endlessly. At a latency of 3 seconds or more, the system delivers solid performance on the evaluation dataset with a DER < 12%.
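The decision logic of such a system (known speaker versus new speaker, plus the deletion of unused models) can be sketched as follows. This is not the implementation of [7]: as a simplifying assumption, each speaker is a single one-dimensional Gaussian with fixed variance instead of a GMM, new speakers are initialized from the segment itself instead of a gender GMM, and the class name, threshold and idle limit are hypothetical.

```python
import math


def log_gauss(x, mean, var):
    """Log-density of a 1-D Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


class OnlineSpeakerTracker:
    def __init__(self, background, speaker_var=0.25, threshold=0.0, max_idle=3):
        self.background = background    # (mean, var) of a broad background model
        self.speaker_var = speaker_var  # fixed variance of every speaker model
        self.threshold = threshold      # minimum log-likelihood ratio for a match
        self.max_idle = max_idle        # segments a speaker may stay unused
        self.speakers = {}              # speaker id -> {"mean", "count", "last_seen"}
        self.next_id = 0
        self.step = 0

    def _avg_ll(self, segment, mean, var):
        return sum(log_gauss(x, mean, var) for x in segment) / len(segment)

    def process(self, segment):
        """Assign one speech segment (list of scalar features) to a speaker id."""
        self.step += 1
        background_ll = self._avg_ll(segment, *self.background)
        best_id, best_ratio = None, -math.inf
        for sid, spk in self.speakers.items():
            ratio = self._avg_ll(segment, spk["mean"], self.speaker_var) - background_ll
            if ratio > best_ratio:
                best_id, best_ratio = sid, ratio
        seg_mean = sum(segment) / len(segment)
        if best_id is None or best_ratio < self.threshold:
            # No known speaker matches well enough: create a new speaker model.
            best_id = self.next_id
            self.next_id += 1
            self.speakers[best_id] = {"mean": seg_mean, "count": 1, "last_seen": self.step}
        else:
            # Known speaker: update the model with the new segment (running mean).
            spk = self.speakers[best_id]
            spk["count"] += 1
            spk["mean"] += (seg_mean - spk["mean"]) / spk["count"]
            spk["last_seen"] = self.step
        # Delete speaker models that have not been used for a long time.
        stale = [s for s, spk in self.speakers.items() if self.step - spk["last_seen"] > self.max_idle]
        for sid in stale:
            del self.speakers[sid]
        return best_id


tracker = OnlineSpeakerTracker(background=(0.0, 4.0))
segments = [[2.0, 2.1, 1.9], [-2.0, -2.1, -1.9], [2.05, 1.95]]
labels = [tracker.process(seg) for seg in segments]
print(labels)  # prints [0, 1, 0]
```

Because every decision only depends on the current segment and the stored speaker models, the memory requirement stays bounded, which mirrors the argument that the system can process audio endlessly.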

III-B Online Diarization with i-vectors

I-vectors emerged from the problems that GMMs have with intersession variability. In this context, intersession variability means that the same speaker sounds different in different recordings. Joint Factor Analysis (JFA) was proposed to counteract this problem [35]. The JFA breaks down the vector of the GMM into individual components. The result is a speaker-independent component, a speaker-dependent component, a channel-dependent component and a residual component.

In a subsequent study, it was found that the channel dependent component also contains information about the speaker [36]. As a result, the speaker component and the channel component were combined into a common variability matrix. The column weights of this combined matrix are also referred to as i-vectors. In the following years, these i-vectors were used as a representation for audio segments in both online and offline speaker diarization systems.

Dimitriadis et al. [15] have developed an online diarization system that uses i-vectors, among other things. The system consists of a SAD, a segmentation and a clustering component. In addition, an ASR component is integrated to improve the performance of the system. The system architecture of Dimitriadis et al. is visualized in figure 2. The components of this system are described in more detail in the following.

Figure 2: Dimitriadis system architecture

The SAD component is based on a neural network and was inspired by the work of Thomas et al. [37]. The output of the SAD component is passed directly to the ASR module. This validates that no words are contained in non-speech segments. Thereby the ASR module can eliminate many false positives from the SAD component.

The segmentation component receives continuous audio segments in the form of cepstral acoustic features and the ASR transcript as input. The audio segment is then split into two segments at each word boundary. Two Gaussians are fitted with the two sub-segments and compared with the BIC algorithm [38]. If the two sub-segments differ sufficiently, a speaker change is set here. The additional ASR module ensures that speaker changes are only set at word boundaries.
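The comparison of the two sub-segments can be illustrated with the commonly used ΔBIC criterion for single Gaussians. The sketch assumes one-dimensional features (real systems use multivariate cepstral features), and the function names and the penalty weight are hypothetical. A positive value indicates that two separate Gaussians describe the data better than one, which suggests a speaker change at the candidate word boundary.

```python
import math


def variance(values, floor=1e-6):
    """Maximum-likelihood variance with a small floor for numerical stability."""
    mean = sum(values) / len(values)
    return max(floor, sum((v - mean) ** 2 for v in values) / len(values))


def delta_bic(left, right, penalty_weight=1.0):
    """Delta BIC between modelling left+right with one or with two 1-D Gaussians."""
    both = left + right
    n, n_left, n_right = len(both), len(left), len(right)
    gain = 0.5 * (
        n * math.log(variance(both))
        - n_left * math.log(variance(left))
        - n_right * math.log(variance(right))
    )
    # A 1-D Gaussian has two free parameters (mean and variance).
    penalty = penalty_weight * 0.5 * 2 * math.log(n)
    return gain - penalty


same = delta_bic([1.0, 1.1, 0.9, 1.05, 0.95], [1.0, 0.9, 1.1, 0.95, 1.05])
change = delta_bic([1.0, 1.1, 0.9, 1.05, 0.95], [5.0, 4.9, 5.1, 4.95, 5.05])
print(same < 0, change > 0)  # prints "True True"
```

Restricting the candidate split points to word boundaries, as the ASR module does, reduces the number of such comparisons per audio segment.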

For the clustering, i-vectors are generated from the homogeneous audio segments. These are then clustered using the x-means algorithm, a variation of the k-means algorithm [39]. The x-means algorithm has a linear complexity proportional to the number of audio segments to be clustered. However, in order to ensure online diarization, long audio files cannot be clustered on the entire history. Therefore, Dimitriadis et al. introduce an active window to limit the history. As a consequence, the clusterings of two different active windows are no longer consistent with each other. To solve this problem, the system provides a fast speaker label based on the clustering of the active window. A more accurate speaker label is then delivered with a slightly higher latency. The accurate label is generated using a reconciliation algorithm [40], which minimizes the Hamming distance of the speaker labels between two adjacent active windows. This approach gives the end user the option of receiving speaker labels with a short latency and updating them with the more accurate labels at a later point in time.
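The idea of reconciling the labels of two adjacent active windows can be sketched as a search for the relabeling with the smallest Hamming distance. The details of the algorithm in [40] are not given here, so the brute-force search over all label permutations below is an assumption that is only practical for a handful of speakers; the function name is hypothetical. Both label lists refer to the same overlapping segments.

```python
from itertools import permutations


def reconcile_labels(previous, current):
    """Relabel `current` so that it disagrees with `previous` on as few segments as possible."""
    labels = sorted(set(current))
    targets = sorted(set(previous) | set(labels))
    best_map, best_dist = None, None
    for perm in permutations(targets, len(labels)):
        mapping = dict(zip(labels, perm))
        # Hamming distance: number of segments whose labels differ.
        dist = sum(1 for p, c in zip(previous, current) if mapping[c] != p)
        if best_dist is None or dist < best_dist:
            best_map, best_dist = mapping, dist
    return [best_map[c] for c in current], best_dist


# Labels of the same five segments from two adjacent windows (hypothetical).
previous = [0, 0, 1, 1, 2]
current = [1, 1, 0, 0, 0]
relabeled, distance = reconcile_labels(previous, current)
print(relabeled, distance)  # prints "[0, 0, 1, 1, 1] 1"
```

For a larger number of speakers, the same objective can be solved as an assignment problem instead of enumerating permutations.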

III-C Supervised Online Clustering - UIS RNN

As already described in Section II-A, current speaker diarization systems are based on neural networks (NN). One approach is to replace the entire system with a single neural network. These end-to-end systems are discussed in more detail in Section IV. Another approach is to replace individual sub-tasks with trainable neural networks. In the online diarization system by Zhang et al. [16], the focus is on the use of a fully supervised clustering component. This system is described in more detail below.

As SAD, two simple Gaussian distributions are used to filter out audio segments that do not contain speech. Subsequently, overlapping audio segments are cut. For the segments, representations are generated in the form of d-vectors [14]. These d-vectors are used as input for supervised clustering. The algorithm unbounded interleaved-state recurrent neural network (UIS RNN) was developed as a clustering component for this system. UIS RNN receives three sets as input in the training scenario:

  • $X = (x_1, x_2, x_3, \dots, x_T)$, where each $x_t$ is the d-vector of an audio segment

  • $Y = (y_1, y_2, y_3, \dots, y_T)$, where each $y_t$ is the speaker-id of the corresponding audio segment

  • $Z = (z_2, z_3, \dots, z_T)$, where $z_t = 1$ if there is a speaker change at time $t$. In all other cases, $z_t = 0$

The aim of UIS RNN is to model the joint probability $p(X, Y, Z)$. UIS RNN can therefore be split into three individual components.
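Written out, the split into three components corresponds to the following factorization. It is given here as a sketch that is consistent with the three conditional probabilities named in the subsection headings below, where $x_{[t]}$ denotes the sub-sequence $(x_1, \dots, x_t)$.

```latex
p(X, Y, Z) = p(x_1, y_1) \prod_{t=2}^{T} p(x_t, y_t, z_t \mid x_{[t-1]}, y_{[t-1]}, z_{[t-1]})

p(x_t, y_t, z_t \mid x_{[t-1]}, y_{[t-1]}, z_{[t-1]})
  = \underbrace{p(x_t \mid x_{[t-1]}, y_{[t]})}_{\text{sequence generation}}
    \cdot \underbrace{p(y_t \mid z_t, y_{[t-1]})}_{\text{speaker assignment}}
    \cdot \underbrace{p(z_t \mid z_{[t-1]})}_{\text{speaker change}}
```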

III-C1 Speaker Change $p(z_t | z_{[t-1]})$

The speaker change component indicates how likely it is that there will be a speaker change at time $t$. In the implementation of UIS RNN, this component is implemented as a coin flip for simplification.

III-C2 Speaker Assignment $p(y_t | z_t, y_{[t-1]})$

This component models the probability of which speaker is assigned after a speaker change. UIS RNN uses the Chinese restaurant process (CRP) [41] for this task. This ensures that a speaker who has already spoken often is more likely to be assigned than a speaker who has spoken less frequently. The probability of a new speaker entering the conversation is represented by a constant probability. The reason for this is that in a single domain, the probability of a new speaker joining the conversation is fairly constant. For example, a new speaker is very unlikely to join a phone call, but very likely to join a movie.
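The speaker assignment after a detected speaker change can be sketched as follows. The sketch assumes that the weight of a known speaker is the number of turns this speaker has had so far, that the speaker who was talking just before the change is excluded, and that a constant `alpha` controls the chance of a new speaker; the function name and the example values are hypothetical.

```python
def crp_assignment_probs(turn_counts, previous_speaker, alpha=1.0):
    """Distribution over the next speaker after a speaker change (CRP-style sketch)."""
    # Known speakers are weighted by how often they have spoken so far.
    weights = {
        spk: float(count)
        for spk, count in turn_counts.items()
        if spk != previous_speaker  # a speaker change cannot return to the same speaker
    }
    # A new speaker enters with a constant, domain-dependent weight.
    weights["new"] = alpha
    total = sum(weights.values())
    return {spk: w / total for spk, w in weights.items()}


# Hypothetical conversation state: speaker C was talking just before the change.
probs = crp_assignment_probs({"A": 3, "B": 1, "C": 4}, previous_speaker="C", alpha=1.0)
print(probs)  # prints {'A': 0.6, 'B': 0.2, 'new': 0.2}
```

A small `alpha` fits domains such as phone calls, whereas a larger `alpha` fits domains such as movies, where new speakers appear frequently.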

III-C3 Sequence Generation $p(x_t | x_{[t-1]}, y_{[t]})$

The RNN variant Gated Recurrent Unit (GRU) [42] is used to implement this component. The aim of this component is to model the probability of the embedding $x_t$ based on the embedding $x_{[t-1]}$ and the speaker label $y_t$. To do this, UIS RNN creates separate GRU instances for each speaker. Each instance has a state $h_t$ at time $t$ depending on the speaker $y_t$. The output of the entire RNN is $m_t$. The current sequence $x_t$ is inferred from $m_t$.

The system is trained by maximizing the logarithm of the joint probability $p(X, Y, Z)$. The GRU hidden states of the RNN and the number of times a speaker has already spoken in the CRP are updated by a greedy MAP algorithm after each new audio segment. As a result, UIS RNN works online and can generate a speaker label with a short latency as soon as a new audio segment arrives. With the described supervised approach, UIS-RNN achieves better results in the authors' evaluation than state-of-the-art offline spectral clustering algorithms, although UIS-RNN works online.

III-D Supervised Online Clustering - Turn to Diarize

Supervised speaker diarization systems need a lot of training data in order to deliver good results. The annotation of audio data is time consuming and therefore cost intensive. As a rule, annotating a 10 minute audio file takes about 2 hours [23]. In their work, Xia et al. [23] present a system that can be trained on the basis of speaker turn labels. For this purpose, <st> tokens are inserted into the ASR transcript of the audio file at each speaker turn. This means that an exact timestamp no longer needs to be set during annotation, which speeds up labeling many times over. In addition, the semantic information in the audio data can be processed better.

The system from Xia et al. is structured as follows. A transformer transducer model takes over the ASR and speaker turn detection. The detected segments are then fed into a Long short-term memory model (LSTM) in order to calculate the corresponding d-vectors. An online variant of spectral clustering is used to cluster the d-vectors, which also includes the speaker turns for decision making. The individual system components are described below.

III-D1 Speaker Turn Detection

A recurrent neural network transducer (RNN-T) is used for this purpose. This is a supervised ASR model consisting of an audio encoder, a label encoder and a neural network for generating the final output sequences. To increase the speed of the model, the Transformer Transducer (T-T) variant of the RNN-T and a bigram label encoder are used in their work. During training, the model receives transcripts with speaker turn tokens <st> as target output and log-mel audio features as input. At inference time, the entire output of the model is ignored, except for the <st> tokens with their associated timestamps. In addition, a confidence score is calculated for each <st> token, which is important for clustering.

III-D2 Speaker Encoder

An LSTM is used as the speaker encoder, which provides a d-vector as output. The model functions independently of the ASR transcript. As input, the model receives audio segments that are separated by speaker turns. The model uses only 75% of each audio segment and ignores the information at the segment boundaries. This reduces the risk of errors being carried over from the speaker turn detection component.

III-D3 Spectral Clustering

The spectral clustering algorithm with an additional speaker turn constraint is used as the clustering component. For this purpose, a constraint matrix $Q \in \mathbb{R}^{N \times N}$ is calculated, where $N$ is the number of audio segments. The matrix is filled as follows:

  • If there is a speaker turn between the adjacent segments $i$ and $i+1$ with a confidence score greater than a threshold, these segments are labeled as cannot-link (-1).

  • If there is no speaker turn between adjacent segments, these are labeled as must-link (+1) in the constraint matrix.

  • Non-adjacent segments are assigned $Q_{ij} = 0$.

For values in the constraint matrix Q > 0, the similarity between the segments is increased in the clustering process. For values < 0, the similarity is reduced.
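The construction of the constraint matrix can be sketched in a few lines. The sketch makes two assumptions that are not stated above: the matrix is filled symmetrically, and a detected speaker turn whose confidence does not exceed the threshold leaves the pair unconstrained (0). The function name, the input format and the threshold value are hypothetical.

```python
def build_constraint_matrix(turn_confidences, threshold=0.5):
    """turn_confidences[i] is the confidence of a speaker turn between segment i
    and segment i + 1, or None if no speaker turn was detected there."""
    n = len(turn_confidences) + 1
    q = [[0] * n for _ in range(n)]  # non-adjacent segments stay at 0
    for i, confidence in enumerate(turn_confidences):
        if confidence is None:
            value = 1    # must-link: no speaker turn between the neighbours
        elif confidence > threshold:
            value = -1   # cannot-link: confident speaker turn
        else:
            value = 0    # assumption: low-confidence turns add no constraint
        q[i][i + 1] = q[i + 1][i] = value
    return q


# Four segments: a confident turn, no turn, and an uncertain turn.
Q = build_constraint_matrix([0.9, None, 0.3])
for row in Q:
    print(row)
```

Only the entries next to the main diagonal can be non-zero, so the constraints add little overhead to the similarity matrix that spectral clustering operates on.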

RNN-T and LSTM are streaming models. The bottleneck of the system is therefore the clustering component. However, the number of segments can be significantly reduced by using turn wise segments. This leads to a leap in clustering performance. As a result, spectral clustering can be performed after each arrival of a new segment and the entire system can be used in online speaker diarization scenarios.

IV End-to-End Online Speaker Diarization

As mentioned in the previous section, supervised online diarization systems also include so-called end-to-end systems. These train a single neural network to solve all sub-tasks of speaker diarization. Two of these end-to-end systems are described in more detail below.

IV-A Frame-wise Streaming End-to-End Speaker Diarization

In their work, Liang et al. [20] present an online speaker diarization system that processes the audio stream frame by frame and delivers the corresponding diarization results in real time. The system consists of an audio encoder and an attractor decoder. In this case, an attractor is the representation of a speaker. Finally, diarization results are generated by a similarity comparison of the attractors with the audio embeddings. The system architecture can be seen in figure 3.

Figure 3: System architecture of FS-EEND

The decoder is implemented with non-autoregressive self-attention. On the one hand, self-attention ensures that the speaker labels remain consistent over time. On the other hand, self-attention makes it possible to better distinguish the speakers. In order to achieve better performance, a look-ahead mechanism is added to the system. This mechanism receives some future frames as input and thus makes it possible to recognize new unknown speakers in real time. The architecture of the whole system is analyzed in more detail below.

IV-A1 Embedding Encoder

The embedding encoder consists of a linear layer and a masked transformer encoder. The frame-wise input sequence $X = (x_1, \dots, x_t, \dots, x_T)$ is binary masked so that the self-attention module in the transformer cannot include any future information.
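The binary mask can be illustrated with a minimal sketch. It assumes the common convention that an entry of 1 means "frame i may attend to frame j"; the function name is hypothetical.

```python
def causal_mask(num_frames):
    """Lower-triangular mask: frame i may only attend to frames j <= i."""
    return [[1 if j <= i else 0 for j in range(num_frames)] for i in range(num_frames)]


mask = causal_mask(4)
for row in mask:
    print(row)
# prints:
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

Each row corresponds to one frame, so the embedding of a frame never depends on frames that have not yet arrived in the stream.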

IV-A2 Look-ahead Mechanism

In most online speaker diarization scenarios, a low latency is acceptable. In addition, the quality of the embeddings and attractors can be significantly increased with a few future frames. For these reasons, it was decided to implement a look-ahead mechanism in this system. This is realized in the form of a one dimensional convolution along the time axis [20]. The latency (number of future frames) is controlled by the kernel size of the convolution.
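How the kernel size controls the latency can be sketched with a plain one-dimensional convolution. The sketch assumes scalar features per frame, a kernel that covers the current frame plus the following frames, and zero padding at the end of the stream; it is not the layer used in [20], and the function name and kernel values are hypothetical.

```python
def lookahead_conv(frames, kernel):
    """1-D convolution along time that mixes each frame with len(kernel) - 1 future frames."""
    lookahead = len(kernel) - 1
    # Zero padding so that the last frames also produce an output.
    padded = list(frames) + [0.0] * lookahead
    return [
        sum(kernel[k] * padded[t + k] for k in range(len(kernel)))
        for t in range(len(frames))
    ]


frames = [1.0, 2.0, 3.0, 4.0]
# Kernel size 2: the output for frame t needs frame t + 1, i.e. one frame of latency.
smoothed = lookahead_conv(frames, kernel=[0.5, 0.5])
print(smoothed)  # prints [1.5, 2.5, 3.5, 2.0]
```

In a streaming setting, the output for frame t can only be emitted once frame t + len(kernel) - 1 has arrived, which is the latency introduced by the look-ahead.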

IV-A3 Attractor Decoder

The decoder is based on self-attention. Self-attention based systems have a longer memory than, for example, an LSTM-based system. This means that known speakers can be correctly assigned over a longer period of time. The difference to tasks such as sequence generation is that the results in the form of attractors do not have to be output in a specific order. For this reason, a non-autoregressive attractor decoder is used in this work, which can output several attractors in parallel.

In general, EEND systems do not cope well if the audio files for inference contain more speakers than in training data [43]. For this reason, the maximum number of speakers is limited to 4 in the work of Liang et al. [20].

The decoder receives the embedding of the current frame as input and decides which attractors are updated with the embedding. The input must be different for each attractor, therefore a positional encoding is added to the embedding. To obtain the attractor $a_{s,t}$ of speaker $s$ for the current frame $t$, the decoder needs two sources of information. The first is the previous attractors of the same speaker, $a_{s,t'}$, which are provided by the masked frame self-attention (MFSA) module. The second is the attractors of the other speakers at the current frame, $a_{s',t}$, which are needed to sharpen the distance to the other attractors. This task is performed by the decoder's cross-attractor self-attention (CASA) module. The two self-attention modules are combined in a feed forward network, which provides the attractors as output. Finally, the speaker labels $y$ can be calculated using the inner product of the attractors $A_t = [a_{1,t}, \dots, a_{s,t}, \dots, a_{S,t}]$ and the embedding $e_t$. For training, the sum of two loss functions is minimized. The first loss function is the binary cross entropy (BCE) of the predicted speaker labels $\hat{Y}$ and the ground truth $Y$.

$$Loss_{d}=\frac{1}{T}\sum_{t=1}^{T}BCE(\hat{y_{t}},y_{t})\qquad(3)$$

The second loss function is the embedding similarity loss, which is calculated by the mean squared error of the cosine similarity between two embeddings and the cosine similarity of the corresponding speaker labels.

$$Loss_{e}=\frac{1}{T\times T}\sum_{j=1}^{T}\sum_{k=1}^{T}MSE(\left\langle e_{j},e_{k}\right\rangle,\left\langle y_{j},y_{k}\right\rangle)\qquad(4)$$
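The label computation and the two loss terms can be illustrated with a minimal sketch in plain Python. It is not the implementation of Liang et al. [20]. The function names are hypothetical, and three details are assumptions not stated above: a sigmoid is applied to the inner product to obtain probabilities, the BCE of a frame is averaged over the speakers, and the cosine similarity of an all-zero label vector (a silent frame) is set to 0.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def frame_labels(attractors, embedding):
    """Speaker activities for one frame from <a_{s,t}, e_t> (sigmoid is an assumption)."""
    return [sigmoid(dot(a, embedding)) for a in attractors]

def bce(y_hat, y, eps=1e-7):
    """Binary cross entropy of one frame, averaged over speakers (assumption)."""
    total = 0.0
    for p, target in zip(y_hat, y):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= target * math.log(p) + (1 - target) * math.log(1.0 - p)
    return total / len(y)

def diarization_loss(Y_hat, Y):
    """Eq. (3): mean BCE over the T frames."""
    return sum(bce(y_hat, y) for y_hat, y in zip(Y_hat, Y)) / len(Y)

def cosine(u, v):
    norm_u, norm_v = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: zero vectors (silent frames) get similarity 0
    return dot(u, v) / (norm_u * norm_v)

def embedding_similarity_loss(E, Y):
    """Eq. (4): squared difference of cosine similarities over all T x T frame pairs."""
    T = len(E)
    total = 0.0
    for j in range(T):
        for k in range(T):
            total += (cosine(E[j], E[k]) - cosine(Y[j], Y[k])) ** 2
    return total / (T * T)

def total_loss(Y_hat, Y, E):
    """Sum of both loss terms, as minimized during training."""
    return diarization_loss(Y_hat, Y) + embedding_similarity_loss(E, Y)
```

In this sketch, embeddings whose pairwise similarities match those of the label vectors give an embedding similarity loss of zero, so the second term pushes frames of the same speaker towards similar embeddings.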

The latency of the system can be controlled by the kernel size of the look-ahead component. At latencies of 1 s and above, FS-EEND achieves better results than the selected comparison systems.

IV-B Online Learning Minivox

In their work, Lin et al. [25] present a benchmark for speaker diarization systems based on online learning. In addition they develop an online learning speaker diarization system. In the following, both the benchmark and the online diarization system will be examined in more detail.

The Minivox Benchmark converts classic speaker diarization datasets into audio streams. To do this, randomly selected speakers n and their associated audio sequences m are concatenated. The parameter p controls how many of the speaker labels the system receives as feedback. It thus simulates a real user who does not always provide feedback. A result can look like this, for example:

  • $X=[1,1,1,2,2,2,1,1,3,3,3,3,\dots]$

  • $Y=[1,\_,\_,2,\_,\_,\_,1,\_,\_,3,\_,\dots]$ for $p=0.3$
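The stream construction and the role of the feedback parameter can be sketched as follows. This is a simplified illustration, not the benchmark code of Lin et al. [25]. The function name `make_stream`, the dictionary layout of the input and the frame-wise sampling of the feedback are assumptions made for this example.

```python
import random

def make_stream(utterances, n_utterances, p, seed=0):
    """Concatenate randomly chosen utterances into one stream.

    utterances: dict mapping a speaker id to a list of utterances,
                where each utterance is a list of frames (hypothetical layout).
    p:          probability that the true label of a frame is revealed.
    Returns the frames, the true labels X and the revealed labels Y,
    where None marks a frame without user feedback.
    """
    rng = random.Random(seed)
    speakers = sorted(utterances)
    frames, X, Y = [], [], []
    for _ in range(n_utterances):
        speaker = rng.choice(speakers)
        utterance = rng.choice(utterances[speaker])
        for frame in utterance:
            frames.append(frame)
            X.append(speaker)
            # Reveal the label with probability p, otherwise withhold it.
            Y.append(speaker if rng.random() < p else None)
    return frames, X, Y

# Toy data: three speakers, one utterance each, frames are placeholder strings.
toy = {1: [["a1", "a2", "a3"]], 2: [["b1", "b2"]], 3: [["c1", "c2", "c3", "c4"]]}
frames, X, Y = make_stream(toy, n_utterances=5, p=0.3)
```

With p = 1 the system sees every label, and with p = 0 it never receives feedback, which are the two extremes of the simulated user.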

Training can be carried out either with or without an oracle. The oracle specifies the maximum number of speakers as initial input. The performance of the system is measured with the DER.

The presented Online Speaker Diarization System is based on the concept of the Contextual Bandit [44]. For each decision option, the bandit has an arm that stands for an unknown reward. The bandit tries to find a good trade-off between exploiting known options and discovering new, possibly better options. In the case of the contextual bandit, the bandit is given a context (e.g. audio embeddings) that it can incorporate into its decision. The context in this work is the maximum number of speakers.

If the online diarization system starts with the maximum number of speakers N, the bandit initially receives N arms, as well as an additional arm for no speech. Without this information, the bandit starts with one arm for new speaker and one arm for no speech. The system then begins the learning and inference process. The following cases must be distinguished:

  • The system selects new speaker and the user confirms the selection: The system initializes a new speaker.

  • The system selects no speech and the user objects to the selection: The system initializes a new speaker.

  • The system selects speaker $n_{x}$ and the user objects with new speaker: The system initializes a new speaker by copying speaker $n_{x}$.
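The arm initialization and the three cases above can be sketched as follows. The sketch only covers how the set of arms grows. It does not implement the arm selection or the reward update of Lin et al. [25]. The class `ArmPool`, its methods and the zero-initialized parameter vectors are hypothetical, and it is assumed that the user's objection in the second case names a new speaker.

```python
NEW_SPEAKER = "new_speaker"
NO_SPEECH = "no_speech"

class ArmPool:
    """Keeps one parameter vector per arm (hypothetical, simplified)."""

    def __init__(self, dim, max_speakers=None):
        self.dim = dim
        self.arms = {NO_SPEECH: [0.0] * dim}
        self.n_speakers = 0
        if max_speakers is None:
            # Without oracle: one arm for "new speaker", one for "no speech".
            self.arms[NEW_SPEAKER] = [0.0] * dim
        else:
            # With oracle: N speaker arms plus the "no speech" arm.
            for _ in range(max_speakers):
                self.add_speaker()

    def add_speaker(self, init=None):
        """Create a new speaker arm, optionally copying existing parameters."""
        self.n_speakers += 1
        name = f"speaker_{self.n_speakers}"
        self.arms[name] = list(init) if init is not None else [0.0] * self.dim
        return name

    def handle_feedback(self, selected, feedback):
        """Apply the three cases; returns the new arm name or None."""
        if feedback != NEW_SPEAKER:
            return None
        if selected in (NEW_SPEAKER, NO_SPEECH):
            # Cases 1 and 2: confirmed new speaker, or objection to "no speech".
            return self.add_speaker()
        # Case 3: a known speaker was selected, so the new arm starts as a copy.
        return self.add_speaker(init=self.arms[selected])
```

Copying the parameters in the third case gives the new arm a starting point that already matched the segment, instead of starting from scratch.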

In addition, it may be the case that the user does not provide any feedback. A self-supervision module has been added to the system for this purpose. This module clusters the audio segments that have already been labeled together with the current audio segment. The cluster label is then used as a substitute for the user feedback. Online variants of GMM, k-means and KNN are used as clustering algorithms. The authors are aware that these perform poorly compared to their offline variants, but want to ensure the online suitability of the system.
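The idea of substituting missing feedback with a cluster label can be illustrated with a plain k-nearest-neighbour vote. This is a minimal sketch and not the self-supervision module of Lin et al. [25]. The function name, the Euclidean distance and the choice of k are assumptions for this example.

```python
import math
from collections import Counter

def knn_pseudo_label(labeled, query, k=3):
    """Return a substitute label for `query` by majority vote.

    labeled: list of (embedding, label) pairs seen so far.
    query:   embedding of the current, unlabeled audio segment.
    """
    # Sort the labeled segments by distance to the query and keep the k closest.
    nearest = sorted(labeled, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy example with two-dimensional embeddings.
labeled = [([0.0, 0.0], 1), ([0.1, 0.0], 1), ([5.0, 5.0], 2), ([5.1, 5.0], 2)]
pseudo = knn_pseudo_label(labeled, [0.05, 0.1], k=3)
labeled.append(([0.05, 0.1], pseudo))  # the pseudo label is stored like real feedback
```

Because the pseudo label is stored together with the real labels, early mistakes can propagate, which is one reason why such online variants tend to lag behind their offline counterparts.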

The presented online diarization system has a comparatively poor performance. However, the online learning scenario poses a particular challenge, as the system never gets to see the real labels of the audio segments unless the user confirms an assigned speaker label.

V Challenges of Online Speaker Diarization

Online speaker diarization faces a number of challenges. These include the trade-off between accuracy and latency, scarce training data, and the limited number of speakers that EEND systems can handle. The various challenges are examined in more detail below.

V-A Tradeoff between Accuracy and Latency

In many current online diarization systems, it can be observed that the error is reduced if a higher latency is accepted. The reason for this is that more audio information can be included in the decision if a larger input sequence is available. Depending on the system, this trade-off is handled differently. Coria et al. [18] use an active window that is controlled by the parameter $\lambda$. The larger $\lambda$ is chosen, the higher the latency and the lower the DER. Morrone et al. [45] use a similar approach in their work with a CSS window based on the sliding window of Chen et al. [46]. Here too, the DER decreases when the window is enlarged. Liang et al. [20] show the system some future frames via the look-ahead mechanism. As described in section IV-A, the latency is controlled via the kernel size of the convolution that makes the future frames available. Dimitriadis et al. [15] solve the problem by returning a fast label first and delivering a more accurate label later. Further examples of the accuracy-latency tradeoff can be found in [19] [47] [22].

However, it can be observed that the DER no longer drops significantly above a certain latency. This sweet spot latency differs from system to system. For Coria et al. [18] it is 3s, for Morrone et al. [45] it is a window size of 15s. Nevertheless, there is a need for further research to decouple latency from accuracy.

V-B Training Data

Current supervised online and offline diarization systems need a lot of high-quality training data to deliver good results. Such training data is only available to a limited extent in English and is usually associated with high costs [28] [29] [48]. For some languages, there is still no corresponding training data for speaker diarization. Approaches such as Turn-to-Diarize by Xia et al. [23] attempt to simplify the annotation of the training data. Lin et al. [25] try to perform the training process through online learning in parallel with inference. These are good approaches, but further research and work is needed to reduce this problem.

V-C Multispeaker in EEND Systems

Many end-to-end online diarization systems can only process a limited number of speakers. Horiguchi et al. [43] state in their work that end-to-end online diarization systems always have problems when the number of speakers at inference is higher than the number of speakers in the training process. In their work, they also propose a solution by adding blockwise clustering to the end-to-end system. However, this moves away from the actual core idea of end-to-end systems, which is to solve the entire speaker diarization task with a single neural network. In addition, the maximum number of speakers in online speaker diarization cannot be determined in advance, as the audio arrives as a stream and additional speakers may be added at a later point in time. This is a research gap that needs to be closed in the future in order to be able to handle a flexible number of speakers at inference.

VI Conclusion

Online speaker diarization is a research topic that has been dealt with for a long time. The first systems were created with GMMs. With the introduction of i-vectors, GMM-based systems were largely replaced. A short time later, embeddings of audio segments were developed in the form of d-vectors, which are generated by neural networks. However, not only audio representations were replaced by trainable modules. Other components of the speaker diarization pipeline, such as the clustering, have also been replaced by supervised approaches. The latest innovations are end-to-end online diarization systems. These take the approach of replacing the entire diarization pipeline with a single trainable model.

All these developments have continuously improved online speaker diarization. However, online speaker diarization remains a challenging topic. The limited amount of input data makes it difficult to undercut the error rate of comparable offline diarization systems. Thus, a tradeoff between accuracy and latency must always be made. Scarce training data is a problem for both offline and online speaker diarization. In end-to-end systems, the maximum number of speakers is implicitly limited by the training data used. Solutions still need to be developed for this too.

In summary, this paper provides an overview of the topic of online speaker diarization. It also shows that online speaker diarization is a current and flourishing topic that still offers a lot of potential for further research.

References

  • [1] T. J. Park, N. Kanda, D. Dimitriadis, K. J. Han, S. Watanabe, and S. Narayanan, “A Review of Speaker Diarization: Recent Advances with Deep Learning,” Nov. 2021, arXiv:2101.09624 [cs, eess]. [Online]. Available: http://arxiv.org/abs/2101.09624
  • [2] P. A. L. de Castro, “mt5b3: A Framework for Building Autonomous Traders,” CoRR, 2021.
  • [3] X. Anguera Miro, S. Bozonnet, N. Evans, C. Fredouille, G. Friedland, and O. Vinyals, “Speaker Diarization: A Review of Recent Research,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 2, pp. 356–370, Feb. 2012. [Online]. Available: http://ieeexplore.ieee.org/document/6135543/
  • [4] D. Liu and F. Kubala, “Fast speaker change detection for broadcast news transcription and indexing,” in Sixth European conference on speech communication and technology.   Citeseer, 1999.
  • [5] D. Liu and F. Kubala, “Online speaker clustering,” in 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, May 2004, pp. I–333, ISSN: 1520-6149.
  • [6] H. Aronowitz, Y. A. Solewicz, and O. Toledo-Ronen, “Online two speaker diarization,” in Odyssey 2012-The Speaker and Language Recognition Workshop, 2012.
  • [7] K. Markov and Satoshi Nakamura, “Never-ending learning system for on-line speaker diarization,” in 2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU).   The Westin Miyako Kyoto: IEEE, 2007, pp. 699–704. [Online]. Available: http://ieeexplore.ieee.org/document/4430197/
  • [8] T. Oku, S. Sato, A. Kobayashi, S. Homma, and T. Imai, “Low-latency speaker diarization based on Bayesian information criterion with multiple phoneme classes,” in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2012, pp. 4189–4192.
  • [9] G. Soldi, C. Beaugeant, and N. Evans, “Adaptive and online speaker diarization for meeting data,” in 2015 23rd European Signal Processing Conference (EUSIPCO).   Nice: IEEE, Aug. 2015, pp. 2112–2116. [Online]. Available: http://ieeexplore.ieee.org/document/7362757/
  • [10] C. Vaquero, O. Vinyals, and G. Friedland, “A hybrid approach to online speaker diarization.” in InterSpeech, 2010, pp. 2638–2641.
  • [11] J. Schmalenstroeer and R. Haeb-Umbach, “Online speaker change detection by combining BIC with microphone array beamforming,” in Interspeech 2006.   ISCA, Sep. 2006, pp. paper 1078–Wed1FoP.2–0. [Online]. Available: https://www.isca-speech.org/archive/interspeech_2006/schmalenstroeer06_interspeech.html
  • [12] ——, “Joint speaker segmentation, localization and identification for streaming audio,” in Eighth Annual Conference of the International Speech Communication Association.   Citeseer, 2007.
  • [13] G. Sell and D. Garcia-Romero, “Speaker diarization with plda i-vector scoring and unsupervised calibration,” in 2014 IEEE Spoken Language Technology Workshop (SLT).   South Lake Tahoe, NV, USA: IEEE, Dec. 2014, pp. 413–417. [Online]. Available: http://ieeexplore.ieee.org/document/7078610/
  • [14] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependent speaker verification,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   Florence, Italy: IEEE, May 2014, pp. 4052–4056. [Online]. Available: http://ieeexplore.ieee.org/document/6854363/
  • [15] D. Dimitriadis and P. Fousek, “Developing On-Line Speaker Diarization System.” in Interspeech, 2017, pp. 2739–2743.
  • [16] A. Zhang, Q. Wang, Z. Zhu, J. Paisley, and C. Wang, “Fully Supervised Speaker Diarization,” Feb. 2019, arXiv:1810.04719 [cs, eess, stat]. [Online]. Available: http://arxiv.org/abs/1810.04719
  • [17] E. Fini and A. Brutti, “Supervised online diarization with sample mean loss for multi-domain data,” Nov. 2019, arXiv:1911.01266 [cs, eess] version: 3. [Online]. Available: http://arxiv.org/abs/1911.01266
  • [18] J. M. Coria, H. Bredin, S. Ghannay, and S. Rosset, “Overlap-aware low-latency online speaker diarization based on end-to-end local segmentation,” Sep. 2021, arXiv:2109.06483 [cs, eess]. [Online]. Available: http://arxiv.org/abs/2109.06483
  • [19] E. Han, C. Lee, and A. Stolcke, “BW-EDA-EEND: Streaming End-to-End Neural Speaker Diarization for a Variable Number of Speakers,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Jun. 2021, pp. 7193–7197, arXiv:2011.02678 [cs, eess]. [Online]. Available: http://arxiv.org/abs/2011.02678
  • [20] D. Liang, N. Shao, and X. Li, “Frame-wise streaming end-to-end speaker diarization with non-autoregressive self-attention-based attractors,” Sep. 2023, arXiv:2309.13916 [cs, eess] version: 1. [Online]. Available: http://arxiv.org/abs/2309.13916
  • [21] W. Wang and M. Li, “End-to-end Online Speaker Diarization with Target Speaker Tracking,” Oct. 2023, arXiv:2310.08696 [cs, eess]. [Online]. Available: http://arxiv.org/abs/2310.08696
  • [22] Y. Xue, S. Horiguchi, Y. Fujita, S. Watanabe, and K. Nagamatsu, “Online End-to-End Neural Diarization with Speaker-Tracing Buffer,” Mar. 2021, arXiv:2006.02616 [cs, eess]. [Online]. Available: http://arxiv.org/abs/2006.02616
  • [23] W. Xia, H. Lu, Q. Wang, A. Tripathi, Y. Huang, I. L. Moreno, and H. Sak, “Turn-to-Diarize: Online Speaker Diarization Constrained by Transformer Transducer Speaker Turn Detection,” Jan. 2022, arXiv:2109.11641 [cs, eess] version: 3. [Online]. Available: http://arxiv.org/abs/2109.11641
  • [24] Y. Zhang, Q. Lin, W. Wang, L. Yang, X. Wang, J. Wang, and M. Li, “Low-Latency Online Speaker Diarization with Graph-Based Label Generation,” Jun. 2022, arXiv:2111.13803 [cs, eess]. [Online]. Available: http://arxiv.org/abs/2111.13803
  • [25] B. Lin and X. Zhang, “Speaker Diarization as a Fully Online Learning Problem in MiniVox,” Oct. 2020, arXiv:2006.04376 [cs, stat] version: 3. [Online]. Available: http://arxiv.org/abs/2006.04376
  • [26] J. G. Fiscus, J. Ajot, M. Michel, and J. S. Garofolo, “The rich transcription 2006 spring meeting recognition evaluation,” in Machine Learning for Multimodal Interaction: Third International Workshop, MLMI 2006, Bethesda, MD, USA, May 1-4, 2006, Revised Selected Papers 3.   Springer, 2006, pp. 309–322.
  • [27] N. Ryant, K. Church, C. Cieri, A. Cristia, J. Du, S. Ganapathy, and M. Liberman, “The Second DIHARD Diarization Challenge: Dataset, task, and baselines,” Jun. 2019, arXiv:1906.07839 [cs, eess]. [Online]. Available: http://arxiv.org/abs/1906.07839
  • [28] A. Canavan, D. Graff, and G. Zipperlen, “CALLHOME American English Speech,” 1997. [Online]. Available: https://catalog.ldc.upenn.edu/LDC97S42
  • [29] J. G. Fiscus, G. R. Doddington, A. Le, G. Sanders, M. Przybocki, and D. Pallett, “2003 NIST Rich Transcription Evaluation Data,” Aug. 2007. [Online]. Available: https://catalog.ldc.upenn.edu/LDC2007S10
  • [30] N. Ryant, K. Church, C. Cieri, A. Cristia, J. Du, S. Ganapathy, and M. Liberman, “First DIHARD challenge evaluation plan,” Linguistic Data Consortium, Tech. Rep., 2018.
  • [31] N. Ryant, P. Singh, V. Krishnamohan, R. Varma, K. Church, C. Cieri, J. Du, S. Ganapathy, and M. Liberman, “The Third DIHARD Diarization Challenge,” Apr. 2021, arXiv:2012.01477 [cs, eess]. [Online]. Available: http://arxiv.org/abs/2012.01477
  • [32] J. S. Chung, J. Huh, A. Nagrani, T. Afouras, and A. Zisserman, “Spot the conversation: speaker diarisation in the wild,” in Interspeech 2020, Oct. 2020, pp. 299–303, arXiv:2007.01216 [cs, eess]. [Online]. Available: http://arxiv.org/abs/2007.01216
  • [33] J. Liu, D. Cai, and X. He, “Gaussian Mixture Model with Local Consistency,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 24, no. 1, pp. 512–517, Jul. 2010. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/7659
  • [34] D. Reynolds and R. Rose, “Robust text-independent speaker identification using Gaussian mixture speaker models,” IEEE Transactions on Speech and Audio Processing, vol. 3, no. 1, pp. 72–83, Jan. 1995. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/365379
  • [35] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, “Speaker and Session Variability in GMM-Based Speaker Verification,” IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 4, pp. 1448–1460, May 2007. [Online]. Available: http://ieeexplore.ieee.org/document/4156203/
  • [36] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-End Factor Analysis for Speaker Verification,” Audio, Speech, and Language Processing, IEEE Transactions on, vol. 19, pp. 788–798, Jun. 2011.
  • [37] S. Thomas, G. Saon, M. V. Segbroeck, and S. S. Narayanan, “Improvements to the IBM speech activity detection system for the DARPA RATS program,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   South Brisbane, Queensland, Australia: IEEE, Apr. 2015, pp. 4500–4504. [Online]. Available: http://ieeexplore.ieee.org/document/7178822/
  • [38] S. S. Chen and P. S. Gopalakrishnan, “Clustering via the Bayesian information criterion with applications in speech recognition,” in Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP’98 (Cat. No. 98CH36181), vol. 2.   IEEE, 1998, pp. 645–648.
  • [39] D. Pelleg, A. W. Moore, and others, “X-means: Extending k-means with efficient estimation of the number of clusters.” in Icml, vol. 1, 2000, pp. 727–734.
  • [40] K. Church, W. Zhu, J. Vopicka, J. Pelecanos, D. Dimitriadis, and P. Fousek, “Speaker diarization: A perspective on challenges and opportunities from theory to practice,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Mar. 2017, pp. 4950–4954, ISSN: 2379-190X. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/7953098
  • [41] D. M. Blei and P. I. Frazier, “Distance Dependent Chinese Restaurant Processes,” Aug. 2011, arXiv:0910.1022 [stat]. [Online]. Available: http://arxiv.org/abs/0910.1022
  • [42] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation,” Sep. 2014, arXiv:1406.1078 [cs, stat]. [Online]. Available: http://arxiv.org/abs/1406.1078
  • [43] S. Horiguchi, S. Watanabe, P. Garcia, Y. Takashima, and Y. Kawaguchi, “Online Neural Diarization of Unlimited Numbers of Speakers Using Global and Local Attractors,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 706–720, 2023, arXiv:2206.02432 [cs, eess]. [Online]. Available: http://arxiv.org/abs/2206.02432
  • [44] S. Agrawal and N. Goyal, “Thompson sampling for contextual bandits with linear payoffs,” in International conference on machine learning.   PMLR, 2013, pp. 127–135.
  • [45] G. Morrone, S. Cornell, D. Raj, L. Serafini, E. Zovato, A. Brutti, and S. Squartini, “Low-Latency Speech Separation Guided Diarization for Telephone Conversations,” Oct. 2022, arXiv:2204.02306 [eess] version: 2. [Online]. Available: http://arxiv.org/abs/2204.02306
  • [46] Z. Chen, T. Yoshioka, L. Lu, T. Zhou, Z. Meng, Y. Luo, J. Wu, X. Xiao, and J. Li, “Continuous speech separation: dataset and analysis,” May 2020, arXiv:2001.11482 [cs, eess]. [Online]. Available: http://arxiv.org/abs/2001.11482
  • [47] Q. Wang, C. Downey, L. Wan, P. A. Mansfield, and I. L. Moreno, “Speaker Diarization with LSTM,” Jan. 2022, arXiv:1710.10468 [cs, eess, stat]. [Online]. Available: http://arxiv.org/abs/1710.10468
  • [48] C. Cieri, D. Graff, O. Kimball, D. Miller, and K. Walker, “Fisher English Training Part 2, Speech,” Apr. 2005. [Online]. Available: https://catalog.ldc.upenn.edu/LDC2005S13