Article

Implementation of an Automatic Meeting Minute Generation System Using YAMNet with Speaker Identification and Keyword Prompts

1 Department of Communications Engineering, Feng Chia University, Taichung City 407, Taiwan
2 Department of Information Communication, Asia University, Taichung City 413, Taiwan
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(13), 5718; https://doi.org/10.3390/app14135718
Submission received: 26 May 2024 / Revised: 21 June 2024 / Accepted: 27 June 2024 / Published: 29 June 2024
(This article belongs to the Topic Artificial Intelligence Models, Tools and Applications)

Featured Application

The proposed system can automatically generate conference/meeting minutes with labeled speakers and provide keyword prompts. Hence, the proposed system reduces the heavy task of recording the meeting minutes and helps participants stay focused during meetings.

Abstract

Producing conference/meeting minutes requires a person to simultaneously identify each speaker and the speaking content throughout the meeting. This recording process is a heavy task, and reducing its workload is valuable for most people. In addition, providing conference/meeting highlights in real time is helpful to the meeting process. In this study, we implement an automatic meeting minutes generation system (AMMGS) for recording conference/meeting minutes. A speech recognizer transforms the speech signals into the conference/meeting text, so the proposed AMMGS reduces the effort of recording the minutes; all meeting members can concentrate on the meeting, and taking minutes manually becomes unnecessary. The AMMGS includes speaker identification for Mandarin Chinese speakers, keyword spotting, and speech recognition. Transfer learning on YAMNet lets the network identify specified speakers, so the proposed AMMGS can automatically generate conference/meeting minutes with labeled speakers. Furthermore, the AMMGS applies the Jieba segmentation tool for keyword spotting: the system counts the frequency of occurrence of the segmented words, and keywords are determined from the most frequent ones. These keywords help attendees stay with the agenda. The experimental results reveal that the proposed AMMGS accurately identifies speakers and recognizes speech. Accordingly, the AMMGS can generate conference/meeting minutes while keywords are spotted effectively.

1. Introduction

Meetings are often needed to discuss problems, and the most troublesome part of a meeting is taking the minutes. In a lengthy agenda, the participant keeping the record needs to concentrate fully to avoid missing essential matters, and dedicating a person to this task wastes human resources. Solving the problem of how to quickly and automatically generate the content of the meeting minutes is therefore crucial. During a meeting, many events also interrupt its progress. Participants may temporarily leave the meeting for a variety of reasons; when they re-enter, it takes a while to catch up on the progress and the topic of discussion before continuing to participate. Hence, being able to record meetings with keyword reminders is essential. Suppose a system can automatically generate keywords and meeting minutes. In that case, returning participants can quickly grasp the critical points of the meeting, which helps them re-enter the discussion. This study aims to implement a system for automatically generating meeting minutes; meanwhile, the system can simultaneously identify the speakers and meeting keywords.
During a conference/meeting, accurately segmenting speech for subsequent speaker identification and speech recognition is crucial amidst various background noises. Hershey et al. [1] used various CNN architectures to classify the soundtracks of a dataset with 5.24 million hours of training video containing 30,871 video-level labels. The convolutional neural network (CNN) is effective in image recognition and audio classification. That study examined various deep-learning neural networks, including AlexNet [2], VGG [3], Inception [4], and ResNet [5], and investigated the effect of varying the size of the label vocabulary and the training sets. The experimental results reveal that CNNs analogous to those used in image classification perform well in audio classification, and increasing the training and label sets improves performance. For acoustic event classification, a model using embeddings from those classifiers performs better than raw features on the evaluated audio set [6].
Many novel studies have been presented for speaker identification [7,8,9,10,11]. Kabir et al. [7] described the main aspects of automatic speaker recognition, such as speaker verification, identification, and diarization. The performance of current speaker recognition systems was investigated in this survey, and a few unsolved challenges, limitations, and possible improvement methods for speaker recognition were presented. Snyder et al. [8] used data augmentation to improve the performance of deep neural network (DNN) embeddings for speaker recognition. The DNN is trained to discriminate speakers so that the network can map variable-length utterances to fixed-dimensional embeddings called x-vectors; noise and reverberation are added for data augmentation. Comparing x-vectors and i-vectors on the Speakers in the Wild and NIST SRE 2016 Cantonese datasets shows that data augmentation benefits the x-vectors while it is not helpful to the i-vectors. Jahangir et al. [9] proposed fusing Mel-frequency cepstral coefficients (MFCC) and time-based features (T-MFCC) to improve the performance of a text-independent speaker identification system. The T-MFCC features were fed to a deep neural network (DNN), and the results reveal that they outperform the baseline MFCC and time-domain features on the LibriSpeech dataset [6]. Salvati et al. [10] utilized time-domain and frequency-domain features to increase the robustness of speaker identification in noisy environments. This method uses deep neural networks, comprising CNNs and fully connected layers, to analyze raw waveforms and cepstral coefficients. Hamsa et al. [11] proposed an end-to-end framework for speaker identification that utilizes a pre-trained DNN mask for voice segregation and a speech VGG. Evaluations on the Ryerson audio–visual dataset confirm that the technique works well under various emotional conditions. Nassif et al. [12] proposed using CapsNet for speaker identification in emotional speech signals; this model improves on CNNs by exploiting the spatial association among low-level features. Tsalera et al. [13] presented a study on the performance of pre-trained CNNs for sound classification, where CNNs designed for image recognition were extended toward sound recognition by transfer learning. The investigated pre-trained CNNs include GoogLeNet, ShuffleNet, SqueezeNet, YAMNet, and VGGish.
During the meeting, automatically generating meeting records through speech recognition is essential. Nedjah et al. [14] proposed using an ensemble of neural network experts for speech recognition, where each expert considers a delimited area of the decision space and a recurrent neural network is applied as post-processing. This method investigates the clustering of similar phonetic classes and the imbalanced distribution of samples in the training set to improve accuracy. Almadhor et al. [15] proposed a spatial–temporal dysarthric speech recognition system. It uses a spatial CNN and a multi-head attention transformer to extract speech features, and the transformer learns the phoneme shapes. To overcome the small size of the speech database, this method employs transfer learning together with synthetic data generation to maintain performance. Wang et al. [16] developed a multi-modal Mandarin corpus containing air-conducted and bone-conducted speech to form a large corpus for speech recognition. They then utilized a multi-modal conformer speech recognizer to extract semantic embeddings from the air-conducted and bone-conducted speech with a conformer encoder and a transformer decoder, and the semantic embeddings of the two speech sources are fused to improve the recognition performance. Cheng et al. [17] proposed integrating an attention-based end-to-end model, a hybrid Gaussian mixture model, and a hidden Markov model for speech recognition. In addition, frame-wise time alignment is used to capture word time stamps, yielding lower latency for online recognition. Yolwas and Meng [18] proposed a supervised and unsupervised multi-task learning model for speech recognition. This method uses the pre-trained wav2vec 2.0 model as a shared encoder and incorporates a generative adversarial network into an end-to-end network, providing a low-resource-language speech recognition solution. Wei et al. [19] proposed using a cross-modal conformer encoder/decoder for conversational speech recognition; this method combines pre-trained speech and text models through an encoder and a mask, yielding rich speech-context extraction.
This study implements a system for automatically generating meeting minutes; meanwhile, the system can identify the speakers and meeting keywords simultaneously. The system uses a pre-trained Yet Another Mobile Network (YAMNet) convolutional neural network [20] incorporated with transfer learning, enabling YAMNet to identify speakers with a small amount of corpus. The meeting minutes are completed through speech recognition by the Google Speech-to-Text API.
The Jieba word segmentation tool [21] generates keywords based on the speech recognition results. Accordingly, the proposed system integrates speaker recognition, speech recognition, and keyword recognition for generating meeting minutes and keyword hints during the conference/meeting. In addition, a graphical user interface (GUI) is provided for practical applications. The major contribution of this study is the integration of the advanced technologies of speaker recognition, speech recognition, keyword prompting, and automatic generation of meeting minutes into an application system whose output can be automatically archived.
The remainder of this paper is structured as follows: Section 2 presents the proposed system for generating conference/meeting minutes. Section 3 outlines the experimental findings, while Section 4 provides the concluding remarks.

2. Proposed Conference/Meeting Minute Generation System

Figure 1 shows the block diagram of the proposed automatic meeting minute generation system (AMMGS). The AMMGS captures the speech signals through a microphone and keeps sampling the speech frame by frame. The trained Speaker YAMNet then performs speaker identification on the sampled speech, while speech recognition is performed through the Google speech recognition service. Moreover, the Jieba Chinese word segmentation tool [21] segments the recognized text, and the resulting keywords are displayed in the graphical user interface (GUI). The speech-recognized meeting minutes are generated and can be downloaded.

2.1. Speaker YAMNet

YAMNet is an 86-layer deep learning neural network: layer 1 is the image input layer, layers 2 to 83 are convolutional and pooling layers, layer 84 is the fully connected layer, layer 85 is the softmax activation, and layer 86 is the classification output layer. YAMNet therefore has a deep stack of convolutional layers, which can effectively analyze the spectral variations in the spectrogram so that the fully connected layer can recognize the speakers. YAMNet is a pre-trained 86-layer model; each layer's detailed function and specification are shown in Table A1.
YAMNet is pre-trained on AudioSet, an open-source corpus, to learn 521 categories, and the model can run on mobile devices. To match the speech spectral features with human auditory perception, YAMNet uses Mel-scale spectral features and combines several frames to form a Mel spectrogram, which is used as the input image of YAMNet.
In this study, we use the LibriSpeech open-source corpus as the training data; it is an English corpus containing 1000 h of speech sampled at 16 kHz after careful alignment and segmentation. The corpus is categorized into clean and noisy speech. To speed up model training, we use only 100 h of clean data for YAMNet training, setting 80% of the data as the training set and the remaining 20% as the validation set. The frame length is 1600 samples, and four frames are grouped together to form a short speech segment, which is helpful for viewing the trajectory changes. Each speech frame is transformed into the frequency domain by the fast Fourier transform (FFT), given as
S(m, f) = \sum_{t=0}^{N-1} s(t + mM)\, h(t)\, e^{-j 2\pi f t / N}    (1)
where f denotes the frequency index, h(t) is the Hanning window of size N (512), and M (256) is the frame shift.
In turn, the spectral magnitude is calculated. Filtering the spectral magnitude with the Mel filter banks yields the Mel spectrogram S_{Mel}(m, f), which can be extracted by the melSpectrogram function in Matlab (version R2024a or later). To prevent the log Mel spectrum from becoming infinite, a small value ε is added inside the logarithm, given as
\hat{S}_{Mel}(m, f) = \log\big( S_{Mel}(m, f) + \epsilon \big)    (2)
where ε is empirically chosen to be 0.001.
The log Mel spectrogram \hat{S}_{Mel}(m, f) is input into the YAMNet neural network. After the features are extracted by the convolutional layers, the result goes to the pooling layer to retain the most salient speaker features in the speech signal. Finally, the fully connected layer acts as a classifier to identify the speakers.
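The paper computes the log Mel spectrogram with Matlab's melSpectrogram function; the following is a rough Python equivalent using librosa, written as an illustrative sketch under the paper's settings (Hann window, N = 512, frame shift M = 256, ε = 0.001) and an assumed 64 Mel bands to match YAMNet's 96 × 64 input. Details such as filter-bank normalization may differ from the Matlab implementation.

```python
import numpy as np
import librosa

def log_mel_spectrogram(signal, sr=16000, n_fft=512, hop_length=256,
                        n_mels=64, eps=1e-3):
    """Log Mel spectrogram following Equations (1) and (2)."""
    # Mel-filtered magnitude spectrogram of the Hann-windowed STFT, Equation (1)
    mel = librosa.feature.melspectrogram(
        y=signal, sr=sr, n_fft=n_fft, hop_length=hop_length,
        window="hann", n_mels=n_mels, power=1.0)
    # Logarithmic compression with a small offset epsilon, Equation (2)
    return np.log(mel + eps)

# Example: a 0.4 s segment (four 1600-sample frames) of silence-padded audio.
segment = np.zeros(4 * 1600, dtype=np.float32)
features = log_mel_spectrogram(segment)   # shape: (n_mels, number of frames)
```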
To achieve Chinese speaker identification, we first train the YAMNet model with the English LibriSpeech open-source corpus and then adapt the pre-trained YAMNet model through transfer learning, where the 84th–86th network layers are re-trained with a Chinese corpus to identify Mandarin Chinese speakers. This corpus is divided into 80% for the training set and 20% for the validation set.
VGGish [1] is another deep audio CNN based on VGG. It can also be applied to speaker identification through transfer learning. The comparisons between YAMNet and VGGish are shown in Table 1.
VGGish extracts audio features based on the architecture of the VGG network, and its main structure consists of convolutional layers and fully connected layers: four convolutional blocks, each containing two to three layers, for a total of 13 convolutional layers, plus three fully connected layers.
YAMNet, a lightweight model based on MobileNet V1, is a practical choice for fast recognition applications. Its 26 convolutional layers, primarily depthwise-separable, ensure efficient feature extraction, and the architecture is completed with a single fully connected layer, enhancing its practicality.
In terms of the number of layers, YAMNet has more layers than VGGish, mainly because YAMNet uses depthwise-separable convolutional layers to improve computational efficiency and reduce the number of parameters, which makes YAMNet more suitable for resource-constrained environments while maintaining performance.
As in the experiments in [13], VGGish obtained a lower performance on the UrbanSound8K, ESC-10, and Air Compressor datasets, while YAMNet achieved a better performance. The training trajectories of VGGish and YAMNet on our dataset are shown in Figure 2.
As shown in Figure 2, the accuracy of VGGish reaches 99.0% and that of YAMNet reaches 98.71%, which is comparable to the performance of VGGish. Although YAMNet trains far fewer parameters, its speaker identification performance is comparable to that of VGGish; thus, YAMNet is used as the framework in this paper. In addition, many other pre-trained audio neural networks, such as OpenL3, WaveNet, AudioSet, DeepSpeech, and SoundNet, could be applied to speaker identification with transfer learning.

Transfer Learning on the YAMNet

Transfer learning can effectively utilize existing knowledge to solve new tasks. Its main advantage lies in its ability to train neural networks with a small amount of data, and its primary functions include knowledge transfer from pre-trained models, feature extraction, fine-tuning, avoiding overfitting, and accelerating the training process.
Regarding knowledge transfer from the pre-trained model, the pre-trained YAMNet is trained on the large and diverse LibriSpeech open-source corpus and has already learned a large number of audio features. Therefore, when transferring to the new task of Chinese speaker recognition, these learned English features can be applied directly to the new Chinese data, significantly reducing the need for Mandarin Chinese speech data. For feature extraction, transfer learning uses the first 83 layers of the pre-trained YAMNet as the feature extraction network and then trains on the extracted features. Since the first 83 layers have already learned general speech features, the 84th–86th layers trained by transfer learning only need to focus on learning the characteristics of the Chinese speakers and thus require less Mandarin Chinese speech data.
In fine-tuning, the convolutional layers in layers 2–83 of the pre-trained YAMNet are fixed, and only the neural network weights in layers 84–86 are adjusted. The knowledge YAMNet learned on the LibriSpeech dataset is retained, and only the weights in layers 84–86 need to be adjusted on the new Chinese speech data to adapt to the task of recognizing Chinese speakers. Regarding the prevention of overfitting, when training the YAMNet neural network on a relatively small Chinese speech dataset, transfer learning fixes the weights of layers 2–83 of the pre-trained model and trains only layers 84–86, which improves the generalization ability of the model and reduces overfitting. The training time can also be drastically shortened using the pre-trained YAMNet, because YAMNet has already learned many essential speech features and only needs training on the Mandarin Chinese speech dataset to achieve good results.
In implementing transfer learning, we remove layers 84 to 86 of the pre-trained YAMNet and connect new, untrained layers, including the fully connected, softmax, and classification layers. The YAMNet is then trained through transfer learning on a self-recorded Mandarin Chinese speech corpus, yielding the Speaker YAMNet model. The learning rate affects the accuracy of the model; its effect on the validation-set accuracy is shown in Table 2, and the training trajectory is shown in Figure 3. The results show that setting the learning rate to 0.0001 enables the validation-set accuracy to reach 98.89%.
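The transfer learning above is carried out in Matlab by replacing layers 84–86 of the pre-trained network. The sketch below illustrates an analogous procedure in Python with the publicly released TensorFlow-Hub YAMNet, where the frozen network supplies 1024-dimensional embeddings (the output of the global average pooling layer) and a small trainable head stands in for the re-trained layers; the hub URL, head design, and training settings other than the 0.0001 learning rate are assumptions rather than the authors' exact configuration.

```python
import tensorflow as tf
import tensorflow_hub as hub

# Frozen, pre-trained YAMNet (public TF-Hub release, not the Matlab network).
yamnet = hub.load("https://tfhub.dev/google/yamnet/1")

def yamnet_embeddings(waveform):
    """Return YAMNet's 1024-d per-patch embeddings for a mono 16 kHz waveform."""
    scores, embeddings, log_mel = yamnet(waveform)
    return embeddings

NUM_SPEAKERS = 6  # six meeting members, as in the self-recorded corpus

# Trainable classification head standing in for the re-trained layers 84-86.
speaker_head = tf.keras.Sequential([
    tf.keras.Input(shape=(1024,)),
    tf.keras.layers.Dense(NUM_SPEAKERS, activation="softmax"),
])
speaker_head.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # best rate in Table 2
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
# speaker_head.fit(train_embeddings, train_labels,
#                  validation_data=(val_embeddings, val_labels), epochs=10)
```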

2.2. Google Speech Recognition

The Google speech recognition system has excellent speech recognition capabilities, including speech adaptation, domain-specific modeling, and streaming speech recognition. Therefore, this study uses the streaming speech recognition of the Google Speech-to-Text API to accomplish the speech recognition function. We use a microphone to capture the voice and upload the audio stream to Google Cloud to quickly obtain the speech recognition result. During the experiments, the recognized language is set to Chinese, and the recognized Chinese results are concatenated to obtain the text content of the meeting.
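The paper does not list its client settings; the sketch below shows one way to call the streaming API from Python with the google-cloud-speech package, assuming 16 kHz 16-bit PCM chunks from the microphone and Taiwanese Mandarin ("cmn-Hant-TW") as the target language.

```python
from google.cloud import speech

def streaming_transcribe(audio_chunks, language_code="cmn-Hant-TW"):
    """Stream raw audio chunks (bytes) to Google Speech-to-Text and yield
    final Mandarin transcripts, which are concatenated into the minutes."""
    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code=language_code,
    )
    streaming_config = speech.StreamingRecognitionConfig(
        config=config, interim_results=True)
    requests = (speech.StreamingRecognizeRequest(audio_content=chunk)
                for chunk in audio_chunks)
    for response in client.streaming_recognize(streaming_config, requests):
        for result in response.results:
            if result.is_final:
                yield result.alternatives[0].transcript
```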
To evaluate the performance of the Google API speech recognition on Mandarin Chinese speech, the word error rate (WER) is used, given in Equation (3) as follows:
\mathrm{WER}(\%) = \frac{S + D + I}{N} \times 100\%    (3)
where S, D, I, and N denote the word numbers of substitution, deletion, insertion, and total testing, respectively.
As shown in Equation (3), the speech recognition errors arise from substitutions, deletions, and insertions; the smaller the WER, the better the speech recognition performance. In the experiments, the WER reaches 0% for some short utterances, and the average WER is 4.5%.
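Equation (3) can be evaluated with a standard Levenshtein alignment between the reference and recognized word sequences; a small illustrative implementation (not the evaluation script used by the authors) is shown below. For Mandarin, the "words" may simply be characters.

```python
def word_error_rate(reference, hypothesis):
    """WER (%) per Equation (3): substitutions, deletions, and insertions
    from a Levenshtein alignment, divided by the number of reference words."""
    ref, hyp = list(reference), list(hypothesis)
    # d[i][j]: minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1) * 100.0

# One substitution in a four-word reference gives WER = 25%.
print(word_error_rate("we discuss the budget".split(),
                      "we discussed the budget".split()))
```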
Google Speech-to-Text API is trained with millions of hours of speech data and billions of sentences to recognize multiple languages and accents. This model can accurately recognize speech content for general language sentences and specific topics due to the rich training vocabulary.

2.3. Jieba Word Segmentation

In order to realize the effect of keyword prompting, it is necessary to pre-process the recognition results. This study uses the Jieba Chinese word segmentation tool to obtain the keywords as a reference for prompting in the meeting. The main functions of the Jieba Chinese word segmentation tool are as follows:
  • Rule-based segmentation: Based on the preset dictionary, the sentences in the text are matched word by word against the dictionary, and a word is segmented out whenever a dictionary entry is found.
  • Statistical word segmentation: A higher co-occurrence frequency of adjacent characters in a text increases the probability that they form a word. The Jieba system therefore counts the frequency of occurrence of connected characters; when the frequency exceeds a certain threshold, the phrase is determined to be a word.
The Jieba segmentation tool provides three modes that users can select according to their needs, described below:
  • Precise mode: This mode is suitable for text analysis, producing the most precise segmentation of the text.
  • Full mode: This mode scans all possible words in the text; it is fast but cannot resolve segmentation ambiguities.
  • Search engine mode: This mode is suitable for search engine indexing. Based on the precise mode, long words are segmented again into shorter terms. The main advantage is that it improves the recall rate.
In this study, the precise mode is used, since the primary consideration is providing precise keywords. Through the Jieba tool, keywords can be extracted from the audio content of an actual meeting dialogue. An example of the segmented result of a dialogue is shown in Table A3, which shows a meeting about environmental protection; the extracted keywords are "huan bao", "hui yi", and "dian zi". However, one keyword, "wo men", is not suitable. Collecting common pronouns and training a neural network to learn suitable keywords would help improve the keywords' accuracy in the meeting.
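A minimal sketch of this keyword-prompting step is given below, using Jieba's precise mode and simple frequency counting on the recognized transcript. The stop-word list and the single-character filter are illustrative assumptions for suppressing unsuitable keywords such as pronouns (e.g., "wo men"); the authors' actual filtering rules are not specified.

```python
# -*- coding: utf-8 -*-
import jieba
from collections import Counter

# Illustrative stop-word list (pronouns and function words such as "wo men"/我們).
STOPWORDS = {"我們", "你們", "他們", "的", "是", "在", "和", "了"}

def extract_keywords(transcript, top_k=4):
    """Segment the recognized transcript with Jieba (precise mode) and return
    the most frequent remaining words as keyword prompts."""
    words = jieba.cut(transcript, cut_all=False)        # precise mode
    counts = Counter(w for w in words
                     if len(w.strip()) > 1 and w not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_k)]

# keywords = extract_keywords(recognized_meeting_text)
```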

3. Experimental Results

In this study, we train a model to recognize Chinese speakers, using the LibriSpeech open-source corpus to learn speaker features. Since that corpus is in English, we recorded a Chinese corpus of the meeting members, containing six Chinese speakers, each with more than 150 segments of speech data. Transfer learning on the YAMNet requires only a small quantity of self-recorded data, so a Speaker YAMNet can be obtained.
To evaluate the model’s recognition performance on the six speakers, we let multiple speakers speak consecutively in a staggered manner. The Speaker YAMNet identifies the captured audio sequence. The results are then evaluated using precision, recall, and F1 measures. Precision measures the proportion of correctly identified speakers out of all identified speakers, recall measures the proportion of correctly identified speakers out of all actual speakers, and F1 is the harmonic mean of precision and recall. These measures are used to assess the model’s speaker recognition effectiveness.
In generating meeting minutes, we combine three functions: speaker recognition, speech recognition, and keyword prompting. Speech recognition is accomplished through the Google Speech-to-Text API, and the recognition result is then used to generate keywords through the Jieba segmentation tool. Both achieve a high accuracy rate; thus, the AMMGS can be used in a practical environment.

3.1. Speaker Identification Results

In order to analyze the recognition performance of the Speaker YAMNet for the six speakers, we randomly sampled ten speech recordings of the six speakers and concatenated them, ignoring the blank segments in the concatenated speech signal, to form a test set. The recognition precision, recall rate, and F-measure of the model are evaluated on this set, in which the number of speech frames is 349 for "Blue", 449 for "Cindy", 347 for "Ha", 365 for "Yi", 412 for "Jo", and 500 for "Rabby", as shown in Table A2.
The precision rate, recall rate, and F-measure are key metrics in evaluating the performance of Speaker YAMNet. The precision rate P, for instance, is a crucial measure that can be computed by
P(\%) = \frac{N_c}{N_c + N_{fp}} \times 100\%    (4)
where Nc represents the number of correctly identified sound frames; Nfp denotes the number of false positives.
As shown in Equation (4), the larger the precision rate P, the better the performance is. In addition, the recall rate R can be expressed by
R(\%) = \frac{N_c}{N_c + N_{fn}} \times 100\%    (5)
where Nfn represents the number of false negatives.
As shown in Equation (5), the larger the recall rate R, the better the performance. The recall rate R denotes how many frames of the actual target speakers are correctly identified; that is, the more correctly identified speaker frames there are, the higher the recall rate. Sometimes, a speaker identification system may obtain a high precision rate but a low recall rate, and the overall performance is then unsatisfactory.
Conversely, a speaker identification system may achieve a high recall rate but a low precision rate, which is also unsatisfactory. Accordingly, the F-measure is employed to measure the overall performance, given as
F\text{-}measure(\%) = \frac{2 \cdot P \cdot R}{P + R}    (6)
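Equations (4)–(6) can be computed per speaker directly from a confusion matrix such as the one in Figure 5; the short sketch below is an illustrative implementation, assuming rows index the true speakers and columns the identified speakers.

```python
import numpy as np

def per_speaker_metrics(confusion):
    """Per-speaker precision, recall, and F-measure (Equations (4)-(6)) from a
    confusion matrix with true speakers as rows and identified speakers as columns."""
    confusion = np.asarray(confusion, dtype=float)
    n_c = np.diag(confusion)                    # correctly identified frames, Nc
    n_fp = confusion.sum(axis=0) - n_c          # false positives, Nfp
    n_fn = confusion.sum(axis=1) - n_c          # false negatives, Nfn
    precision = n_c / np.maximum(n_c + n_fp, 1) * 100
    recall = n_c / np.maximum(n_c + n_fn, 1) * 100
    f_measure = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return precision, recall, f_measure
```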
The speaker identification results in terms of precision rate, recall rate, and F-measure are shown in Table 3. The target speech frames of the speakers Blue and Jo are all correctly identified; thus, these two speakers obtain the best performance. Taking the speaker Ha as an example, the number of correctly identified speech frames is 342, while the number of falsely rejected frames is 5, so the recall rate is 98.56% (342/(342 + 5)). Because no frame is falsely identified as the speaker Ha, the precision rate is 100%, and the overall performance in terms of the F-measure is 99.27%. Observing the identification results, most incorrectly identified speech frames appear at the boundaries where the speaker changes, mainly involving the speakers Cindy and Yi. The primary reason is that the speaker Ha's intonation and spectral characteristics are similar to Cindy's and Yi's, causing the Speaker YAMNet to fail on these speech frames, so the speaker Ha was incorrectly identified as the other two speakers. The precision and recall rates of the Speaker YAMNet for the six speakers are all above 98.5%, which shows that the Speaker YAMNet can effectively identify the speech frames of the six speakers. The average precision rate, recall rate, and F-measure are 100%, 99.67%, and 99.83%, respectively. Accordingly, the Speaker YAMNet can accurately identify the target speakers for specific members.
Figure 4 shows the waveform plots and the identified speakers. While a participant is speaking, the Speaker YAMNet simultaneously identifies who is speaking; the identified speakers, Blue and Cindy, are displayed in the titles above the waveform plots.
Figure 5 shows the confusion matrix of the results identified by the Speaker YAMNet. Only eight of the 2422 test frames are incorrect, giving an accuracy rate of 99.67%. The misidentified frames mainly appear at the beginning of a speaker change. In addition, the similar intonation and spectral characteristics among the speakers Ha, Cindy, and Yi caused the Speaker YAMNet to misidentify the speaker Ha as Cindy (three frames) and as Yi (two frames), while the speech of the speakers Cindy and Yi was misidentified as the speaker Ha by one frame each. Therefore, confusion occurs among speakers with similar tonal and spectral characteristics. The speaker Rabby was misclassified as Jo in only one frame, so the confusion level is low.

3.2. Automatic Conference/Meeting Minute Generation System

Figure 6 shows the graphic user interface (GUI) of the proposed AMMGS. As shown in Figure 6a, area A displays real-time speech waveforms, area B displays meeting keywords, and area C displays speech recognition results. The buttons from left to right are play, stop, download, and close. As a user presses the play button, the system will start playing the speech data and perform speech recognition and speaker identification, as shown in Figure 6b. Area A shows the speech waveform; the title displays the identified speaker, and text area C shows the speech recognition results. Figure 6b shows an example where signals are identified frame by frame; the identified speaker is Cindy (marked by a rectangle), as shown in the title of this figure, while the speech recognition result is presented below.
Users can download the conference/meeting minutes by pressing the download button. The system will pop up the archive screen, and the user can select the file storage location, as shown in Figure 6c. Then, specify the file stored in the path and confirm that the download is completed, as shown in Figure 6d. The conference/meeting minutes contain the speech recognition results and the identified speaker.
The hyperlink to demonstration videos of the proposed conference/meeting minute system is given below:
As generative AI technology is developing rapidly, OpenAI services could be integrated into the proposed system in the future. Users only need to apply for an API key from OpenAI (San Francisco, CA, USA) to use the text summarization function. The summarize_text function generates the text summarization result, which helps enhance the focus of the meeting. An example of a meeting's text summarization is shown in the Appendix (Table A4).
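A minimal sketch of such a summarize_text function, using the official openai Python package, is shown below; the model name and prompt are illustrative assumptions rather than the configuration used by the authors.

```python
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

def summarize_text(minutes_text, model="gpt-4o-mini"):
    """Ask a chat model for a concise summary of the generated meeting minutes."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "Summarize the following meeting minutes as concise bullet points."},
            {"role": "user", "content": minutes_text},
        ],
    )
    return response.choices[0].message.content

# summary = summarize_text(generated_minutes)
```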
The Google API can recognize multiple languages, so it can generate recognition results in other languages as well. After collecting speech data from multilingual speakers and applying the transfer learning method introduced in Section 2.1, the system can also recognize multilingual speakers; thus, this system can be developed into a multilingual conversation log generator.

4. Conclusions

In this study, we implement an AMMGS, which combines the Speaker YAMNet for speaker identification, the Google Speech-to-Text API for speech recognition, and the Jieba word segmentation tool for keyword prompts. This system can identify the current speaker while the speech recognition results and keyword prompts are simultaneously generated in the graphical user interface. We employ YAMNet as the fundamental architecture to learn speaker characteristics from the English LibriSpeech open-source corpus. Then, we use Mandarin Chinese speech with transfer learning to teach the Speaker YAMNet to identify Chinese speakers; each Mandarin Chinese speaker contributes more than 150 pieces of speech data to a private Chinese corpus. Performing transfer learning lets YAMNet learn the characteristics of the Chinese speakers, yielding the Speaker YAMNet that identifies them. Experimental results reveal that the average precision rate, recall rate, and F-measure reach more than 99% on the self-recorded speech testing data set. Accordingly, the Speaker YAMNet can effectively identify specified speakers in a meeting, and the proposed AMMGS can be applied in practical environments.

Author Contributions

Conceptualization, C.-T.L.; methodology, C.-T.L. and L.-Y.W.; software, L.-Y.W. and C.-T.L.; validation, C.-T.L. and L.-Y.W.; formal analysis, C.-T.L.; investigation, C.-T.L. and L.-Y.W.; resources, L.-Y.W.; data curation, L.-Y.W.; writing—original draft preparation, C.-T.L. and L.-Y.W.; writing—review and editing, C.-T.L.; visualization, C.-T.L. and L.-Y.W.; supervision, C.-T.L.; project administration, C.-T.L.; funding acquisition, C.-T.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Science and Technology Council, Taiwan, grant number NSTC 111-2410-H-035-059-MY3.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

This study did not require ethical approval.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Acknowledgments

Our gratitude goes out to the reviewers for their valuable comments which have improved the quality of this paper.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A

The Detailed structure of the YAMNet (Matlab, 2023) is listed in Table A1.
Table A1. Detailed structure of the YAMNet (Matlab, 2023).
Layer Index | Layer Name | Description
1 | Input layer | Input image with the resolution: 96 × 64 × 1
2 | Convolutional layer | 32 3 × 3 × 1 convolution with stride [2 2] and the same size padding
3 | Batch Normalization | 32 channels batch normalization
4 | Activation | ReLU activation function
5 | Depth-wise convolutional layer | 32 3 × 3 × 1 convolution with stride [1 1] and the same size padding
6 | Batch Normalization | 32 channels batch normalization
7 | Activation | ReLU activation function
8 | Convolutional layer | 64 1 × 1 × 32 convolution with stride [1 1] and the same size padding
9 | Batch Normalization | 64 channels batch normalization
10 | Activation | ReLU activation function
11 | Depth-wise convolutional layer | 64 3 × 3 × 1 convolution with stride [1 1] and the same size padding
12 | Batch Normalization | 64 channels batch normalization
13 | Activation | ReLU activation function
14 | Convolutional layer | 128 3 × 3 × 1 convolution with stride [1 1] and the same size padding
15 | Batch Normalization | 128 channels batch normalization
16 | Activation | ReLU activation function
17 | Depth-wise convolutional layer | 128 3 × 3 × 1 convolution with stride [1 1] and the same size padding
18 | Batch Normalization | 128 channels batch normalization
19 | Activation | ReLU activation function
20 | Convolutional layer | 128 1 × 1 × 64 convolution with stride [1 1] and the same size padding
21 | Batch Normalization | 128 channels batch normalization
22 | Activation | ReLU activation function
23 | Depth-wise convolutional layer | 128 3 × 3 × 1 convolution with stride [2 2] and the same size padding
24 | Batch Normalization | 128 channels batch normalization
25 | Activation | ReLU activation function
26 | Convolutional layer | 256 1 × 1 × 1 convolution with stride [1 1] and the same size padding
27 | Batch Normalization | 256 channels batch normalization
28 | Activation | ReLU activation function
29 | Depth-wise convolutional layer | 256 3 × 3 × 1 convolution with stride [1 1] and the same size padding
30 | Batch Normalization | 256 channels batch normalization
31 | Activation | ReLU activation function
32 | Convolutional layer | 256 1 × 1 × 256 convolution with stride [1 1] and the same size padding
33 | Batch Normalization | 256 channels batch normalization
34 | Activation | ReLU activation function
35 | Depth-wise convolutional layer | 256 3 × 3 × 1 convolution with stride [2 2] and the same size padding
36 | Batch Normalization | 256 channels batch normalization
37 | Activation | ReLU activation function
38 | Convolutional layer | 512 1 × 1 × 256 convolution with stride [1 1] and the same size padding
39 | Batch Normalization | 512 channels batch normalization
40 | Activation | ReLU activation function
41 | Depth-wise convolutional layer | 512 3 × 3 × 1 convolution with stride [1 1] and the same size padding
42 | Batch Normalization | 512 channels batch normalization
43 | Activation | ReLU activation function
44 | Convolutional layer | 512 1 × 1 × 512 convolution with stride [1 1] and the same size padding
45 | Batch Normalization | 512 channels batch normalization
46 | Activation | ReLU activation function
47 | Depth-wise convolutional layer | 512 3 × 3 × 1 convolution with stride [1 1] and the same size padding
48 | Batch Normalization | 512 channels batch normalization
49 | Activation | ReLU activation function
50 | Convolutional layer | 512 1 × 1 × 512 convolution with stride [1 1] and the same size padding
51 | Batch Normalization | 512 channels batch normalization
52 | Activation | ReLU activation function
53 | Depth-wise convolutional layer | 512 3 × 3 × 1 convolution with stride [1 1] and the same size padding
54 | Batch Normalization | 512 channels batch normalization
55 | Activation | ReLU activation function
56 | Convolutional layer | 512 1 × 1 × 512 convolution with stride [1 1] and the same size padding
57 | Batch Normalization | 512 channels batch normalization
58 | Activation | ReLU activation function
59 | Depth-wise convolutional layer | 512 3 × 3 × 1 convolution with stride [1 1] and the same size padding
60 | Batch Normalization | 512 channels batch normalization
61 | Activation | ReLU activation function
62 | Convolutional layer | 512 1 × 1 × 512 convolution with stride [1 1] and the same size padding
63 | Batch Normalization | 512 channels batch normalization
64 | Activation | ReLU activation function
65 | Depth-wise convolutional layer | 512 3 × 3 × 1 convolution with stride [1 1] and the same size padding
66 | Batch Normalization | 512 channels batch normalization
67 | Activation | ReLU activation function
68 | Convolutional layer | 512 1 × 1 × 512 convolution with stride [1 1] and the same size padding
69 | Batch Normalization | 512 channels batch normalization
70 | Activation | ReLU activation function
71 | Depth-wise convolutional layer | 512 3 × 3 × 1 convolution with stride [2 2] and the same size padding
72 | Batch Normalization | 512 channels batch normalization
73 | Activation | ReLU activation function
74 | Convolutional layer | 1024 1 × 1 × 512 convolution with stride [1 1] and the same size padding
75 | Batch Normalization | 1024 channels batch normalization
76 | Activation | ReLU activation function
77 | Depth-wise convolutional layer | 1024 3 × 3 × 1 convolution with stride [1 1] and the same size padding
78 | Batch Normalization | 1024 channels batch normalization
79 | Activation | ReLU activation function
80 | Convolutional layer | 1024 1 × 1 × 1024 convolution with stride [1 1] and the same size padding
81 | Batch Normalization | 1024 channels batch normalization
82 | Activation | ReLU activation function
83 | Pooling layer | Global average pooling layer
84 | Fully Connected layer | 521 Fully Connected layer
85 | Activation | Softmax activation function
86 | Classification Output | 520 classes cross-entropy
Table A2 demonstrates the frame numbers of correctly identified Nc, false positive Nfp, false negative Nfn, and test Nt, respectively.
Table A2. Speaker identification results.
Speaker | Blue | Cindy | Ha | Yi | Jo | Rabby | Total
Nc | 349 | 448 | 342 | 364 | 412 | 499 | 2414
Nfp | 0 | 0 | 0 | 0 | 0 | 0 | 0
Nfn | 0 | 1 | 5 | 1 | 0 | 1 | 8
Nt | 349 | 449 | 347 | 365 | 412 | 500 | 2422
Table A3 shows an example of segmented keywords in an actual meeting.
Table A3. An example of segmented keywords.
Segmentation results‘da jia’, ‘hao’, ‘gan xie’, ‘can jia’, ‘jin tian’, ‘de’, ‘hui yi’. ‘wo men’, ‘tao lun’, ‘de’, ‘shi’, ‘gong si’, ‘zai’, ‘huan bao’, ‘fang mian’, ‘de’, ‘gai jin’, ‘cuo shi’. ‘wo men’, ‘shou xian’, ‘kan kan’, ‘neng yuan’, ‘shi yong’, ‘qing kuang’. ‘gen ju’, ‘bao gao’, ‘wo men’, ‘de’, ‘dian li’, ‘xiao hao’, ‘guo gao’, ‘xu yao’, ‘xun zhao’, ‘jie neng’, ‘fang an’. ‘wo jian yi’, ‘an zhuang’, ‘tai yang’, ‘neng ban’. ‘zhe bu jin’, ‘ke yi’, ‘jie sheng’, ‘dian fei’, ‘hai neng’, ‘jian shao’, ‘tan’, ‘pai fang’. ‘zhe ge’, ‘ti yi’, ‘hen hao’. ‘wo men’, ‘xu yao’, ‘xiang xi’, ‘de’, ‘cheng ben’, ‘he’, ‘xiao yi’, ‘fen xi’. ‘zhang’, ‘zhu guan’, ‘ni’, ‘neng’, ‘fu ze’, ‘ma’? ‘mei’, ‘wen ti’, ‘wo hui’, ‘zai’, ‘xia’, ‘zhou’, ‘ti gong’, ‘yi fen’, ‘xiang xi’, ‘bao gao’, ‘bao han’, ‘an zhuang’, ‘he’, ‘wei hu’, ‘cheng ben’. ‘wo men’, ‘hai ying’, ‘gai’, ‘jian cha’, ‘ban’, ‘gong shi’, ‘de’, ‘la ji’, ‘fen lei’, ‘qing kuang’. ‘hen duo’, ‘ke hui shou’, ‘wu pin’, ‘dou’, ‘bei’, ‘hun ru’, ‘yi ban’, ‘la ji’. ‘shi’, ‘de’, ‘wo men’, ‘xu yao’, ‘jia qiang yuan gong’, ‘de’, ‘huan bao yi shi’. ‘huo xu’, ‘ke yi’, ‘ju ban’, ‘yi ge’, ‘pei xun’, ‘jiang zuo’. ‘ci wai’, ‘wo men’, ‘ke yi’, ‘zai’, ‘mei ge’, ‘ban’, ‘gong shi’, ‘fang zhi’, ‘ming que’, ‘biao zhi’, ‘de’, ‘la ji tong’, ‘fang bian’, ‘yuan’, ‘gong fen’, ‘lei’. ‘hao’, ‘zhu yi’. ‘wo men’, ‘hai ying’, ‘gai’, ‘kao lu’, ‘jian shao zhi’, ‘zhang’, ‘shi yong’, ‘tui dong’, ‘wu zhi hua’, ‘ban gong’. ‘wo men’, ‘ke yi’, ‘qi yong’, ‘dian zi’, ‘qian ming’, ‘xi tong’, ‘jian shao’, ‘da yin’, ‘he’, ‘fu’, ‘yin’, ‘xu qiu’. ‘zhe dui’, ‘huan bao’, ‘ye’, ‘you’, ‘bang zhu’. ‘na’, ‘jiu’, ‘zhe me ding’, ‘le’. ‘Li jing li’, ‘ni’, ‘fu ze’, ‘dian zi’, ‘qian ming’, ‘xi tong’, ‘de’, ‘tiao yan’, ‘he’, ‘shi shi’. ‘ling wai’, ‘wo men’, ‘ying gai’, ‘yu’, ‘gong’, ‘ying shang’, ‘he zuo’, ‘que bao’, ‘ta men’, ‘de’, ‘huan bao biao’,‘zhun da’, ‘dao’, ‘yao qiu’. ‘dui’, ‘gong ying’, ‘lian’, ‘de’, ‘huan bao’, ‘biao zhun’, ‘ye’, ‘fei chang’, ‘zhong yao’. ‘Zhang’, ‘zhu guan’, ‘ni’, ‘neng’, ‘gen’, ‘jin’, ‘zhe xiang’, ‘gong zuo’, ‘ma’? ‘mei’, ‘wen ti’, ‘wo hui’, ‘yu’, ‘zhu yao’, ‘gong ying shang’, ‘lian xi’, ‘liao jie’, ‘ta men’, ‘de’, ‘huan bao’, ‘cuo shi’, ‘bing’, ‘ti chu’, ‘gai jin’, ‘jian yi’. ‘hao’, ‘de’, ‘gan xie’, ‘da jia’, ‘de’, ‘jian yi’, ‘he’, ‘can’, ‘yu’. ‘wo men’, ‘xiazhou’, ‘zai’, ‘kai hui’, ‘zong jie’, ‘ge xiang’, ‘gong zuo’, ‘de’, ‘jin zhan’. ‘xie xie’, ‘da jia’, ‘wo hui’, ‘li ji’, ‘kai shi’, ‘zhun bei’, ‘dian zi’, ‘qian ming’, ‘xi tong’, ‘de’, ‘tiao yan’. ‘wo’, ‘ye’, ‘hui’, ‘ma’, ‘shang’, ‘zhuo shou’, ‘tai yang neng’, ‘ban’, ‘de’, ‘cheng ben’, ‘fen xi’, ‘he’, ‘gong’, ‘ying shang’, ‘de’, ‘huan bao’, ‘biao zhun’, ‘diao cha’. ‘na’, ‘jin tian’, ‘de’, ‘hui yi’, ‘jiu’, ‘dao’, ‘zhe li’, ‘xie xie’, ‘da jia’, ‘qi dai’, ‘xia’, ‘ci’, ‘hui yi’, ‘neng’, ‘kan dao’, ‘geng duo’, ‘jin zhan’
Generated keywords | [‘wo men’, ‘huan bao’, ‘hui yi’, ‘dian zi’]
Correct keywords | [‘huan bao’, ‘hui yi’, ‘dian zi’]
Table A4 shows an example of meeting text summarization from an actual meeting.
Table A4. An example of meeting text summarization.
Meeting topic | 'tao lun gong si zai huan bao fang mian de gai jin cuo shi'.
Energy use | 'mu qian dian li xiao hao guo gao', 'xu yao jie neng fang an', 'jian yi an zhuang tai yang neng ban', 'jie sheng dian fei bing jian shao tan pai fang'.
Cost–benefit analysis | 'xu yao xiang xi de cheng ben he xiao yi fen xi', 'Zhang zhu guan fu ze xia zhou ti gong xiang xi bao gao', 'bao kuo anzhuang he wei hu cheng ben'.
Garbage classification | 'jian cha ban gong shi la ji fen lei qing kuang', 'zeng qiang yuan gong huan bao yi shi'. 'jian yi ju ban pei xun jiang zuo', 'bing zai mei ge ban gong shi fang zhi biao zhi qing xi de la ji tong'.
Reduce paper use | 'tui dong wu zhi hua ban gong', 'qi yong dian zi qian ming xi tong'. 'Li jing li fu ze dian zi qian ming xi tong de tiao yan he shi shi'.
Supply chain environmental standards | 'yu gong ying shang he zuo', 'que bao huan bao biao zhun da dao yao qiu'. 'Zhang zhu guan fu ze gen jin gong ying shang de huan bao cuo shi bing ti chu gai jin jian yi'.
Meeting summary | 'gan xie da jia de jian yi he can yu', 'xia zhou zong jie ge xiang gong zuo jin zhan'.

References

  1. Hershey, S.; Chaudhuri, S.; Ellis, D.P.W.; Gemmeke, J.F.; Jansen, A.; Moore, R.C.; Plakal, M.; Platt, D.; Saurous, R.A.; Seybold, B.; et al. CNN architectures for large-scale audio classification. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, New Orleans, LA, USA, 5–9 March 2017. [Google Scholar]
  2. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  3. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  4. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
  5. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  6. Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. Librispeech: An ASR corpus based on public domain audio books. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 19–24 April 2015; pp. 5206–5210. [Google Scholar]
  7. Kabir, M.M.; Mridha, M.F.; Shin, J.; Jahan, I.; Ohi, A.Q. A survey of speaker recognition: Fundamental theories, recognition methods and opportunities. IEEE Access 2021, 9, 79236–79263. [Google Scholar] [CrossRef]
  8. Snyder, D.; Romero, D.G.; Sell, G.; Povey, D.; Khudanpur, S. X-vectors: Robust DNN embeddings for speaker recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Calgary, AB, Canada, 15–20 April 2018. [Google Scholar]
  9. Jahangir, R.; Teh, Y.W.; Memon, N.A.; Mujtaba, G.; Zareei, M.; Ishtiaq, U.; Akhtar, M.Z.; Ali, I. Text-independent speaker identification through feature fusion and deep neural network. IEEE Access 2020, 8, 32187–32202. [Google Scholar] [CrossRef]
  10. Salvati, D.; Drioli, C.; Foresti, G.L. A late fusion deep neural network for robust speaker identification using raw waveforms and gammatone cepstral coefficients. Expert Syst. Appl. 2023, 222, 119750. [Google Scholar] [CrossRef]
  11. Hamsa, S.; Shahin, I.; Iraqi, Y.; Damiani, E.; Nassif, A.B.; Werghi, N. Speaker identification from emotional and noisy speech using learned voice segregation and speech VGG. Expert Syst. Appl. 2023, 224, 119871. [Google Scholar] [CrossRef]
  12. Nassif, A.B.; Shahin, I.; Elnagar, A.; Velayudhan, D.; Alhudhaif, A.; Polat, K. Emotional speaker identification using a novel capsule nets model. Expert Syst. Appl. 2022, 193, 116469. [Google Scholar] [CrossRef]
  13. Tsalera, E.; Papadakis, A.; Samarakou, M. Comparison of pre-trained CNNs for audio classification using transfer learning. J. Sens. Actuator Netw. 2021, 10, 72. [Google Scholar] [CrossRef]
  14. Nedjah, N.; Bonilla, A.D.; Mourelle, L.M. Automatic speech recognition of Portuguese phonemes using neural networks ensemble. Expert Syst. Appl. 2023, 229, 120378. [Google Scholar] [CrossRef]
  15. Almadhor, A.; Irfan, R.; Gao, J.; Saleem, N.; Rauf, H.T.; Kadry, S. E2E-DASR: End-to-end deep learning-based dysarthric automatic speech recognition. Expert Syst. Appl. 2023, 222, 119797. [Google Scholar] [CrossRef]
  16. Wang, M.; Chen, J.; Zhang, X.-L.; Rahardja, S. End-to-end multi-modal speech recognition on an air and bone conducted speech corpus. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 31, 513–524. [Google Scholar] [CrossRef]
  17. Cheng, G.; Miao, H.; Yang, R.; Deng, K.; Yan, Y. ETEH: Unified attention-based end-to-end ASR and KWS architecture. IEEE/ACM Trans. Audio Speech Lang. Process. 2022, 30, 1360–1373. [Google Scholar] [CrossRef]
  18. Yolwas, N.; Meng, W. JSUM: A multitask learning speech recognition model for jointly supervised and unsupervised learning. Appl. Sci. 2023, 13, 5239. [Google Scholar] [CrossRef]
  19. Wei, K.; Li, B.; Lv, H.; Lu, Q.; Jiang, N.; Xie, L. Conversational speech recognition by learning audio-textual cross-modal contextual representation. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 2432–2444. [Google Scholar] [CrossRef]
  20. MathWorks. YAMNet Neural Network. Available online: https://au.mathworks.com/help/audio/ref/yamnet.html (accessed on 18 June 2024).
  21. Sun, J. Jieba Chinese Word Segmentation Tool. Available online: https://github.com/fxsjy/jieba (accessed on 5 May 2024).
Figure 1. Block diagram of the proposed conference/meeting minute generation system.
Figure 2. Training trajectories in the validation set: (a) VGGish; (b) YAMNet, where the orange color denotes the training loss.
Figure 3. Speaker YAMNet training trajectory with a learning rate of 0.0001, where the orange color denotes the training loss.
Figure 4. Waveform of speaker identification results, where the red box denotes the current analysis frame: (a) Blue speaker; (b) Cindy speaker.
Figure 5. Confusion matrix of identified speakers by the Speaker YAMNet, where the blue and light orange colors denote correctly and falsely identified results.
Figure 6. The GUI of the proposed AMMGS. (a) Arrangement of the system; (b) speaker identification and speech recognition results chart; (c) dialog of pressing the download button; (d) results of conference/meeting minutes, where the non-English terms are the generated Chinese characters.
Table 1. Comparisons of YAMNet and VGGish.
Network | Layers | Millions of Parameters | Classification Accuracy (UrbanSound8K / ESC-10 / Air Compressor)
YAMNet | 86 | 3.7 | 96.24% / 88.06% / 100%
VGGish | 24 | 62 | 95.68% / 82.36% / 99.97%
Table 2. Comparisons of the accuracy of Speaker YAMNet validation set with different learning rates.
Learning Rate | Epoch | Iteration | Validation Accuracy
0.005 | 10 | 5950 | 97.56%
0.001 | 10 | 5950 | 97.83%
0.0005 | 10 | 5950 | 97.65%
0.0001 | 10 | 5950 | 98.89%
Table 3. Speaker identification results in terms of precision rate, recall rate, and F-measure.
Speaker | Blue | Cindy | Ha | Yi | Jo | Rabby | Speaker_YAMNet (average)
Precision (%) | 100 | 100 | 100 | 100 | 100 | 100 | 100
Recall (%) | 100 | 99.78 | 98.56 | 99.73 | 100 | 99.78 | 99.67
F-measure (%) | 100 | 99.89 | 99.27 | 99.86 | 100 | 99.89 | 99.83
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
