Abstract
Notating or transcribing a music piece is very important for musicians: it not only helps them communicate with one another but also aids in understanding a piece, which is essential for improvisations and performances. This makes automatic music transcription systems extremely important. Every music piece can be broadly categorized into two parts, namely the lead section and the accompaniment section or background music (BGM). The BGM is very important in a piece as it sets the mood and makes a piece complete. Thus it is important to notate the BGM for properly understanding and performing a piece. One of the key components of BGM is the chord, which consists of two or more musical notes. Every composition is accompanied by a chord chart. In this paper, a long short term memory-recurrent neural network (LSTM-RNN)-based approach is presented for segregating musical chords from clips of short durations, which can aid in automatic transcription. Experiments were performed on over 46800 clips and a highest accuracy of 99.91% was obtained for the proposed system.
1 Introduction
A music piece is composed of musical notes. These notes occur in different combinations and timings, which makes melodies different. In order to study such music compositions, it is very important to notate or transcribe them. This not only helps in understanding them better but also in communicating with other musicians. The BGM of a composition is as important as the lead melody. It is the BGM which makes a piece sound complete and which goes on for almost the entire span of a piece. A change in the BGM can alter the mood of a composition and at times disrupt it completely. Thus it is very important to play the BGM flawlessly during performances to uphold the essence of a composition. One of the most important facets of the BGM is the chord, which is composed of two or more musical notes played simultaneously. Every composition has a chord chart associated with it whose transcription is essential.
Rajpurkar et al. [15] distinguished chords in real time. They used a hidden Markov model (HMM) and Gaussian discriminant analysis in addition to chroma-based features and obtained an accuracy of 99.19%. Zhou and Lerch [18] used deep learning for distinguishing chords. They worked with 317 music pieces and obtained a recall value of 0.916 using max-pooling. Cheng et al. [4] distinguished chords for music classification and retrieval with the aid of an N-gram technique and HMM. Different chord-based features like chord histogram and LCS were also involved in their experiments and a highest overall accuracy of 67.3% was obtained. Quenneville [14] discussed various aspects of automatic music transcription. He highlighted the basics of making music as well as of transcription, and described different pitch detection techniques including Fourier transform-based, fundamental frequency-based and harmonicity-based approaches, to name a few.
Bereket and Shi [3] presented a two-phase model for music transcription. In the first phase, they used acoustic modelling to detect pitches, and in the second phase these were transcribed. They worked with 138 MIDI files which were converted to audio. The training set consisted of 110 songs while the remaining were used for testing, and results as high as 99.81% were reported. Wats and Patra [17] used a non-negative matrix factorization-based technique for automatic music transcription. They worked on the Disklavier dataset and obtained good results. Benetos et al. [1] presented an overview of automatic music transcription, touching on its various applications and challenges as well as several transcription techniques. Muludi et al. [12] used frequency domain information and pitch class profiles for chord identification. Their experiments involved 432 guitar chords and an accuracy of 70.06% was obtained.
Osmalsky et al. [13] used a neural network and pitch class profiles for guitar chord distinction. Their study involved other instruments including the accordion, violin and piano. They performed instrument identification as well and obtained an error rate of 6.5% for chord identification. Benetos et al. [2] laid out different techniques and challenges involved in automatic music transcription. They discussed various pitch tracking methods including feature-based, statistical and spectrogram factorization-based approaches, among others. They also covered several types of transcription, including instrument- and genre-based transcription as well as informed transcription. Kroher and Gómez [7] attempted to automatically transcribe flamenco singing from polyphonic tracks. They extracted the predominant melody and eliminated contours of the accompaniments. Next, the vocal contour was discretized into notes, followed by assignment of a quantized pitch level. They experimented with three datasets totaling more than 100 tracks and obtained results which were better than state-of-the-art singing transcribers in terms of overall performance, onset detection and voicing accuracy. Costantini and Casali [5] used frequency analysis for chord identification. Experiments were performed with up to 4-note chords, and highest accuracies of 98%, 97% and 95% were obtained for the 2-, 3- and 4-note chords respectively.
Here, a system is proposed to distinguish chords from clips of very short duration. It works with LSTM-RNN-based classification and has the potential of aiding in the automatic transcription of background music, which is very vital. The system is illustrated in Fig. 1.
The rest of the paper is organized as follows: the details of the dataset are presented in Sect. 2, Sects. 3 and 4 describe the proposed method and its results respectively, and Sect. 5 concludes the paper.
2 Dataset
Data is a very important aspect of any experiment, and the quality of data plays a crucial part in the development of robust systems. To the best of our knowledge, there is no publicly available dataset of chords and hence we put together a dataset of our own. In the present experiment, we consider two of the most popular chords from the major family (C and G) and two of the most popular chords from the minor family, namely A minor (Am) and E minor (Em) [16]. The constituent notes of the scales of the considered chords along with the notes of the chords are presented in Table 1. The chord pairs (G-Em) and (C-Am) have common notes, which makes it difficult to distinguish them.
Volunteers were provided a Hertz acoustic guitar (HZR3801E) for playing the chords. They played different rhythm patterns and no metronome was used, to allow relaxation with respect to tempo. Volunteers further used different types of plectrums, which slightly change the sound, thereby encompassing more variation. The audios were recorded with the primary line port of a computer having a Gigabyte B150M-D3H motherboard. Further, studio ambience and the use of pre-amplifiers were avoided to reflect a real-world scenario. The audio clips were recorded in .wav format at a bitrate of 1411 kbps.
Four datasets (D1-D4) having clips of lengths 0.25, 0.5, 1 and 2 s respectively were put together from the recorded data, whose details are presented in Table 2. We worked with clips of such durations to test the efficiency of our system on short clips, which are common in the real world.
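A minimal sketch of how such fixed-length clip sets could be assembled is given below; the soundfile library and the file names are assumptions for illustration and were not necessarily part of the original recording pipeline.

```python
# Sketch: cut a recorded chord take into fixed-length clips (e.g. 0.25, 0.5, 1 or 2 s).
# Assumes the "soundfile" library and a hypothetical input file "C_major_take1.wav".
import soundfile as sf

def split_into_clips(path, clip_seconds, out_prefix):
    audio, sr = sf.read(path)             # samples and sampling rate
    clip_len = int(clip_seconds * sr)     # samples per clip
    n_clips = len(audio) // clip_len      # discard the trailing remainder
    for k in range(n_clips):
        clip = audio[k * clip_len:(k + 1) * clip_len]
        sf.write(f"{out_prefix}_{clip_seconds}s_{k:04d}.wav", clip, sr)
    return n_clips

# Example: build 1 s clips (D3-style) from one take
# split_into_clips("C_major_take1.wav", 1.0, "C_major")
```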
3 Proposed Method
3.1 Preprocessing
Framing. The clips were first subdivided into smaller segments called frames. This was mainly done to make the spectral contents stationary; they otherwise show high deviations, which makes analysis difficult. The clips were divided into 256-point frames in overlapping mode with 100 common points (overlap) between two consecutive frames [11].
Windowing. Jitters are often observed in the frames due to loss of continuity at the boundaries. These disrupt frequency-based analysis in the form of spectral leakage. To tackle this, the frames are windowed with a windowing function. Here we used the Hamming window [11], which is presented in Eq. (1):
\(w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1\)    (1)

where n is a sample point within an N-sized frame.
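A minimal sketch of the framing and windowing step is given below, assuming the clip samples are available as a NumPy array; the 256-point frame length and 100-point overlap follow the text, while the function name is illustrative.

```python
import numpy as np

FRAME_LEN = 256   # points per frame, as stated in the text
OVERLAP = 100     # common points between consecutive frames
HOP = FRAME_LEN - OVERLAP

def frame_and_window(signal):
    """Split a 1-D signal into overlapping frames and apply a Hamming window."""
    window = np.hamming(FRAME_LEN)   # 0.54 - 0.46*cos(2*pi*n/(N-1)), cf. Eq. (1)
    n_frames = 1 + (len(signal) - FRAME_LEN) // HOP
    frames = np.stack([
        signal[i * HOP:i * HOP + FRAME_LEN] * window
        for i in range(n_frames)
    ])
    return frames   # shape: (n_frames, 256)
```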
3.2 Feature Extraction
Each of the clips was used for extraction of the standard line spectral frequency (LSF) features at frame level. LSF [11] was chosen due to its higher quantization power [10]. Here, a sound signal is represented as the output of a filter \(H(z)\) whose inverse is \(G(z)\), where \(G_{1 \ldots m}\) are the predictive coefficients:

\(H(z) = \frac{1}{G(z)}, \quad G(z) = 1 + G_1 z^{-1} + G_2 z^{-2} + \ldots + G_m z^{-m}\)

The LSFs are derived by decomposing \(G(z)\) into \(G_x(z)\) and \(G_y(z)\), which are detailed below:

\(G_x(z) = G(z) + z^{-(m+1)} G(z^{-1})\)

\(G_y(z) = G(z) - z^{-(m+1)} G(z^{-1})\)
We extracted 5-, 10-, 15-, 20- and 25-dimensional features for the frames. Each of these dimensions corresponds to bands, that is, 5-dimensional LSFs have 5 bands and so on. Next, these bands were graded in accordance with the total value of the coefficients. This band sequence was used as the feature; it depicted the energy distribution pattern across the bands. Along with this, the mean and standard deviation of the per-frame spectral centroids were also appended. When 5-dimensional LSFs were extracted, a total of \(5 \times 440=2200\) coefficients was obtained for a clip of only 1 s (a 1 s clip produced 440 frames). This dimension varied with the different lengths of the clips. The band grades along with the mean and standard deviation of the centroids produced a \(5+2=7\) dimensional feature when 5-dimensional LSFs were extracted, and this was independent of the clip lengths. So finally we obtained features of 7, 12, 17, 22 and 27 dimensions.
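The sketch below illustrates one possible reading of this clip-level feature: per-frame LSFs obtained from LPC coefficients, bands graded by their total coefficient value over the clip, and the mean and standard deviation of the spectral centroid appended. The use of librosa and the exact grading rule are assumptions on our part, not the authors' released code.

```python
# Sketch of the clip-level feature: band grades from per-frame LSFs plus the
# mean and standard deviation of the spectral centroid.
import numpy as np
import librosa

def lsf_from_frame(frame, order):
    """Line spectral frequencies of one windowed frame via LPC root finding."""
    a = librosa.lpc(frame, order=order)      # [1, G_1, ..., G_m]
    a_pad = np.concatenate([a, [0.0]])
    p = a_pad + a_pad[::-1]                  # G_x(z): symmetric polynomial
    q = a_pad - a_pad[::-1]                  # G_y(z): antisymmetric polynomial
    angles = np.concatenate([np.angle(np.roots(p)), np.angle(np.roots(q))])
    lsf = np.sort(angles[(angles > 0) & (angles < np.pi)])   # upper half circle
    return lsf[:order]

def clip_features(frames, signal, sr, order=5):
    """Band grades + centroid mean/std, e.g. a 5+2 = 7-dimensional feature."""
    lsfs = np.array([lsf_from_frame(f, order) for f in frames])  # (n_frames, order)
    band_totals = lsfs.sum(axis=0)           # total coefficient value per band
    band_grades = np.argsort(-band_totals)   # bands ranked high-to-low (assumed rule)
    centroid = librosa.feature.spectral_centroid(y=signal, sr=sr)[0]
    return np.concatenate([band_grades, [centroid.mean(), centroid.std()]])
```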
3.3 Long Short Term Memory-Recurrent Neural Network (LSTM-RNN) Based Classification
LSTM-RNNs can preserve states, unlike standard neural networks [9], which makes them suitable for sequences. They further solve the vanishing gradient problem of simple RNNs [8]. An LSTM block comprises a cell state and three gates, namely the forget gate, the input gate and the output gate. The input gate (\(i_n\)) helps to generate the new state:

\(i_n = \sigma (Wt_i \cdot [h_{n-1}, x_n] + b_i)\)

where \(Wt_i\) is the associated weight and \(b_i\) the bias. The forget gate discards values from the previous state to the present state:

\(f_n = \sigma (Wt_f \cdot [h_{n-1}, x_n] + b_f)\)

where \(Wt_f\) is the associated weight and \(b_f\) the bias. The output gate determines the next state as shown below:

\(o_n = \sigma (Wt_o \cdot [h_{n-1}, x_n] + b_o)\)
where \(Wt_o\) is the associated weight and \(b_o\) the bias. Our network comprised a 100-dimensional LSTM layer. The output of this layer was passed through three fully connected layers of dimensions 100, 50 and 25 respectively, all with ReLU activation. The final layer was a 4-dimensional fully connected layer with softmax activation. We initially used 5-fold cross validation with 100 epochs in our experiments, and the network parameters were set after trials.
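A sketch of this classifier in Keras is shown below; the layer sizes follow the text, while the optimizer, loss, batch size and the way a clip feature vector is arranged as a sequence are our assumptions.

```python
# Sketch of the LSTM-RNN classifier: a 100-unit LSTM layer, fully connected
# layers of 100, 50 and 25 units (ReLU) and a 4-way softmax output.
from tensorflow import keras
from tensorflow.keras import layers

def build_model(timesteps, feat_dim, n_chords=4):
    model = keras.Sequential([
        keras.Input(shape=(timesteps, feat_dim)),
        layers.LSTM(100),
        layers.Dense(100, activation="relu"),
        layers.Dense(50, activation="relu"),
        layers.Dense(25, activation="relu"),
        layers.Dense(n_chords, activation="softmax"),
    ])
    # Optimizer and loss are assumptions; the paper does not state them.
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Example: 22-dimensional clip features treated here as a length-1 sequence
# model = build_model(timesteps=1, feat_dim=22)
# model.fit(X_train, y_train, epochs=100)
```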
4 Result and Analysis
Each of the feature sets for the datasets D1-D4 was fed to the recurrent neural network, as summarized in Table 3. It is observed that the best result was obtained for the 22-dimensional features on D3. To obtain better results, the training epochs were varied with 5-fold cross validation for the 22-dimensional features of D3, as shown in Table 4. The best performance was obtained for 300 epochs; increasing the training epochs further led to overfitting and thus produced lower results. The confusions among the different classes for 300 epochs are presented in Table 5(a). It is observed that the highest confusion was among the minor chords. The clips were analyzed and it was found that the volunteers at times accidentally muted strings, which interfered with the chord textures in the barred shapes. This could be one probable reason for such confusions.
In order to obtain further improvements, we varied the cross-validation folds for 100 epochs for the 22-dimensional features of D3. The obtained results are presented in Table 6. 20 folds produced the best result, wherein the variation of the dataset was evenly distributed. The performance decreased on further increasing the number of folds. The interclass confusions are presented in Table 5(b), wherein it is observed that the chords C and Em were recognized with 100% accuracy. The confusions among the minor chords were also overcome in this setup. Finally, the best fold value (20 folds) along with the best training epoch count (300 epochs) were combined, which produced an accuracy of 99.91% (overall highest), whose confusions are presented in Table 5(c). The confusions were nearly identical to those of the 20-fold cross-validation setup; only one more instance of the G chord was identified correctly. Some other popular classifiers, including BayesNet (BN), naïve Bayes (NB), multilayer perceptron (MLP), random forest (RF) and radial basis function classifier (RBF) from [6], were also evaluated on D4, whose results are summarized in Table 7.
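The fold and epoch sweeps reported above could be run along the lines of the sketch below, which assumes scikit-learn for the stratified splits and the hypothetical build_model() sketch from Sect. 3.3; the specific fold counts and epoch values follow the text, everything else is illustrative.

```python
# Sketch of a fold/epoch sweep with stratified k-fold cross validation.
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validate(X, y, folds, epochs):
    """Return mean accuracy over a stratified k-fold evaluation."""
    skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in skf.split(X, y.argmax(axis=1)):
        model = build_model(timesteps=1, feat_dim=X.shape[-1])  # from Sect. 3.3 sketch
        model.fit(X[train_idx], y[train_idx], epochs=epochs, verbose=0)
        _, acc = model.evaluate(X[test_idx], y[test_idx], verbose=0)
        scores.append(acc)
    return float(np.mean(scores))

# e.g. the best reported configuration: 20 folds, 300 epochs
# print(cross_validate(X_d3_22d, y_d3, folds=20, epochs=300))
```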
5 Conclusion
Here, a system is presented to distinguish chords from clips of short durations. The system works with an LSTM-RNN-based classification technique and produced encouraging results. In the future, we will experiment with a larger set of chords and involve other instruments as well. We will introduce other tracks along with the chords to observe the system’s performance. We also plan to identify and discard silent sections in the clips to obtain better results. Finally, we will make use of other acoustic features coupled with different modern machine learning techniques to obtain further improvements in our results.
References
Benetos, E., Dixon, S., Duan, Z., Ewert, S.: Automatic music transcription: an overview. IEEE Sig. Process. Mag. 36(1), 20–30 (2018)
Benetos, E., Dixon, S., Giannoulis, D., Kirchhoff, H., Klapuri, A.: Automatic music transcription: challenges and future directions. J. Intell. Inf. Syst. 41(3), 407–434 (2013)
Bereket, M., Shi, K.: An AI approach to automatic natural music transcription (2017)
Cheng, H.T., Yang, Y.H., Lin, Y.C., Liao, I.B., Chen, H.H.: Automatic chord recognition for music classification and retrieval. In: 2008 IEEE International Conference on Multimedia and Expo, pp. 1505–1508. IEEE (2008)
Costantini, G., Casali, D.: Recognition of musical chord notes. WSEAS Trans. Acoust. Music 1(1), 17–20 (2004)
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. ACM SIGKDD Explor. Newsl. 11(1), 10–18 (2009)
Kroher, N., Gómez, E.: Automatic transcription of flamenco singing from polyphonic music recordings. IEEE/ACM Trans. Audio Speech Lang. Process. (TASLP) 24(5), 901–913 (2016)
Li, J., Mohamed, A., Zweig, G., Gong, Y.: LSTM time and frequency recurrence for automatic speech recognition. In: 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 187–191. IEEE (2015)
Lipton, Z.C., Berkowitz, J., Elkan, C.: A critical review of recurrent neural networks for sequence learning. arXiv preprint arXiv:1506.00019 (2015)
Mukherjee, H., Dutta, M., Obaidullah, S.M., Santosh, K.C., Phadikar, S., Roy, K.: Lazy learning based segregation of Top-3 south indian languages with LSF-a feature. In: Santosh, K.C., Hegadi, R.S. (eds.) RTIP2R 2018. CCIS, vol. 1035, pp. 449–459. Springer, Singapore (2019). https://doi.org/10.1007/978-981-13-9181-1_40
Mukherjee, H., Obaidullah, S.M., Santosh, K., Phadikar, S., Roy, K.: Line spectral frequency-based features and extreme learning machine for voice activity detection from audio signal. Int. J. Speech Technol. 21(4), 753–760 (2018)
Muludi, K., Loupatty, A.F.S., et al.: Chord identification using pitch class profile method with fast fourier transform feature extraction. Int. J. Comput. Sci. Issues (IJCSI) 11(3), 139 (2014)
Osmalsky, J., Embrechts, J.J., Van Droogenbroeck, M., Pierard, S.: Neural networks for musical chords recognition. In: Journees d’informatique Musicale, pp. 39–46 (2012)
Quenneville, D.: Automatic Music Transcription. Ph.D. thesis, Middlebury College (2018)
Rajpurkar, P., Girardeau, B., Migimatsu, T.: A supervised approach to musical chord recognition (2015)
Spotify, 6 Apr 2019. https://insights.spotify.com/us/2015/05/06/most-popular-keys-on-spotify/
Wats, N., Patra, S.: Automatic music transcription using accelerated multiplicative update for non-negative spectrogram factorization. In: 2017 International Conference on Intelligent Computing and Control (I2C2), pp. 1–5. IEEE (2017)
Zhou, X., Lerch, A.: Chord detection using deep learning. In: Proceedings of the 16th ISMIR Conference, vol. 53 (2015)
Acknowledgement
The authors would like to thank Mr. Soukhin Bhattacherjee and Mr. Debajyoti Bose of the Department of Electrical, Power & Energy, University of Petroleum and Energy Studies, for their help during the entire course of this work. They also thank WWW.PresentationGO.com for the block diagram template.