
5th Online International Conference on Advanced Engineering and ICT - Conference Proceedings (AEICP)

ISSN: 2635-4586 August 13, 2020



Audio File Recognition Using Hash Algorithm

Sachit Satyal1, Sanjivan Satyal2

1 Kathmandu Engineering College, Tribhuvan University, Kalimati, Kathmandu, Nepal
2 Institute of Engineering, Tribhuvan University, Pulchowk, Kathmandu, Nepal

Sawcheet@gmail.com1, sanziwans@pcampus.edu.np2

Abstract
Meta-Music is a concept that aims to reduce the time a person spends managing his or her music library. Time is of the essence, and individuals would prefer not to squander it on things that can be automated in this day and age; this is the motivation for automating the task with a straightforward yet intuitive music recognition approach based on a hash algorithm. Managing a music library with hundreds or thousands of songs can be extremely tedious: finding all the metadata a song requires, such as album, release date, artist name, and lyrics, is difficult and time-consuming. This paper attempts to locate the correct information for the songs in the library. It utilizes audio fingerprinting and recognition algorithms to locate the best match for a tune and retrieve all the important metadata of the music.

Keywords: Spectrogram, Metadata, Audio Fingerprint, Hash Algorithm

1. Introduction
Music is a crucial part of our lives, and the role it plays is more than significant. Music shapes cultural movements and unites people across generations, so there is no denying that the growth of music streaming services in recent years has been phenomenal, with services like Spotify, Apple Music, Google Play Music, and Deezer accounting for more than 51 percent of the market share for album consumption. Many countries have moved from local music storage to cloud-based streaming services, which means most people nowadays do not even store music locally on their phones. All the music is stored in the cloud, and people stream it whenever they want. For cases where there is no internet connection, they can still take their music offline whenever they want, including all the metadata related to the song itself. Having full details about their music is a must for many people, which is one of the many advantages of using streaming services. So, if users want to know the singer of a song, its lyrics, or something similar, they can simply look at the metadata provided by the streaming service [1].
In most developing nations, users are not fortunate enough to have access to these services. So, alternatively, people download their music from the web, which does not include any metadata about the song. This can frustrate many music enthusiasts, and gathering information on every song can be exceptionally tedious for a large music library. This is why this paper explains a procedure to automate that task for music enthusiasts with our straightforward yet intuitive Meta-Music.

2. Related Work
Back in 2003, inventor Avery Li-Chun Wang revealed an audio search algorithm to the world. He and his colleagues had developed a flexible audio search engine and commercially deployed it. The algorithm was noise- and distortion-resistant, computationally efficient, and massively scalable, capable of quickly identifying a short segment of music captured through a cell-phone microphone, in the presence of foreground voices and other dominant noise and through voice-codec compression, out of a database of over a million tracks. The algorithm used a combinatorially hashed time-frequency constellation analysis of the audio, yielding unusual properties such as transparency, in which multiple tracks mixed together may each be identified. Furthermore, for applications such as radio monitoring, search times on the order of a few milliseconds per query were attained, even on a massive music database. This was a hard problem, and at the time no known algorithm could satisfy all these constraints, so they eventually developed their own technique that met all the operational requirements.
Each audio file is "fingerprinted," a process in which a reproducible hash is extracted. Both "database" and "sample" audio files are subjected to the same analysis. The fingerprints from the unknown sample are matched against a large set of fingerprints derived from the music database, and the candidate matches are subsequently evaluated for correctness. Some guiding principles for the attributes to use as fingerprints are that they should be temporally localized, translation-invariant, robust, and sufficiently entropic. The temporal locality guideline suggests that each fingerprint hash is calculated using audio samples near a corresponding point in time, so that distant events do not affect the hash [2]. Generating a signature from the audio is essential for searching by sound. One common technique is creating a time-frequency graph called a spectrogram; any piece of audio can be translated to a spectrogram. Each piece of audio is split into segments over time. In some cases adjacent segments share a common time boundary, and in other cases adjacent segments overlap. The result is a graph that plots three dimensions of audio: frequency vs. amplitude (intensity) vs. time [3]. In the case of Shazam, the algorithm then picks out points where there are peaks in the graph, labeled as "higher energy content"; in practice, this seems to work out to about three points per song [2]. Focusing on peaks in the audio greatly reduces the impact that background noise has on audio identification. Shazam builds its fingerprint catalog as a hash table, where the key is the frequency. It does not mark just a single point in the spectrogram; rather, it marks a pair of points: the "peak intensity" plus a second "anchor point" [4]. This method of acoustic fingerprinting allows applications such as Shazam to differentiate between two closely related covers of the same song.
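To make the matching step concrete, a minimal Python sketch (an illustration, not Shazam's implementation) can store the fingerprints of known tracks in an in-memory hash table and identify a query by voting on consistent time-offset differences; the names index_song and match_sample are hypothetical.

from collections import defaultdict

# Hypothetical in-memory "database": hash -> list of (song_id, absolute_offset) pairs
database = defaultdict(list)

def index_song(song_id, fingerprints):
    # Store the (hash, offset) fingerprints of a known track
    for h, offset in fingerprints:
        database[h].append((song_id, offset))

def match_sample(sample_fingerprints):
    # Vote for the song whose stored offsets align most consistently with the sample
    votes = defaultdict(int)
    for h, sample_offset in sample_fingerprints:
        for song_id, db_offset in database.get(h, []):
            # A true match yields many identical (song, offset-difference) pairs
            votes[(song_id, db_offset - sample_offset)] += 1
    if not votes:
        return None
    (song_id, _), count = max(votes.items(), key=lambda kv: kv[1])
    return song_id, count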

3. System Description
3.1. Short Time Fourier Transform
The classical Fourier transform (FT) is a well-known and widely used linear operator that is good for extracting the frequency-domain behavior of signals. However, the FT assumes that the analyzed signal is stationary and carries no time information. When the time-frequency (TF) behavior of musical signals is examined, they are seen to have non-stationary characteristics. The short-time Fourier transform (STFT) overcomes the lack of time information by considering an analysis window that has a specific time-frequency resolution property. Wavelets are not used for this purpose because they generally produce better results for high-frequency and/or short-duration content.
In the STFT, a window, which is a function that is zero-valued outside of some chosen interval, is employed to extract time information. During the STFT, the signal is multiplied by the window function, and the product is zero everywhere outside the window. What remains is the sub-signal located under the window, and the FT of this sub-signal is calculated. The midpoint of the window is then shifted and a new part of the original signal is processed, giving the frequency content of the sub-signal under the shifted window. This shifting is repeated until the end of the original signal, and in this way the complete TF representation of the signal is obtained. The STFT can be written as

\mathrm{STFT}\{x\}(\tau, \omega) = \int_{-\infty}^{\infty} x(t)\, w(t - \tau)\, e^{-j\omega t}\, dt

where w(t) is the window function and \tau is the shifting parameter. Another important parameter of the STFT is the overlap ratio, which measures the amount of information shared by consecutive window locations [5].
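As a rough illustration of this procedure (an assumed implementation sketch, not the code used in the paper), the discrete STFT can be computed by sliding a window over the signal and taking the FFT of each windowed segment:

import numpy as np

def stft(x, window_size=1024, overlap_ratio=0.5):
    # Slide a Hann window over x and take an FFT of each windowed segment
    window = np.hanning(window_size)
    hop = int(window_size * (1 - overlap_ratio))          # shift of the window position
    frames = []
    for start in range(0, len(x) - window_size + 1, hop):
        segment = x[start:start + window_size] * window   # zero outside the window
        frames.append(np.fft.rfft(segment))               # FT of the windowed sub-signal
    return np.array(frames).T   # rows: frequency bins, columns: window positions (time)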
3.2 Spectrogram
A spectrogram is a visual representation of signal strength over time at the different frequencies present in a waveform. It is a two-dimensional graph in which time is shown along the horizontal axis, frequency along the vertical axis, and the amplitude of the frequency components at a particular time is indicated by the intensity or color of that point in the graph. Low amplitudes are represented by dark blue colors and stronger (louder) amplitudes by brighter colors up through red. The spectrogram is computed by applying the Fast Fourier Transform (FFT) to the signal, which forms a time-frequency representation. To discover the frequencies present at each moment, the signal is divided into small chunks and the FFT is applied to the waveform of each chunk.
Spectrograms have been used in a variety of speech analysis tasks, including sound event classification, speaker recognition, speech emotion recognition (SER), and speech recognition, and their suitability for acoustic content representation has been exhibited in these prior works [3]. In this work, the spectrogram is obtained to present the frequency components at different time periods. The high-frequency components of the audio sample are identified by local-maxima comparison subject to a minimum amplitude value.
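For concreteness, such a time-frequency representation can be produced with standard tooling; the sketch below applies scipy.signal.spectrogram to a synthetic 440 Hz tone and is only illustrative of the representation described above.

import numpy as np
from scipy.signal import spectrogram

fs = 44100                               # sampling rate in Hz
t = np.arange(0, 2.0, 1 / fs)
signal = np.sin(2 * np.pi * 440 * t)     # a synthetic 440 Hz tone as stand-in audio
# f: frequency axis (Hz), times: centres of the chunks (s), Sxx: amplitude per (frequency, time) cell
f, times, Sxx = spectrogram(signal, fs=fs, nperseg=4096, noverlap=2048)
print(Sxx.shape)                         # (number of frequency bins, number of time chunks)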

3.3 SHA-1 Algorithm


The Secure Hash Algorithm (SHA) was developed by the National Security Agency (NSA) and published in 1993 by the National Institute of Standards and Technology (NIST) as a U.S. Federal Information Processing Standard (FIPS PUB 180). Among other things, the design of SHA introduced a new procedure which expands the 16-word message block input to the compression function into an 80-word block. The SHA-1 algorithm accepts as input a message with a maximum length of 2^64 - 1 bits and produces a 160-bit message digest as output [6]. For instance, a message X is divided into blocks X1, X2, X3, ..., Xn with suitable padding to separate the blocks. The blocks are then fed through the compression function, iterating until all blocks have been processed, which gives the final hash. This type of hash is used in the project because it is very simple to implement; even though it has known security weaknesses for other purposes, such as password hashing, it was used here because the hash serves only as a basis for identifying music uniquely and has no special security purpose. Furthermore, it has a limited length (160 bits), which makes it easy to handle and store.
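A quick sketch of this usage with Python's standard hashlib module; the string being hashed is a hypothetical peak-pair encoding, not the exact layout used in the paper.

import hashlib

message = "4120|4387|13"                 # hypothetical "freq1|freq2|time-delta" string
digest = hashlib.sha1(message.encode("utf-8"))
print(digest.hexdigest())                # 40 hexadecimal characters
print(digest.digest_size * 8)            # 160 bits, the fixed SHA-1 output length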

4. Methodology
4.1 Block Diagram of the System
First, the song file whose metadata is to be fetched is read by the software, and it is checked whether the song already has both an artist name and a song name as metadata. If both are present, the remaining metadata related to the song is fetched from the iTunes and Genius servers and added to the properties of the song file. Otherwise, the song is forwarded to the fingerprinting module, where the audio fingerprint of the song is generated and compared against the fingerprints stored in our database. When a similar fingerprint is detected, the artist name and song name corresponding to the fingerprint hash value are retrieved from the database and sent to the iTunes and Genius servers to fetch the remaining metadata.

Figure 1: Simplified Diagram of the System
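The flow of Figure 1 can be summarised as the following sketch; every callable passed in (read_tags, fingerprint, lookup_in_database, fetch_metadata, write_tags) is a hypothetical placeholder rather than a function from the actual implementation.

def process_song(path, read_tags, fingerprint, lookup_in_database, fetch_metadata, write_tags):
    # Sketch of the Figure 1 flow with injected placeholder callables
    artist, title = read_tags(path)                  # existing metadata in the file, if any
    if not (artist and title):
        hashes = fingerprint(path)                   # generate fingerprint hashes for the audio
        artist, title = lookup_in_database(hashes)   # match against stored fingerprints
    metadata = fetch_metadata(artist, title)         # remaining metadata from iTunes/Genius
    write_tags(path, metadata)                       # write the properties back into the song file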

4.2 Classification Algorithm and Method


The system uses a combinatorially hashed time-frequency pattern to recognize any song uniquely, based on the frequency maxima and time offsets obtained from the spectrogram of the given audio sample, which is generated by applying the short-time Fourier transform to the audio file. The hash is generated through SHA-1 (Secure Hash Algorithm) encoding, which takes the string, converts it to bytes, and produces a hash of the desired length that uniquely identifies the song.
Audio files are "fingerprinted" to produce a reproducible hash, which is then compared with the "database" samples. The unknown samples are compared against a large set of fingerprints from the music database and evaluated for correctness. The main guiding principles for the attributes used as fingerprints are that they should be temporally localized, translation-invariant, robust, and entropic. The temporal locality guideline suggests that each fingerprint hash is calculated using audio samples near a corresponding point in time, so that distant events do not affect the hash.
First, the STFT is applied to the given audio sample with a sampling rate defined by the Nyquist criterion. A spectrogram is then obtained, showing the frequency components at different time periods. The high-frequency components of the audio sample are identified by local-maxima comparison subject to a minimum amplitude value. Fingerprint hashes are obtained by combinatorial hashing, in which an anchor point is chosen and paired with several other high-frequency components (the number of pairings defines the fan-out factor), yielding two frequency components and the time offset between them. Each hash can be stored as a 32-bit unsigned integer, together with a 32-bit time offset that locates its anchor point relative to the beginning of the track. Database indexes are created by running the same procedure on each audio track.
The songs were stored locally in a PostgreSQL database in two tables, where one table stored the hash and offset values of the songs, and the other stored the metadata of the songs associated with the hash values.
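An illustrative reconstruction of this combinatorial hashing step (a sketch, not the authors' code): each anchor peak is paired with up to fan_value later peaks, and the two frequencies plus the time offset between them are hashed with SHA-1; the string layout and the truncation length are assumptions.

import hashlib

FAN_VALUE = 15            # how many later peaks each anchor is paired with
HASH_CUTOFF = 20          # assumed truncation of the hex digest for compact storage

def generate_hashes(peaks, fan_value=FAN_VALUE):
    # peaks: list of (frequency_bin, time_bin) tuples sorted by time
    for i, (f1, t1) in enumerate(peaks):
        for j in range(1, fan_value + 1):
            if i + j >= len(peaks):
                break
            f2, t2 = peaks[i + j]
            dt = t2 - t1                                     # time offset between the pair
            msg = f"{f1}|{f2}|{dt}".encode("utf-8")
            h = hashlib.sha1(msg).hexdigest()[:HASH_CUTOFF]
            yield h, t1                                      # offset from the start of the track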

5. Experimental Framework
The experiment was implemented in Python with a PostgreSQL database to store songs locally. A sampling rate of 44,100 Hz was set as the default for sampling each signal, with a window size of 4096 samples for the short-time Fourier transform applied to each audio sample to obtain the spectrogram showing the relation between frequency and amplitude variation over time. The overlap ratio between consecutive windows was set to 0.5. A higher overlap ratio allows higher granularity but results in potentially more fingerprints.
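These settings can be captured in a small configuration block (the constant names are hypothetical; the values are the ones reported above):

DEFAULT_FS = 44100                     # sampling rate in Hz
DEFAULT_WINDOW_SIZE = 4096             # STFT window size in samples
DEFAULT_OVERLAP_RATIO = 0.5            # fraction shared by consecutive windows
NOVERLAP = int(DEFAULT_WINDOW_SIZE * DEFAULT_OVERLAP_RATIO)   # 2048 overlapping samples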
The fan value indicates the degree to which a fingerprint can be paired with its neighbors. A higher fan value means more comparisons and potentially more fingerprints, and thus more cost, but potentially better accuracy in determining the song. A default fan value of 15 was chosen, which is a good value for a normal song track. The peaks were then temporally sorted and the hash derived through SHA-1 encoding.
The peak frequencies and the time offset between them were combined to yield the hash, which was then converted to hexadecimal format with a fingerprint reduction of 20 bits, i.e., 20 bits were removed from the front of the SHA-1 hash during fingerprint calculation. These hashes were then compared to detect music with a similar hash in the large database pool and thus fetch the required metadata of the music.
As noted above, the local database has exactly two tables, named fingerprints and songs. The fingerprints table stores the hash value and offset, and the songs table stores information about each song, i.e., song ID, song name, and fingerprint status (whether or not the song has been fingerprinted).
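A sketch of what these two tables might look like, using the psycopg2 driver as an assumption; only the table names, the hash and offset columns, and the song ID, song name, and fingerprint status fields come from the text, while the exact types, the index, and the connection string are illustrative guesses.

import psycopg2

SCHEMA = """
CREATE TABLE IF NOT EXISTS songs (
    song_id       SERIAL PRIMARY KEY,
    song_name     TEXT NOT NULL,
    fingerprinted BOOLEAN DEFAULT FALSE      -- has this song been fingerprinted yet?
);
CREATE TABLE IF NOT EXISTS fingerprints (
    hash     TEXT NOT NULL,                  -- truncated SHA-1 hex digest
    song_id  INTEGER REFERENCES songs(song_id),
    "offset" INTEGER NOT NULL                -- anchor time offset from the start of the track
);
CREATE INDEX IF NOT EXISTS idx_fingerprints_hash ON fingerprints (hash);
"""

with psycopg2.connect("dbname=metamusic user=postgres") as conn:   # hypothetical connection string
    with conn.cursor() as cur:
        cur.execute(SCHEMA)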

6. Result and Discussion


For any audio sample, hash-time offset pairs were easily generated using the STFT, and most English-language songs were already found in the iTunes and Genius database pool. For some music, such as songs of Nepali origin whose metadata is not available on the servers, hashes can still be produced and the metadata entered manually into the local database for future use.
The accuracy of the hash remained extremely high even though the front 20 bits were trimmed during the SHA-1 hash implementation. Comparing only the 20 neighboring points resulted in much more efficient performance of the algorithm, where the trade-off for speed carried a negligible penalty. Sorting the peaks contributed to fast and efficient determination of the frequency maxima.
Although SHA-1 is a less secure means of hash encoding, its limited length (160 bits) and ease of use made encoding the data much simpler. As this research was only for experimental and observational purposes, it can be replaced with other secure hash algorithms if a commercial implementation requires it. Setting the fan value to only 15 was sufficient for songs, unlike general speech and other audio sources, which need a much higher fan-out factor to generate efficient, usable hashes.
The experiment helped to determine the efficiency with which the STFT can be applied to various music samples to extract the unique frequency components of the audio wave over various time intervals, giving each track a unique universal identification by which it can be distinguished from other audio.

Figure 2: A sample waveform after STFT, showing various frequency components

The figure above shows several high-frequency regions as well as regions with very low frequency. The maximum frequency within the high-frequency regions is found by comparing each point with its neighbors and selecting the local maxima, i.e., the points whose frequency content is highest among their neighbors. These selected points are used when generating the hash because they are robust and less prone to being affected by noise.
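A minimal sketch of this local-maxima peak picking (the neighborhood size and amplitude threshold are illustrative assumptions):

import numpy as np
from scipy.ndimage import maximum_filter

def pick_peaks(spec, neighborhood=20, min_amplitude=10):
    # Keep points that are the maximum of their neighborhood and above a noise threshold
    local_max = maximum_filter(spec, size=neighborhood) == spec
    strong = spec > min_amplitude
    freq_bins, time_bins = np.where(local_max & strong)
    return sorted(zip(freq_bins, time_bins), key=lambda p: p[1])   # sorted by time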
The database tables, fingerprints and songs, are as follows. The PostgreSQL database had the structure and fields shown in the figures below:

Figure 3: fingerprint database table structure

Figure 4: Song database table structure



7. Conclusion
By taking the basic frequency components and Fourier transform theory as the base, any audio can be easily analyzed, identified, and compared with other similar audio to retrieve various unknown information about it. The results support that combinatorial hashing can be used for fast examination and identification of similar music files within a very small time frame and with great efficiency. The method is simple yet effective to implement, and it can be of great use in most developing nations.

References
[1] [Online]. Available: https://www.cnet.com/how-to/best-music-streaming-service-for-2020-spotify-apple-music-amazon-and-more/. [Accessed: 14-Apr-2020].
[2] A. Wang, "An Industrial-Strength Audio Search Algorithm," 2003. [Online]. Available: http://www.ee.columbia.edu/~dpwe/papers/Wang03-shazam.pdf. [Accessed: 14-Apr-2020].
[3] A. M. Badshah, J. Ahmad, N. Rahim, and S. W. Baik, "Speech Emotion Recognition from Spectrograms with Deep Convolutional Neural Network," 2017 International Conference on Platform Technology and Service (PlatCon), Busan, 2017, pp. 1-5.
[4] "How Shazam Works," 2009. [Online]. Available: http://laplacian.wordpress.com/2009/01/10/how-shazam-works. [Accessed: 14-Apr-2020].
[5] A. Elbir, H. O. Ilhan, G. Serbes, and N. Aydin, "Short Time Fourier Transform based music genre classification," 2018 Electric Electronics, Computer Science, Biomedical Engineerings' Meeting (EBBT), Istanbul, 2018, pp. 1-4.
[6] H. B. Pethe and S. R. Pande, "An overview of Cryptographic Hash Functions MD-5 and SHA," 2016 National Conference on Recent Trends in Computer Science and Information Technology, India, 2016, pp. 37-42.
