Speech and Audio Coding
Sponsored by the NSF Combined Research and Curriculum Development Grant 0417604
April 2006 Copyright (c) 2006 - Andreas Spanias II-1
[Diagram: Pedagogies for transition of research to the UG curriculum. Demo modules (DM) are drawn from ASU SP-COM research activities and from published research work from other universities; ASU J-DSP technology for on-line Java computer labs; summer freshman and sophomore research camps; feedback/improvement channel back to SP-COM research.]
by
Andreas Spanias, Professor
DSP and Speech Processing Labs.
Dept. of Electrical Engineering
Arizona State University
Tempe, AZ 85287-5706
email: spanias@asu.edu
http://www.eas.asu.edu/~spanias
5. Algorithm Examples
6. Research / Remarks
Digital Speech
$$s(n) = s(nT) = s_a(t)\big|_{t = nT}$$
- Can be Manipulated with Software
- Error Control
[Figure: sampling instants 0, T, 2T, ... on the time axis t (sample index n); a quantizer Q maps x(t) to x(n).]
Quantization Considerations
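As a minimal sketch of the sampling relation s(n) = s(nT) and of the quantizer Q, the Python fragment below samples a continuous-time function and applies a uniform mid-rise quantizer. The function names, the 8 kHz sampling rate, the 100 Hz test tone, and the choice of a mid-rise characteristic are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def sample(analog, T, num_samples):
    """Return s(n) = s_a(nT) for n = 0 .. num_samples-1 (analog is a callable)."""
    n = np.arange(num_samples)
    return analog(n * T)

def uniform_quantize(x, num_bits, x_max=1.0):
    """Uniform mid-rise quantizer on [-x_max, x_max) with 2**num_bits levels."""
    levels = 2 ** num_bits
    step = 2.0 * x_max / levels
    # Integer cell index, clipped so overload inputs map to the outermost cells.
    idx = np.clip(np.floor(np.asarray(x, dtype=float) / step),
                  -levels // 2, levels // 2 - 1)
    # Reconstruct at the midpoint of each cell.
    return (idx + 0.5) * step

# Assumed example: 100 Hz cosine sampled at 8 kHz, quantized to 8 bits.
T = 1.0 / 8000.0
s = sample(lambda t: np.cos(2 * np.pi * 100 * t), T, 160)
s_q = uniform_quantize(s, num_bits=8)
noise = s - s_q
snr_db = 10 * np.log10(np.sum(s ** 2) / np.sum(noise ** 2))
```

Each added bit halves the step size, so for inputs that stay inside the quantizer range the measured `snr_db` should rise by roughly 6 dB per bit.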
[Figure: Dudley's vocoder block diagram. Labeled blocks: pitch channel, oscillator, noise, modulator, spectrum channels, filters (0-300~, 0-25~), and equalizers (EQLZR).]
H. Dudley, "Remaking Speech," J. Acoust. Soc. Am., Vol. 11, p. 169, 1939.
H. Dudley, "The Vocoder," Bell Labs. Record., 17, p. 122, 1939.
[Figure: two speech segments, each shown as a time-domain waveform (Amplitude vs. Time in ms, 0 to 32 ms) and a spectrum (Magnitude in dB vs. Frequency in kHz, 0 to 4 kHz). The first spectrum is annotated "Formant Structure".]
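[Figure: source-system synthesis model. A V/UV (voiced/unvoiced) switch selects the excitation, which is scaled by a gain and passed through the VOCAL TRACT FILTER to produce SYNTHETIC SPEECH.]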
$$H(z) = \frac{b_0}{1 + \sum_{i=1}^{M} a_i z^{-i}}$$
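The all-pole model H(z) = b0 / (1 + sum of a_i z^-i) can be illustrated with a short Python sketch that drives the filter with either a periodic impulse train (voiced) or white noise (unvoiced). The function names, the pitch period, and the coefficient values are hypothetical and chosen only for illustration.

```python
import numpy as np

def voiced_excitation(num_samples, pitch_period):
    """Unit impulse train with one pulse every pitch_period samples."""
    e = np.zeros(num_samples)
    e[::pitch_period] = 1.0
    return e

def unvoiced_excitation(num_samples, seed=0):
    """White Gaussian noise excitation (seeded for repeatability)."""
    return np.random.default_rng(seed).standard_normal(num_samples)

def all_pole_synthesis(excitation, a, b0=1.0):
    """y(n) = b0*e(n) - sum_{i=1..M} a_i*y(n-i), i.e. H(z) = b0 / (1 + sum a_i z^-i)."""
    y = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = b0 * excitation[n]
        for i, a_i in enumerate(a, start=1):
            if n - i >= 0:
                acc -= a_i * y[n - i]
        y[n] = acc
    return y

# Assumed example: stable 2nd-order filter, 80-sample pitch period (100 Hz at 8 kHz).
a = [-1.2, 0.8]
voiced = all_pole_synthesis(voiced_excitation(240, 80), a, b0=1.0)
unvoiced = all_pole_synthesis(unvoiced_excitation(240), a, b0=0.1)
```

Note the sign convention: with the plus sign in the denominator of H(z), the recursion subtracts the weighted past outputs.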
Recursion over the order index m:
$$\epsilon^{f}(0) = r_{ss}(0)$$
$$a_m(m) = \frac{r_{ss}(m) - \sum_{i=1}^{m-1} a_i(m-1)\, r_{ss}(m-i)}{\epsilon^{f}(m-1)}$$
$$a_i(m) = a_i(m-1) - a_m(m)\, a_{m-i}(m-1), \quad 1 \le i \le m-1$$
$$\epsilon^{f}(m) = \left(1 - (a_m(m))^2\right)\epsilon^{f}(m-1)$$
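The order-recursive computation of the predictor coefficients can be sketched in Python as follows. This is a minimal illustration that assumes the autocorrelation values r_ss(0..M) are already available and that the coefficients follow the predictor convention s(n) ≈ sum of a_i s(n-i); the function names are hypothetical.

```python
import numpy as np

def autocorr(s, max_lag):
    """Autocorrelation r_ss(m) = sum_n s(n) s(n+m) for m = 0 .. max_lag."""
    s = np.asarray(s, dtype=float)
    return np.array([np.dot(s[:len(s) - m], s[m:]) for m in range(max_lag + 1)])

def levinson_durbin(r, order):
    """Return (a_1..a_order, final prediction error) from autocorrelation r."""
    a = np.zeros(order + 1)          # a[i] holds a_i(m); a[0] is unused
    err = r[0]                       # eps_f(0) = r_ss(0)
    for m in range(1, order + 1):
        # a_m(m) = (r(m) - sum_{i=1}^{m-1} a_i(m-1) r(m-i)) / eps_f(m-1)
        k = (r[m] - sum(a[i] * r[m - i] for i in range(1, m))) / err
        prev = a.copy()
        a[m] = k
        # a_i(m) = a_i(m-1) - a_m(m) a_{m-i}(m-1), 1 <= i <= m-1
        for i in range(1, m):
            a[i] = prev[i] - k * prev[m - i]
        # eps_f(m) = (1 - a_m(m)^2) eps_f(m-1)
        err = (1.0 - k ** 2) * err
    return a[1:], err

# Autocorrelation of a first-order process with r(m) = 0.5**m.
coeffs, final_err = levinson_durbin(np.array([1.0, 0.5, 0.25]), order=2)
```

For this input the recursion gives a_1 = 0.5 and a_2 = 0, with a final error of 0.75, as expected for a first-order process.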
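[Figure: analysis-by-synthesis linear predictive coder. An excitation is selected or formed, scaled by a gain, and passed through the LTP loop A_L(z) and the LP loop A(z) to produce the synthetic speech ŝ(n). The difference between s(n) and ŝ(n) is filtered by W(z) and its MSE is minimized.]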
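$$e_c(k) = s_w - \hat{s}_w^{0} - g_k\, \hat{s}_w(k)$$
$$g_k = \frac{s_w^{T}\, \hat{s}_w(k)}{\hat{s}_w^{T}(k)\, \hat{s}_w(k)}$$
$$\epsilon_c(k) = s_w^{T} s_w - \frac{\left(s_w^{T}\, \hat{s}_w(k)\right)^2}{\hat{s}_w^{T}(k)\, \hat{s}_w(k)}$$
In the gain and error expressions, $s_w$ denotes the target after the term $\hat{s}_w^{0}$ has been subtracted.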
M.R. Schroeder and B. Atal, "Code-Excited Linear Prediction (CELP): High Quality Speech at
Very Low Bit Rates," Proc. ICASSP-85, p. 937, Tampa, Apr. 1985.
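The gain and error expressions for the k-th codebook entry suggest an exhaustive search of the following form. This is a hedged Python sketch, not the procedure of any particular standard: it assumes the candidate vectors have already been passed through the weighted synthesis filter, that the target already has the ŝ_w^0 term removed, and the function name is hypothetical.

```python
import numpy as np

def search_codebook(target, filtered_codebook):
    """Return (best index k, gain g_k, error eps_c(k)) over all candidates.

    target            : weighted target vector s_w (s_w^0 term already removed)
    filtered_codebook : iterable of filtered candidate vectors s_w_hat(k)
    """
    target = np.asarray(target, dtype=float)
    target_energy = target @ target
    best_k, best_gain, best_err = None, 0.0, np.inf
    for k, cand in enumerate(filtered_codebook):
        cand = np.asarray(cand, dtype=float)
        energy = cand @ cand
        if energy == 0.0:
            continue                      # an all-zero candidate cannot match
        corr = target @ cand
        gain = corr / energy              # g_k
        err = target_energy - corr ** 2 / energy   # eps_c(k)
        if err < best_err:
            best_k, best_gain, best_err = k, gain, err
    return best_k, best_gain, best_err

# Toy codebook: the second entry is a scaled copy of the target.
target = np.array([1.0, 2.0, 3.0])
codebook = np.array([[1.0, 0.0, 0.0], [2.0, 4.0, 6.0], [0.0, 1.0, 0.0]])
k, g, e = search_codebook(target, codebook)
```

Minimizing the error is equivalent to maximizing the ratio of squared correlation to candidate energy, which is why the target energy needs to be computed only once.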
[Figure: frequency-response plots for the filters below; axis tick values omitted.]

$$\frac{1}{1 - 0.95\, z^{-30}}$$

Short Term Predictor
$$H(z) = \frac{1}{1 - \sum_{i=1}^{10} a_i z^{-i}}$$

Perceptual Filter ($\gamma = 0.9$)
$$W(z) = \frac{1 - \sum_{i=1}^{p} a_i z^{-i}}{1 - \sum_{i=1}^{p} \gamma^{i} a_i z^{-i}}$$
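A small Python sketch of the perceptual filter W(z) follows. It builds the numerator and denominator polynomials from a set of predictor coefficients a_i and applies the filter with a direct-form recursion. The coefficient values and function names are hypothetical; γ = 0.9 is the value quoted on the slide.

```python
import numpy as np

def weighting_filter_coeffs(a, gamma=0.9):
    """Return (num, den) of W(z) = (1 - sum a_i z^-i) / (1 - sum gamma^i a_i z^-i)."""
    a = np.asarray(a, dtype=float)
    powers = gamma ** np.arange(1, len(a) + 1)     # gamma^i for i = 1..p
    num = np.concatenate(([1.0], -a))
    den = np.concatenate(([1.0], -(powers * a)))
    return num, den

def iir_filter(num, den, x):
    """Direct-form filter: den[0]*y(n) = sum num[i] x(n-i) - sum_{j>=1} den[j] y(n-j)."""
    x = np.asarray(x, dtype=float)
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = sum(num[i] * x[n - i] for i in range(len(num)) if n - i >= 0)
        acc -= sum(den[j] * y[n - j] for j in range(1, len(den)) if n - j >= 0)
        y[n] = acc / den[0]
    return y

# Assumed first-order example: a_1 = 0.5, gamma = 0.9.
num, den = weighting_filter_coeffs([0.5], gamma=0.9)
impulse = np.array([1.0, 0.0, 0.0])
h = iir_filter(num, den, impulse)
```

Scaling each a_i by γ^i moves the poles of the denominator toward the origin, which broadens the formant peaks; with γ = 1 the numerator and denominator cancel and W(z) = 1.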
1. Bit rate
Network or toll: quality comparable to classical analog speech (200-3200 Hz).
Communications: somewhat degraded speech quality, but adequate for cellular communications.
Synthetic: usually intelligible, but can be unnatural and associated with a loss of speaker recognizability.
- ETSI TS 126 090 V.3.1.0 2000-01 - AMR SPEECH CODEC TRANSCODING FUNCTIONS 3G-TS 26.090 Technical Specification
- E. Ekudden, R. Hagen, I. Johansson, and J. Svedberg, "The Adaptive Multi-Rate speech coder," Proc. IEEE Workshop on Speech Coding, pp. 117-119, 1999.
• Algorithm to provide higher quality, flexibility, and capacity than the existing IS-96C, IS-127 EVRC, and IS-733 (which replaced IS-96C but works at a higher average rate)
• The Conexant SMV algorithm became the core technology for 3G CDMA (the core SMV algorithm to be refined in the interim by participating companies, according to the publication below)
• Based on 4 codecs: full rate at 8.5 kbps, half rate at 4 kbps, quarter rate at 2 kbps, and eighth rate at 800 bps
• Pre-processing includes noise suppression similar to IS-127 EVRC
• Full rate and half rate are based on Conexant's eXtended CELP (eX-CELP), a core technology also used in the Conexant submission to ITU G.4kbps
• Performed better than IS-733 and IS-127 in tests with and without background noise
• Scored as high as 4.1 MOS at full rate with clean speech; performed very well with background noise
REFERENCES:
[1] Y. Gao, E. Shlomot, A. Benyassine, J. Thyssen, H. Su, and C. Murgia, “The SMV algorithm selected by TIA and 3GPP2 for CDMA applications,” conference paper by Conexant Systems (portions published at ICASSP-2001).
STANDARDS AT A GLANCE
• ITU Telephony
– G.711 PCM (64 kbps) late 60’s
– G.726 ADPCM (32/40/24/16 kbps) 1988
– G.728 LD-CELP coding (16 kbps) 1992
– G.723.1 True Speech (5.3/6.3 kbps) 1995
– G.729 CS-ACELP (8/12.8/6.4 kbps) 1996 and Annex in 1998
– G.4kbps Toll quality at 4 kbps (on going)
• Non-ITU
– MPEG1/Audio (includes MP3), 1991
– MPEG2/Audio: 64 kbps (1992)
– MPEG4/Audio: audio/speech coding at bit rates between 64 and 2 kbps (1998)
– MPEG7/Audio: audio/speech/MIDI coding (ongoing)
• ETSI (GSM):
– 13 kbps RPE-LTP (Full rate GSM, 1988)
– 5.6 kbps VSELP (Half-rate GSM, 1993)
– 12.2 kbps EFR (Enhanced full-rate GSM, 1996)
– 12.2 - 4.75 kbps AMR (Adaptive Multi Rate, 1999)
• ARIB Japan
– Full-rate PDC (Personal Digital Communication) 6.7 kbps VSELP
– Half-rate PDC 3.45 kbps Multimode CELP
Vocoder/Waveform/Hybrid
[Figure: MOS (scale 1-5) versus bit rate (axis ticks 1, 2, 4, 8, 16, 32, 64) for three coder classes. Waveform coders: PCM, ADPCM. Hybrid coders: SMV, CELP. Vocoders: MELP, LPC10e.]
Audio Coding
Selection of sinusoids based on perceptual criteria
T. Painter and A. S. Spanias, "Sinusoidal Analysis-Synthesis of Audio using Perceptual Criteria," Proc. IEEE International Symposium on Circuits and Systems (ISCAS-02), Phoenix, May 2002. Research funded by Intel Corporation.
Enhancing the Bandwidth of Speech Coders, ISCAS05, Visar Berisha, NSF