Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

An Investigation Into the Real-Time Manipulation and Control of Three-Dimensional Sound Fields

2004, University of Derby Doctoral Thesis

This thesis describes a system that can be used for the decoding of a three dimensional audio recording over headphones or two, or more, speakers. A literature review of psychoacoustics and a review (both historical and current) of surround sound systems is carried out. The need for a system which is platform independent is discussed, and the proposal for a system based on an amalgamation of Ambisonics, binaural and transaural reproduction schemes is given. In order for this system to function optimally, each of the three systems rely on providing the listener with the relevant psychoacoustic cues. The conversion from a five speaker ITU array to binaural decode is well documented but pair-wise panning algorithms will not produce the correct lateralisation parameters at the ears of a centrally seated listener. Although Ambisonics has been well researched, no one has, as yet, produced a psychoacoustically optimised decoder for the standard irregular five speaker array as specified by the ITU as the original theory, as proposed by Gerzon and Barton (1992) was produced (known as a Vienna decoder), and example solutions given, before the standard had been decided on. In this work, the original work by Gerzon and Barton (1992) is analysed, and shown to be suboptimal, showing a high/low frequency decoder mismatch due to the method of solving the set of non-linear simultaneous equations. A method, based on the Tabu search algorithm, is applied to the Vienna decoder problem and is shown to provide superior results to those shown by Gerzon and Barton (1992) and is capable of producing multiple solutions to the Vienna decoder problem. During the write up of this report Craven (2003) has shown how 4th order circular harmonics (as used in Ambisonics) can be used to create a frequency independent panning law for the five speaker ITU array, and this report also shows how the Tabu search algorithm can be used to optimise these decoders further. A new method is then demonstrated using the Tabu search algorithm coupled with lateralisation parameters extracted from a binaural simulation of the Ambisonic system to be optimised (as these are the parameters that the Vienna system is approximating). This method can then be altered to take into account head rotations directly which have been shown as an important psychoacoustic parameter in the localisation of a sound source (Spikofski et al., 2001) and is also shown to be useful in differentiating between decoders optimised using the Tabu search form of the Vienna optimisations as no objective measure had been suggested. Optimisations for both Binaural and Transaural reproductions are then discussed so as to maximise the performance of generic HRTF data (i.e. not individualised) using inverse filtering methods, and a technique is shown that minimises the amount of frequency dependant regularisation needed when calculating cross-talk cancellation filters.

UNIVERSITY OF DERBY AN INVESTIGATION INTO THE REAL-TIME MANIPULATION AND CONTROL OF THREEDIMENSIONAL SOUND FIELDS Bruce Wiggins Doctor of Philosophy 2004 Contents - ii - Contents Contents Contents ......................................................................................................... iii List of Figures ................................................................................................ vii List of Equations ...........................................................................................xvii List of Tables ................................................................................................ xix Acknowledgements........................................................................................ xx Abstract......................................................................................................... xxi Chapter 1 - Introduction ...................................................................................1 1.1 Background .......................................................................................1 1.2 The Research Problem......................................................................4 1.3 Aims and Objectives of the Research................................................6 1.4 Structure of this Report......................................................................8 Chapter 2 - Psychoacoustics and Spatial Sound Perception ...........................9 2.1 Introduction........................................................................................9 2.2 Lateralisation .....................................................................................9 2.2.1 Testing the Lateralisation Parameters. .....................................12 2.2.2 Analysis of the Lateralisation Parameters ................................19 2.3 Sound Localisation ..........................................................................24 2.3.1 Room Localisation ....................................................................24 2.3.2 Height and Distance Perception ...............................................29 2.4 Summary .........................................................................................32 Chapter 3 - Surround Sound Systems ...........................................................34 3.1 Introduction......................................................................................34 3.2 Historic Review of Surround Sound Techniques and Theory ..........34 3.2.1 Bell Labs’ Early Spaced Microphone Technique ......................34 3.2.2 Blumlein’s Binaural Reproduction System................................36 3.2.3 Stereo Spaced Microphone Techniques...................................41 3.2.4 Pan-potted Stereo ....................................................................43 3.2.5 Enhanced Stereo......................................................................45 3.2.6 Dolby Stereo.............................................................................46 3.2.7 Quadraphonics .........................................................................48 3.3 Review of Present Surround Sound Techniques .............................49 3.3.1 Ambisonics ...............................................................................49 - iii - Contents 3.3.2 Wavefield Synthesis .................................................................72 3.3.3 Vector Based Amplitude Panning .............................................75 3.3.4 Two Channel, Binaural, Surround Sound 
.................................78 3.3.5 Transaural Surround Sound .....................................................83 3.3.6 Ambiophonics ...........................................................................94 3.4 Summary .........................................................................................96 Chapter 4 - Development of a Hierarchical Surround Sound Format.............99 4.1 Introduction......................................................................................99 4.2 Description of System......................................................................99 4.3 B-Format to Binaural Reproduction ...............................................103 4.4 Conclusions ...................................................................................110 Chapter 5 - Surround Sound Optimisation Techniques................................111 5.1 Introduction....................................................................................111 5.2 The Analysis of Multi-channel Sound Reproduction Algorithms Using HRTF Data ...............................................................................................113 5.2.1 The Analysis of Surround Sound Systems .............................113 5.2.2 Analysis Using HRTF Data.....................................................113 5.2.3 Listening Tests .......................................................................114 5.2.4 HRTF Simulation ....................................................................118 5.2.5 Impulse Response Analysis ...................................................120 5.2.6 Summary ................................................................................127 5.3 Optimisation of the Ambisonics system .........................................133 5.3.1 Introduction.............................................................................133 5.3.2 Irregular Ambisonic Decoding ................................................135 5.3.3 Decoder system......................................................................138 5.3.4 The Heuristic Search Methods ...............................................142 5.3.5 Validation of the Energy and Velocity Vector..........................151 5.3.6 HRTF Decoding Technique – Low Frequency........................157 5.3.7 HRTF Decoding Technique – High Frequency.......................159 5.3.8 Listening Test .........................................................................161 5.4 The Optimisation of Binaural and Transaural Surround Sound Systems. ..................................................................................................180 5.4.1 Introduction.............................................................................180 5.4.2 Inverse Filtering ......................................................................180 - iv - Contents 5.4.3 Inverse Filtering of H.R.T.F. Data ...........................................186 5.4.4 Inverse Filtering of H.R.T.F. Data to Improve Crosstalk Cancellation Filters. ..............................................................................189 5.5 Conclusions ...................................................................................196 5.5.1 Ambisonic Optimisations Using Heuristic Search Methods ....197 5.5.2 Further Work for Ambisonic Decoder Optimisation.................199 5.5.3 Binaural and Transaural Optimisations Using Inverse Filtering. ... 
...............................................................................................200 5.5.4 Further Work for Binaural and Transaural Optimisations........200 5.5.5 Conversion of Ambisonics to Binaural to Transaural Reproduction ........................................................................................201 Chapter 6 - Implementation of a Hierarchical Surround Sound System.......203 6.1 Introduction....................................................................................203 6.1.1 Digital Signal Processing Platform .........................................204 6.1.2 Host Signal Processing Platform (home computer). ...............206 6.1.3 Hybrid System ........................................................................207 6.2 Hierarchical Surround Sound System – Implementation ...............208 6.2.1 System To Be Implemented. ..................................................208 6.2.2 Fast Convolution ....................................................................210 6.2.3 Decoding Algorithms ..............................................................214 6.3 Implementation - Platform Specifics ..............................................226 6.4 Example Application ......................................................................234 6.5 Conclusions ...................................................................................242 Chapter 7 - Conclusions ..............................................................................244 7.1 Introduction....................................................................................244 7.2 Ambisonics Algorithm development...............................................245 7.2.1 7.3 Further Work ..........................................................................251 Binaural and Transaural Algorithm Development ..........................251 7.3.1 B-format to Binaural Conversion ............................................251 7.3.2 Binaural to Two Speaker Transaural ......................................253 7.3.3 Binaural to Four Speaker Transaural......................................253 7.3.4 Further Work ..........................................................................256 Chapter 8 - References................................................................................258 Chapter 9 - Appendix ...................................................................................269 -v- Contents 9.1 Matlab Code ..................................................................................269 9.1.1 Matlab Code Used to Show Phase differences created in Blumlein’s Stereo..................................................................................269 9.1.2 Matlab Code Used to Demonstrate Simple Blumlein Spatial Equalisation ..........................................................................................270 9.1.3 Matlab Code Used To Plot Spherical Harmonics ...................271 9.1.4 Code used to plot A-format capsule responses (in 2D) using oversampling. .......................................................................................273 9.1.5 Code Used to Create Free Field Crosstalk Cancellation Filters ... 
...............................................................................................275 9.1.6 Code Used to Create Crosstalk Cancellation Filters Using HRTF Data and Inverse Filtering Techniques .................................................276 9.1.7 Matlab Code Used in FreqDip Function for the Generation of Crosstalk Cancellation Filters ...............................................................278 9.1.8 9.2 Matlab Code Used To Generate Inverse Filters .....................279 Windows C++ Code.......................................................................281 9.2.1 Code Used for Heuristic Ambisonic Decoder Optimisations...281 9.2.2 Windows C++ Code used in the Real-Time Audio System Software 309 - vi - Contents List of Figures Figure 1.1 Speaker configuration developed in the multi-channel surround sound laboratory.........................................................................7 Figure 2.1 The two paths, ‘a’ and ‘b’, that sound must travel from a source at 450 to the left of a listener, to arrive at the ears. ...................10 Figure 2.2 Increasing I.L.D. with frequency and angle of incidence...........12 Figure 2.3 Simulink models showing tests for the three localisation cues provided by I.L.D. and I.T.D......................................................13 Figure 2.4 Relative phase shift for a 1 kHz sine wave delayed by 0.00025 and 0.00125 seconds ...............................................................15 Figure 2.5 An 8 kHz tone with a low frequency attack envelope ...............16 Figure 2.6 Cone of Confusion – Sources with same I.L.D. and I.T.D. are shown as grey circles. ..............................................................16 Figure 2.7 The Pinna .................................................................................18 Figure 2.8 Frequency and phase response at the right ear when subjected to an impulse at 00,450 and 900 to the right of the listener. .......19 Figure 2.9 The relationship between source incidence angle, frequency and amplitude difference between the two ears. .............................20 Figure 2.10 Relationship between source incidence angle, frequency and the phase difference between the two ears. .............................21 Figure 2.11 Relationship between source incidence angle, frequency and the time difference (in samples) between the two ears.............22 Figure 2.12 Minimum audible angle between successive tones as a function of frequency and position of source (data taken from Gulick (1989))......................................................................................23 Figure 2.13 Simple example of a source listened to in a room. Direct, four 1st order reflections and one 2nd order reflection shown (horizontal only). .......................................................................25 Figure 2.14 Impulse response of an acoustically treated listening room. ....26 Figure 2.15 Binaural impulse response from a source at 300 to the left of the listener. Dotted lines indicate some discrete reflections arriving at left ear. .................................................................................28 - vii - Contents Figure 2.16 Relationship between source elevation angle, frequency and the amplitude at an ear of a listener (source is at an azimuth of 00). .................................................................................................30 Figure 2.17 A graph showing the direct sound and early reflections of two sources in a room. 
....................................................................31 Figure 2.18 A near and far source impinging on the head...........................32 Figure 3.1 Graphical depiction of early Bell Labs experiments. Infinite number of microphones and speakers model...........................35 Figure 3.2 Early Bell Labs experiment. Limited number of microphones and speakers model. ................................................................36 Figure 3.3 Standard “stereo triangle” with the speakers at +/-300 to the listener (x denotes the crosstalk path). .....................................37 Figure 3.4 Low frequency simulation of a source recorded in Blumlein Stereo and replayed over a pair of loudspeakers. The source is to the left of centre....................................................................38 Figure 3.5 Polar pickup patterns for Blumlein Stereo technique ................39 Figure 3.6 Graph showing the pick up patterns of the left speaker’s feed after spatial equalisation...........................................................40 Figure 3.7 ORTF near-coincident microphone technique. .........................42 Figure 3.8 Typical Decca Tree microphone arrangement (using omnidirectional capsules).................................................................43 Figure 3.9 A stereo panning law based on Blumlein stereo.......................44 Figure 3.10 Simplified block diagram of the Dolby Stereo encode/decode process.....................................................................................48 Figure 3.11 Plot of microphone responses derived from two figure of eight microphones.............................................................................51 Figure 3.12 The four microphone pickup patterns needed to record first order Ambisonics (note, red represents in-phase, and blue represents out-of-phase pickup). ..............................................52 Figure 3.13 Graphical representation of the variable polar patterns available using first order Ambisonics (in 2 dimensions, in this case). ....54 Figure 3.14 Velocity and Energy Vector plot of an eight-speaker array using virtual cardioids (low and high frequency directivity of d=1). ....57 - viii - Contents Figure 3.15 Virtual microphone responses that maximise the energy and velocity vector responses for an eight speaker rig (shown at 00 and 1800 for clarity). .................................................................58 Figure 3.16 Velocity and Energy Vector plot of an eight speaker Ambisonic decode using the low and high frequency polar patterns shown in Figure 3.16. ..........................................................................58 Figure 3.17 Energy and velocity vector analysis of an irregular speaker decode optimised by Gerzon & Barton (1992)..........................60 Figure 3.18 Four microphone capsules in a tetrahedral arrangement. ........61 Figure 3.19 B-Format spherical harmonics derived from the four cardioid capsules of an A-format microphone (assuming perfect coincidence). Red represents in-phase and blue represents outof-phase pickup. .......................................................................62 Figure 3.20 Simulated frequency responses of a two-dimensional, multicapsule A-format to B-format processing using a capsule spacing radius of 1.2cm............................................................63 Figure 3.21 Effect of B-format zoom parameter on W, X, and Y signals. 
....65 Figure 3.22 Four different decodes of a point source polar patterns of 1st, 2nd, 3rd & 4th order systems (using virtual cardioid pattern as a 1st order reference and equal weightings of each order). Calculated using formula based on equation (3.4), using an azimuth of 1800 and an elevation of 00 and a directivity factor (d) of 1...............67 Figure 3.23 An infinite speaker decoding of a 1st, 2nd, 3rd & 4th order Ambisonic source at 1800. The decoder’s virtual microphone pattern for each order is shown in Figure 3.22. ........................68 Figure 3.24 Graph of the speaker outputs for a 1st and 2nd order signal, using four speakers (last point is a repeat of the first, i.e. 00/3600) and a source position of 1800. .........................................................69 Figure 3.25 Energy and Velocity Vector Analysis of a 4th Order Ambisonic decoder for use with the ITU irregular speaker array, as proposed by Craven (2003)......................................................70 Figure 3.26 Virtual microphone patterns used for the irregular Ambisonic decoder as shown in Figure 3.25. ............................................70 - ix - Contents Figure 3.27 The effect that the angle of radiation has on the synthesis of a plane wave using Wavefield Synthesis.....................................74 Figure 3.28 Graphical representation of the V.B.A.P. algorithm. .................76 Figure 3.29 Simulation of a V.B.A.P. decode. Red squares – speakers, Blue pentagram – Source, Red lines – speaker gains......................77 Figure 3.30 Pair of HRTFs taken from a KEMAR dummy head from an angle of 450 to the left and a distance of 1 metre from the centre of the head. Green – Left Ear, Blue – Right Ear. ...............................79 Figure 3.31 Example of a binaural synthesis problem. ................................81 Figure 3.32 Graphical representation of the crosstalk cancellation problem. .................................................................................................84 Figure 3.33 Simulation of Figure 3.32 using the left loudspeaker to cancel the first sound arriving at Mic2..................................................85 Figure 3.34 Example of free-field crosstalk cancellation filters and an example implementation block diagram. ..................................85 Figure 3.35 Frequency response of free field crosstalk cancellation filters..86 Figure 3.36 The Crosstalk cancellation problem, with responses shown. ...86 Figure 3.37 Transfer functions c1 and c2 for a speaker pair placed at +/- 300, and their corresponding crosstalk cancelling filters. .................88 Figure 3.38 Frequency response of the two speaker to ear transfer functions (c1 & c2) and the two crosstalk cancellation filters (h1 & h2) given in figure 3.31.............................................................................89 Figure 3.39 The regularisation parameter (left figure) and its effect on the frequency response of the crosstalk cancellation filters h1 & h2 (right figure). .............................................................................90 Figure 3.40 Simulation of crosstalk cancellation using a unit pulse from the left channel both with and without frequency dependent regularisation applied (as in Figure 3.39). ................................91 Figure 3.41 Example of the effect of changing the angular separation of a pair of speakers used for crosstalk cancellation. 
......................93 Figure 3.42 Example of the effect of changing the angular separation of the speakers using HRTF data.......................................................94 Figure 3.43 Example Ambiophonics layout. ................................................95 Figure 4.1 Ideal surround sound encoding/decoding scheme. ................100 -x- Contents Figure 4.2 Standard speaker layout as specified in the ITU standard. ....101 Figure 4.3 Virtual Microphone Configuration for Simple Ambisonic Decoding ................................................................................103 Figure 4.4 Horizontal B-Format to binaural conversion process. .............103 Figure 4.5 Example W, X and Y HRTFs Assuming a Symmetrical Room. ...............................................................................................105 Figure 4.6 Ideal, 4-Speaker, Ambisonic Layout .......................................106 Figure 4.7 Ideal Double Crosstalk Cancellation Speaker Layout.............106 Figure 4.8 Double Crosstalk Cancellation System...................................107 Figure 4.9 Perceived localisation hemisphere when replaying stereophonic material over a crosstalk cancelled speaker pair....................107 Figure 4.10 Example of Anechoic and non-Anechoic HRTFs at a position of 300 from the listener. ..............................................................108 Figure 4.11 Spherical Harmonics up to the 2nd Order................................109 Figure 4.12 2D polar graph showing an example of a 1st and 2nd order virtual pickup pattern (00 point source decoded to a 360 speaker array). ...............................................................................................110 Figure 5.1 Speaker Arrangement of Multi-channel Sound Research Lab. ...............................................................................................115 Figure 5.2 Screen shot of two Simulink models used in the listening tests. ...............................................................................................116 Figure 5.3 Screen shot of listening test GUI. ...........................................116 Figure 5.4 Filters used for listening test signals.......................................117 Figure 5.5 Figure indicating the layout of the listening room given to the testees as a guide to estimating source position. ...................118 Figure 5.6 The Ambisonic to binaural conversion process. .....................119 Figure 5.7 Example left and right HRTFs for a real and virtual source (1st Order Ambisonics) at 450 clockwise from centre front. 
...........120 Figure 5.8 The average amplitude and time differences between the ears for low, mid and high frequency ranges..................................123 Figure 5.9 The difference in pinna amplitude filtering of a real source and 1st and 2nd order Ambisonics (eight speaker) when compared to a real source...........................................................................124 - xi - Contents Figure 5.10 Listening Test results and estimated source localisation for 1st Order Ambisonics ...................................................................128 Figure 5.11 Listening Test results and estimated source localisation for 2nd Order Ambisonics ...................................................................129 Figure 5.12 Listening Test results and estimated source localisation for five speaker 1st Order Ambisonics ................................................130 Figure 5.13 Listening test results for Amplitude Panned five speaker system. ...............................................................................................131 Figure 5.14 Average Time and Frequency Localisation Estimate for 1st Order Ambisonics. ............................................................................131 Figure 5.15 Average Time and Frequency Localisation Estimate for 2nd Order Ambisonics. ..................................................................132 Figure 5.16 Average Time and Frequency Localisation Estimate for five speaker 1st Order Ambisonics. ...............................................132 Figure 5.17 RT60 Measurement of the University of Derby’s multi-channel sound research laboratory, shown in 1/3 octave bands...........133 Figure 5.18 Recommended loudspeaker layout, as specified by the ITU..134 Figure 5.19 Virtual microphone polar plots that bring the vector lengths in Equation (5.3) as close to unity as possible (as shown in Figure 5.21), for a 1st order, eight speaker rig...................................136 Figure 5.20 Velocity and energy localisation vectors. Magnitude plotted over 3600 and angle plotted at five discrete values. Inner circle represents energy vector, outer circle represents velocity vector. Using virtual cardioids. ...........................................................136 Figure 5.21 Velocity and energy localisation vectors. Magnitude plotted over 3600 and angle plotted at five discrete values. Inner circle represents energy vector, outer circle represents velocity vector. Using virtual patterns from Figure 5.19...................................137 Figure 5.22 Energy and velocity vector response of an ITU 5-speaker system, using virtual cardioids................................................138 Figure 5.23 Polar patterns of the four B-format signals used in 1st order Ambisonics. ............................................................................139 Figure 5.24 A simple Tabu Search application. .........................................146 - xii - Contents Figure 5.25 Graphical plot of the Gerzon/Barton coefficients published in the Vienna paper and the Wiggins coefficients derived using a Tabu search algorithm. Encoded/decoded direction angles shown are 00, 12.250, 22.50, 450, 900, 1350 and 1800. .............................146 Figure 5.26 The transition of the eight coefficients in a typical low frequency Tabu search run (2000 iterations). 
The square markers indicate the three most accurate sets of decoder coefficients (low fitness)....................................................................................147 Figure 5.27 The virtual microphone patterns obtained from the three optimum solutions indicated by the squares in figure 5.25. ....147 Figure 5.28 Energy and Velocity Vector Analysis of a 4th Order Ambisonic decoder for use with the ITU irregular speaker array, as proposed by Craven (2003)....................................................148 Figure 5.29 Virtual microphone patterns used for the irregular Ambisonic decoder as shown in Figure 5.28. ..........................................148 Figure 5.30 Screenshot of the 4th Order Ambisonic Decoder Optimisation using a Tabu Search Algorithm application. ...........................149 Figure 5.31 Graph showing polar pattern and velocity/energy vector analysis of a 4th order decoder optimised for the 5 speaker ITU array using a tabu search algorithm. ...............................................150 Figure 5.32 A decoder optimised for the ITU speaker standard. ...............151 Figure 5.33 A graph showing real sources and high and low frequency decoded sources time and level differences...........................153 Figure 5.34 Graphical representation of two low/high frequency Ambisonic decoders.................................................................................154 Figure 5.35 HRTF simulation of two sets of decoder.................................155 Figure 5.36 HRTF Simulation of head movement using two sets of decoder coefficients. ............................................................................156 Figure 5.37 Comparison between best velocity vector (top) and a HRTF set of coefficients (bottom). ..........................................................158 Figure 5.38 Polar and velocity vector analysis of decoder derived from HRTF data. .......................................................................................158 Figure 5.39 Decoder 1 – SP451 Default Settings ......................................164 Figure 5.40 Decoder 2 – HRTF Optimised Decoder..................................165 - xiii - Contents Figure 5.41 Decoder 3 – HRTF Optimised Decoder..................................165 Figure 5.42 Decoder 4 – Velocity and Energy Vector Optimised Decoder 167 Figure 5.43 Decoder 5 - Velocity and Energy Vector Optimised Decoder .167 Figure 5.44 Comparison of low frequency phase and high frequency amplitude differences between the ears of a centrally seated listener using the 5 Ambisonic decoders detailed above. .......168 Figure 5.45 Graphs showing absolute error of a decoder’s output (phase and level differences between the ears of a centrally seated listener) compared to a real source, with respect to head movement. .169 Figure 5.46 Graph Showing the Average Time and Amplitude Difference Error with Respect to A Centrally Seated Listener’s Head Orientation..............................................................................170 Figure 5.47 Sheet given to listening test candidates to indicate direction and size of sound source...............................................................172 Figure 5.48 Screenshot of Matlab Listening Test GUI. ..............................173 Figure 5.49 Graphs showing the results of the panned source part of the listening test for each subject. ‘Actual’ shows the correct position, D1 – D5 represent decoders 1 – 5. 
..........................174 Figure 5.50 Graph showing mean absolute perceived localisation error with mean source size, against decoder number...........................175 Figure 5.51 Graph showing the mean, absolute, localisation error per decoder taking all three subjects into account........................176 Figure 5.52 Inverse filtering using the equation shown in Equation (5.13) 182 Figure 5.53 Frequency response of the original and inverse filters using an 8192 point F.F.T.. ...................................................................183 Figure 5.54 Typical envelope of an inverse filter and the envelope of the filter shown in Figure 5.52. .............................................................183 Figure 5.55 Two F.I.R. filters containing identical samples, but the left filter’s envelope has been transformed. ............................................184 Figure 5.56 The convolution of the original filter and its inverse (both transformed and non-transformed versions from Figure 5.55). ...............................................................................................185 Figure 5.57 A frequency and time domain response of the filter after a hamming window has been applied........................................186 - xiv - Contents Figure 5.58 The response of a 1024-point windowed inverse filter............186 Figure 5.59 The 1024-point inverse filters using a 900 and a 00, near ear, HRTF response as the signal to be inverted. .........................187 Figure 5.60 Comparison of a HRTF data set (near ear only) before (right hand side) and after (left hand side) inverse filtering has been applied, using the 900, near ear, response as the reference. .188 Figure 5.61 System to be matrix inverted. .................................................189 Figure 5.62 HRTF responses for the ipsilateral and contralateral ear responses to the system shown in Figure 5.61. .....................190 Figure 5.63 Crosstalk cancellation filters derived using the near and far ear responses from Figure 5.62....................................................190 Figure 5.64 Inverse filter response using the near ear H.R.T.F. from Figure 5.62. .......................................................................................191 Figure 5.65 Near and far ear responses after the application of the inverse filter shown in Figure 5.64 (frequency domain scaling identical to that of Figure 5.62). ................................................................192 Figure 5.66 Crosstalk cancellation filters derived using the near and far ear responses from Figure 5.65 (frequency domain scaling identical to that of Figure 5.63). ............................................................192 Figure 5.67 Filter representing inverse of h1, in both the time and frequency domain....................................................................................193 Figure 5.68 Crosstalk cancellation filters after convolution with the inverse filter shown in figure 5.51........................................................194 Figure 5.69 The optimised crosstalk cancellation system..........................194 Figure 5.70 Left Ear (blue) and Right Ear (red) responses to a single impulse injected into the left channel of double and single inverted cross talk cancellation systems........................................................195 Figure 5.71 Left Ear (blue) and Right Ear (red) responses to a single impulse injected into the left channel of a crosstalk cancellation system. 
...............................................................................................196 Figure 6.1 A Von Neumann Architecture. ................................................205 Figure 6.2 Diagram of a Harvard Architecture .........................................206 Figure 6.3 The hierarchical surround sound system to be implemented. 209 Figure 6.4 Time domain convolution function. .........................................211 - xv - Contents Figure 6.5 Fast convolution algorithm......................................................212 Figure 6.6 The regular array decoding problem.......................................216 Figure 6.7 A two-speaker transaural reproduction system. .....................223 Figure 6.8 Bank of HRTFs used for a four-channel binauralisation of an Ambisonic signal.....................................................................224 Figure 6.9 Block digram of a four-speaker crosstalk cancellation system. ...............................................................................................224 Figure 6.10 Waveform audio block diagram – Wave out. ..........................227 Figure 6.11 Simulink model used to measure inter-device delays.............231 Figure 6.12 Graphical plot of the output from 4 audio devices using the Waveform audio API...............................................................232 Figure 6.13 Block Diagram of Generic ‘pass-through’ Audio Template Class ...............................................................................................233 Figure 6.14 Screen shot of simple audio processing application GUI........240 Figure 6.15 Block diagram of the applications audio processing function. 241 Figure 7.1 Recommended loudspeaker layout, as specified by the ITU..246 Figure 7.2 Low frequency (in red) and high frequency (in green) analysis of an optimised Ambisonic decode for the ITU five speaker layout. ...............................................................................................246 Figure 7.3 A graph showing a real source’s (in red) and a low frequency decoded source’s (in blue) inter aural time differences. .........247 Figure 7.4 HRTF Simulation of head movement using two sets of decoder coefficients. ............................................................................248 Figure 7.5 Energy and Velocity vector analysis of two 4th order, frequency independent decoders for an ITU five speaker array. 
The proposed Tabu search’s optimal performance with respect to low frequency vector length and high/low frequency matching of source position can be seen clearly........................................250 Figure 7.6 B-format HRTF filters used for conversion from B-format to binaural decoder.....................................................................252 Figure 7.7 B-format HRTF filters used for conversion from B-format to binaural decoder.....................................................................254 - xvi - Contents List of Equations (2.1) Diameter of a sphere comparable to the human head..............10 (2.2) The frequency corresponding to the wavelength equal to the diameter of the head.................................................................11 (3.1) Stereo, pairwise panning equations..........................................43 (3.2) Equation showing how to calculate a figure of eight response pointing in any direction from two perpendicular figure of eight responses.................................................................................50 (3.3) B-Format encoding equations ..................................................52 (3.4) B-Format decoding equations with alterable pattern parameter .................................................................................................53 (3.5) Example B-Format encode.......................................................54 (3.6) Example B-Format decode to a single speaker........................55 (3.7) Velocity and Energy Vector Equations .....................................56 (3.8) A-Format to B-Format conversion equations............................62 (3.9) B-format rotation and zoom equations......................................65 (3.10) 2nd order spherical harmonics...................................................66 (3.11) Calculation of the spatial aliasing frequency for wavefield synthesis ..................................................................................73 (3.12) Cross-talk cancellation problem ...............................................87 (3.13) Derivation of cross-talk cancellation filters................................87 (3.14) The cross-talk cancellation filters, h1 and h2 .............................88 (3.15) The cross-talk cancellation filters, h1 and h2 with the frequency dependent regularisation parameter.........................................89 (4.1) Ambisonic decoding equation.................................................103 (4.2) Calculation of Ambisonic to binaural HRTF filters ..................104 (4.3) Ambisonic to binaural decoding equations - general case......104 (4.4) Ambisonic to binaural decoding equations - left/right symmetry assumed.................................................................................104 (5.1) Calculation of Ambisonic to binaural HRTF filters ..................119 (5.2) Ambisonic encoding equations ...............................................120 (5.3) Energy and velocity vector equations .....................................135 (5.4) Horizontal only Ambisonic encoding equations ......................139 - xvii - Contents (5.5) Gerzon's forward dominance equation ...................................140 (5.6) Generalised five speaker Ambisonic decoder ........................140 (5.7) Magnitude, angle and perceived volume equations for the velocity and energy vectors ....................................................141 (5.8) Volume, magnitude and angle fitness equations ....................144 (5.9) Low and 
high frequency fitness equations..............................144 (5.10) HRTF fitness equation............................................................157 (5.11) HRTF head turning fitness equation .......................................160 (5.12) The inverse filtering problem - time domain............................181 (5.13) The inverse filtering problem - frequency domain...................181 (6.1) Convolution in the time domain ..............................................210 (6.2) Equation relating length of FFT, length of impulse response and length of signal for an overlap-add fast convolution function ..213 (6.3) Ambisonic decoding equation.................................................218 (6.4) Second order Ambisonic to Binaural decoding equation ........222 - xviii - Contents List of Tables Table 2.1 Table indicating a narrow band source’s perceived position in the median plane, irrespective of actual source position. .........18 Table 3.1 SoundField Microphone Capsule Orientation ...........................61 Table 5.1 Table showing decoder preference when listening to a reverberant, pre-recorded piece of music...............................177 Table 6.1 Matlab code used for the fast convolution of two wave files. ..214 Table 6.2 Ambi Structure........................................................................215 Table 6.3 Function used to calculate a speaker's Cartesian co-ordinates which are used in the Ambisonic decoding equations. ...........217 Table 6.4 Ambisonic cross-over function................................................219 Table 6.5 Function used to decode an Ambisonic signal to a regular array. ...............................................................................................220 Table 6.6 Function used to decode an Ambisonic signal to an irregular array. ......................................................................................221 Table 6.7 Function used to decode a horizontal only, 1st order, Ambisonic signal to headphones. ............................................................223 Table 6.8 Code used for 2 and 4 speaker transaural reproduction.........225 Table 6.9 WaveHDR structure................................................................228 Table 6.10 WaveformatEX structure. .......................................................229 Table 6.11 Initialisation code used to set up and start an output wave device. ....................................................................................230 Table 6.12 Closing a Wave Device ..........................................................232 Table 6.13 Example implementation of the ProcessAudio function for a Stereo Application. .................................................................234 Table 6.14 C++ Class definition file for an allpass based shelving equalisation unit. ....................................................................235 Table 6.15 C++ class definition file for the fast convolution algorithm ......236 Table 6.16 Constructor for the FastFilter class.........................................237 Table 6.17 Matlab function used to write FIR coefficients to a file............237 Table 6.18 C++ code used to read in the FIR coefficients from a file. ......238 Table 6.19 Decoding switch statement in the example application ..........242 - xix - Contents Acknowledgements Many thanks must go to my supervisors, Iain Paterson-Stephens and Richard Thorn for their greatly appreciated input throughout this research. 
I thank Stuart Berry and Val Lowndes for introducing me to the world of heuristic search methods and Peter Lennox, Peter Schillebeeckx and Howard Stratton who have been constant sources of opinion, knowledge and wisdom on various areas of my project. Finally, I must thank Rachel, for keeping my feet on the ground, keeping me sane, and putting up with the, seemingly, everlasting write-up period. - xx - Contents Abstract This thesis describes a system that can be used for the decoding of a three dimensional audio recording over headphones or two, or more, speakers. A literature review of psychoacoustics and a review (both historical and current) of surround sound systems is carried out. The need for a system which is platform independent is discussed, and the proposal for a system based on an amalgamation of Ambisonics, binaural and transaural reproduction schemes is given. In order for this system to function optimally, each of the three systems rely on providing the listener with the relevant psychoacoustic cues. The conversion from a five speaker ITU array to binaural decode is well documented but pair-wise panning algorithms will not produce the correct lateralisation parameters at the ears of a centrally seated listener. Although Ambisonics has been well researched, no one has, as yet, produced a psychoacoustically optimised decoder for the standard irregular five speaker array as specified by the ITU as the original theory, as proposed by Gerzon and Barton (1992) was produced (known as a Vienna decoder), and example solutions given, before the standard had been decided on. In this work, the original work by Gerzon and Barton (1992) is analysed, and shown to be suboptimal, showing a high/low frequency decoder mismatch due to the method of solving the set of non-linear simultaneous equations. A method, based on the Tabu search algorithm, is applied to the Vienna decoder problem and is shown to provide superior results to those shown by Gerzon and Barton (1992) and is capable of producing multiple solutions to the Vienna decoder problem. During the write up of this report Craven (2003) has shown how 4th order circular harmonics (as used in Ambisonics) can be used to create a frequency independent panning law for the five speaker ITU array, and this report also shows how the Tabu search algorithm can be used to optimise these decoders further. A new method is then demonstrated using the Tabu search algorithm coupled with lateralisation parameters extracted from a binaural simulation of the Ambisonic system to be optimised (as these are the parameters that the Vienna system is approximating). This method can then be altered to take into account head rotations directly which have been shown as an important psychoacoustic parameter in the localisation of a - xxi - Contents sound source (Spikofski et al., 2001) and is also shown to be useful in differentiating between decoders optimised using the Tabu search form of the Vienna optimisations as no objective measure had been suggested. Optimisations for both Binaural and Transaural reproductions are then discussed so as to maximise the performance of generic HRTF data (i.e. not individualised) using inverse filtering methods, and a technique is shown that minimises the amount of frequency dependant regularisation needed when calculating cross-talk cancellation filters. 
- xxii - Chapter 1 Chapter 1 - Introduction 1.1 Background Surround sound has quickly become a consumer ‘must have’ in the audio world, due, in the main part, to the advent of the Digital Versatile Disk, Super Audio CD technology and the computer gaming industry. It is generally taken to mean a system that creates a sound field that surrounds the listener. Or, to be put another way, it is trying to recreate the illusion of the ‘you are there’ experience. This is in contrast to the stereophonic reproduction that has been the standard for many years, which creates a ‘they are here’ illusion (Glasgal, 2003c). The direction that the surround sound industry has taken, when referring to format and speaker layout, has depended, to some extent, on which system the technology has been used for. As already mentioned, two main streams of surround sound development are taking place: • The DVD Video/Audio industry can be broadly categorised as follows: o These systems are predicated around audio produced for a standard 5 speaker (plus sub-woofer, or low frequency effects channel) layout as described in the ITU standard ‘ITU-R BS.7751’. o Few DVD titles deviate from this standard as most DVD players are hardware based and, therefore, of a fixed specification. o Some processors are available with virtual speaker surround (see crosstalk cancelled systems) and virtual headphone surround systems. o Recording/panning techniques are not fixed and many different systems are utilised including: ̇ ̇ ̇ Coincident recording techniques Spaced recording techniques Pair-wise panned using amplitude or time or a combination of the two. -1- Chapter 1 • The computer gaming industry can be broadly categorised as follows: o Number and layout of speakers are dictated by the soundcard installed in the computer. Typically: ̇ ̇ ̇ ̇ ̇ Two speakers – variable angular spacing. Four speakers – based on a Quadraphonic arrangement or the ITU five speaker layout without a centre speaker. Five speakers – based on ITU-R BS.755-1 layout. Six speakers – same as above but with a rear centre speaker. Seven speakers – typically, same as five speakers with additional speakers at +/- 900. o Two channel systems rely on binaural synthesis (using head related transfer functions) and/or crosstalk cancellation principles using: ̇ ̇ Binaural/Transaural simulation of a more than two speaker system. HRTF simulation of sources. o More than two speaker systems generally use pair-wise panning algorithms in order to place sounds. Both of the above viewpoints overlap, mainly due to the need for computers to be compatible with DVD audio/video. However, the computer gaming industry has started moving away from five speaker surround with 7.1 surround sound being the standard on most new PCs. The systems described above all co-exist, often being driven by the same carrier signals. For example, all surround sound output on a DVD is derived from the 5.1 speaker feeds that are stored on the actual disk. So headphone surround processing can be carried out by simulating the 5.1 speaker array binaurally, and two speaker virtual surround systems can be constructed by playing a crosstalk cancelled version of the binaural simulation. 
In the same fashion many crosstalk cancelled and binaural decodes provided by the audio hardware in computers is driven by the signal that would normally be sent to the 4, 5, 6 or 7 speaker array with other cards choosing to process the sound -2- Chapter 1 effects and music directly with individual pairs of head related transfer functions (see CMedia, N.D. and Sibbald, A., 2000 for examples of these two systems). The above situation sounds ideal from a consumer choice, point of view, but there are a number of issues with the systems, described above, as a whole. The conversion from multi-speaker to binaural/transaural (crosstalk cancelled) system assumes that a, normally pair-wise panned, speaker presentation will provide the ear/brain system with the correct cues needed for the listener to experience a truly immersive, psychoacoustically correct aural presentation. However, the five speaker layout, as specified by the ITU, was not meant to deliver this, and is predicated on a stable 600 frontal image, with the surround speakers used only for effects and ambience information. This is, of course, not a big issue for films, but as computer games and audio only presentations are based around the same, five speaker, layout, this is not ideal. Computer games often do not want to give a preference to any particular direction with the surround sound audio experience hopefully providing extra cues to the game player in order to give them a more accurate auditory ‘picture’ of the environment around them and music presentations often want to try and simulate the space that the music was recorded in as accurately as possible, which will include material from the rear and sides of the listener. A less obvious problem with PC based audio systems is that although the final encoding and decoding of the material is handled by the audio hardware (as most sound sources for games are panned in real-time), and so it is the hardware that dictates what speaker/headphone setup to use, inserting prerecorded surround sound music can be problematic as no speaker layout can be assumed. Conversely for the DVD systems, the playing of music is, obviously, well catered for but only as long as it is presented in the right format. Converting from a 5.1 to a 7.1 representation, for example, is not necessarily a trivial matter and so recordings designed for a 5.1 ITU setup cannot easily use extra speakers in order to improve the performance of the recording. This is especially true as no panning method can be assumed after the discrete speaker feeds have been derived and stored on the DVD. -3- Chapter 1 The problems described above can be summarised as follows: • 5.1 DVD recordings cannot be easily ‘upmixed’ as: o No panning/recording method can be assumed. o Pair-wise panned material cannot be upmixed to another pairwise panned presentation (upmixing will always increase the • number of speakers active when panning a single source). Computer gaming systems produce surround sound material ‘on-thefly’ and so pre-recorded multi-channel music/material can be difficult to • add as no presentation format can be assumed. Both systems, when using virtual speaker technology (i.e. headphone or cross talk cancelled simulation of a multi-speaker representation) are predicated on the original speaker presentation delivering the correct psychoacoustical cues to the listener. 
This is not the case for the standard, pair-wise panned method which relies on this crosstalk to present the listener with the correct psychoacoustic cues (see Blumlein’s Binaural Sound in chapter 3.2.2). These problems stem, to some extent, from the lack of separation between the encoding and the decoding of the material, with the encode/decode process generally taken as a whole. That is the signals that are stored, used and listened to are always derived from speaker feeds. This then leads to the problem of pre-recorded pieces needing to either be re-mixed and/or rerecorded if the number or layout of the speakers is to be changed. 1.2 The Research Problem How can the encoding be separated from the decoding in audio systems, and how can this system be decoded in a psychoacoustically aware manner for multiple speakers or headphone listening? While the transfer from multiple speaker systems to binaural or crosstalk cancelled systems is well documented, the actual encoding of the material must be carried out in such a way so as to ensure: -4- Chapter 1 • • Synthesised or recorded material can be replayed over different speaker arrays. The decoded signal should be based on the psychoacoustical parameters with which humans hear sound thus allowing a more meaningful conversion from a multi-speaker signal to binaural or crosstalk cancelled decode. The second point would be best catered for using a binaural recording or synthesis technique. However, upmixing from a two channel binaural recording to a multi-speaker presentation can not be carried out in a satisfactory way, with the decoder for such a system needing to mimic all of the localisation features of the ear/brain system in order to correctly separate and pan sounds into the correct position. For this reason, it is a carrier signal based on a multi-speaker presentation format that will be chosen for this system. Many people sought to develop a multi-speaker sound reproduction system as early as the 1900s, with work by Bell Labs trying to create a truly ‘they are here’ experience using arrays of loudspeakers in front of the listener. Perhaps they were also striving for a true volume solution which, to a large extent, has still not been achieved (except in a system based on Bells’ early work called wavefield synthesis, see Chapter 3). However, it was Alan Blumlein’s system, binaural sound, that was to form the basis for the system we now know as stereo, although it was to be in a slightly simplified form than the system that Blumlein first proposed. The first surround sound standard was the Quadraphonic format. This system was not successful due to the fact that it was based on the simplified stereo technique and so had some reproduction problems coupled with Quadraphonics having a number of competing standards. At around the same time a number of researchers, including Michael Gerzon, recognised these problems and proposed a system that took more from Blumlein’s original idea. This new system was called Ambisonics, but due to the failings of the Quadraphonic system, interest in this new surround sound format was poor. -5- Chapter 1 Some of the benefits of the Ambisonics system are now starting to be realised and it is this system that was used as the basis of this investigation. 1.3 Aims and Objectives of the Research • • Develop a flexible multi-channel sound listening room capable of the auditioning of several speaker positioning formats simultaneously. 
Using the Matlab/Simulink software combined with a PC and a multichannel sound card, create a surround sound toolbox enabling a flexible and quick development environment used to encode/decode • surround sound systems in real-time. Carry out an investigation into the Ambisonic surround sound system looking at the optimisation of the system for different speaker configurations, specifically concentrating on the ITU standard five • speaker layout. Carry out an investigation into Binaural and Transaural sound reproduction and how the conversion from Ambisonics to these • systems can be achieved. Propose a hybrid system consisting of a separate encode and decode process, making it possible to create a three-dimensional sound piece • which can be reproduced over headphones or two or more speakers. Create a real-time implementation of this system. At the beginning of this project, a multi-channel sound lab was setup so different speaker layouts and decoding schemes could be auditioned. The lab contained speakers placed in a number of configurations so that experiments and testing would be quick to set up, and flexible. It consisted of a total of fourteen speakers as shown in Figure 1.1. Three main speaker system configurations have been incorporated into this array: • • • A regularly spaced, eight speaker, array A standard ITU-R BS.755-1 five speaker array A closely spaced front pair of speakers -6- Chapter 1 600 800 800 1400 Figure 1.1 Speaker configuration developed in the multi-channel surround sound laboratory The system, therefore, allows the main forms of multi-speaker surround formats to be accessed simultaneously. A standard Intel® Pentium® III (Intel Corporation, 2003) based PC was used in combination with a Soundscape® Mixtreme® (Sydec, 2003) sixteen channel sound card. This extremely versatile setup was originally used with the Matlab®/Simulink® program (The MathWorks, 2003), which was possible after rewriting Simulinks ‘To’ and ‘From Wave Device’ blocks to handle up to sixteen channels of audio simultaneously and in real-time (the blocks that ship with the product can handle a maximum of two channels of audio, see Chapter 5). This system was then superseded by custom C++ programs written for the Microsoft Windows operating system (Microsoft Corporation, 2003), as greater CPU efficiency could be utilised this way, which is an issue for filtering and other CPU intensive tasks. Using both Matlab/Simulink and dedicated C++ coded software it was possible to both test, evaluate and apply optimisation techniques to the decoding of an Ambisonics based surround sound system and to this end the aim of this project was to develop a surround sound format, based on the hierarchical nature of B-format, the signal carrier of Ambisonics, that was able -7- Chapter 1 to be decoded to headphones and speakers, and investigate and optimise these systems using head related transfer functions. 1.4 Structure of this Report This report is split into three main sections as listed below: 1. Literature review and discussion: a. Chapter 2 – Psychoacoustics and Spatial Sound Perception b. Chapter 3 – Surround Sound Systems 2. Surround sound format proposal and system development research a. Chapter 4 – Hierarchical Surround Sound Format b. Chapter 5 – Surround Sound Optimisation Techniques 3. System implementation and signal processing research a. Chapter 6 – Implementation of a Hierarchical Surround Sound System. 
Sections two and three detail the actual research and development aspects of the project with section one giving a general background into surround sound and the psychoacoustic mechanisms that are used to analyse sounds heard in the real world (that is, detailing the systems that must be fooled in order to create a realistic, immersive surround sound experience). -8- Chapter 2 Chapter 2 - Psychoacoustics and Spatial Sound Perception 2.1 Introduction This Chapter contains a literature review and discussion of the current thinking and research in the area of psychoacoustics and spatial sound perception. This background research is important as it is impossible to investigate and evaluate surround systems objectively without first knowing how our brain processes sound, as it is this perceptual system that we are aiming to fool. This is particularly true when optimisations are to be sought after, as unless it is known what parameters we are optimising for, only subjective and empirically derived alterations can be used to improve a system’s performance or, in the same way, help us explain why a system is not performing as we would have hoped. 2.2 Lateralisation One of the most important physical rudiments of the human hearing system is that it possesses two separate data collection points, that is, we have two ears. Many experiments have been conducted throughout history (for a comprehensive reference on these experiments see Blauert (1997) and Gulick et al. (1989)) concluding that the fact that we hear through two audio receivers at different positions on the head is important in the localisation of the sounds (although our monaural hearing capabilities are not to be underestimated). If we observe the situation shown in Figure 2.1 where a sound source (speaker) is located in an off-centre position, then there are a number of differences between the signals arriving at the two ears, after travelling paths ‘a’ and ‘b’. The two most obvious differences are: • • The distances travelled by the sounds arriving at each ear are different (as the source is closer to the left ear). The path to the further away of the two ears (‘b’) has the added obstacle of the head. -9- Chapter 2 These two separate phenomena will manifest themselves at the ears of the listener in the form of time and level differences between the two incoming signals and, when simulated correctly over headphones, will result in an effect called lateralisation. Lateralisation is the sensation of a source being inside the listener’s head. That is, the source has a direction, but the distance of the listener to the source is perceived as very small. If we take the speed of sound as 342 ms-1 and the diameter of an average human head (based on a sphere, with the ears at 900 and 2700 of that sphere) as 18 cm, then the maximum path difference between the left and right ears (d) is half the circumference of that sphere, given by equation (2.1). d = Πr = Π × 0.09 = 0.28274m (2.1) where d is half the circumference of a sphere r is the radius of the sphere b a Figure 2.1 The two paths, ‘a’ and ‘b’, that sound must travel from a source at 450 to the left of a listener, to arrive at the ears. Taking the maximum circumferential distance between the ears as 28 cm, as shown in equation (2.1), this translates into a maximum time difference between the sounds arriving at the two ears of 0.83 ms. This time difference is termed the Interaural Time Difference (I.T.D.) 
and is one of the cues used by the ear/brain system to calculate the position of sound sources.

The level difference between the ears, termed I.L.D. (Interaural Level Difference), is not substantially due to the extra distance travelled by the sound; the main difference is caused by the shadowing effect of the head. So, unlike the I.T.D., which will be the same for all frequencies (although the resulting phase difference is not constant), the I.L.D. is frequency dependent due to diffraction. As a simple rule of thumb, any sound with a wavelength larger than the diameter of the head will tend to be diffracted around it, and any sound with a wavelength shorter than the diameter of the head will tend to be attenuated, causing a low pass filtering effect at the far ear. The frequency whose wavelength is equal to the diameter of the head is given in equation (2.2).

f = 342 / 0.18 ≈ 1.9 kHz (2.2)

where 0.18 m is the diameter of the head and 342 ms-1 is the speed of sound.

There is, however, a smooth transition from low to high frequencies, meaning that the attenuation occurring at the opposite ear increases with frequency. A graph showing an approximation of the I.L.D. of a sphere, up to 2 kHz, is shown in Figure 2.2 (equations taken from Duda (1993)). This figure shows the I.L.D. increasing with both frequency and angle of incidence.

Figure 2.2 Increasing I.L.D. (in dB) with frequency and angle of incidence, for source positions between 0° and 90°.

2.2.1 Testing the Lateralisation Parameters

A few simple experiments can be set up in order to test the working frequency ranges, and the effectiveness, of the sound source position cues described above. The two cues presented, I.L.D. and I.T.D., actually result in three potential auditory cues. They are:

• An amplitude difference between the two ears (I.L.D.).
• A time difference between the two ears (I.T.D.).
• A phase difference between the sounds at the ears (I.T.D.).

Simulink models that can be used to test these three localisation parameters, under headphone listening conditions, are shown in Figure 2.3. Several data sources are utilised in these models (also shown in Figure 2.3) and are discussed below.

Figure 2.3 Simulink models showing tests for the three localisation cues provided by I.L.D. and I.T.D., together with the gain arrays 'g1' and 'g2' and a 1 second pulsed signal array.

Arrays 'g1' and 'g2' are a rectified sine wave and a cosine wave, and are used to represent an amplitude gain, a phase change or a time delay. In order for the various lateralisation cues to be tested, the models must be configured as described below:

• Level Difference – If 'g1' is taken as the gain of the left channel, and a rectified version of 'g2' is used for the gain of the right channel, then the sound source is level panned smoothly between the two ears, and this is what the listener perceives, at any given frequency.
• Phase Difference – A sine wave of any phase can be constructed using a mixture of a sine wave at 0° and a sine wave at 90° (a cosine). So applying the gains 'g1' and 'g2' to a sine and a cosine wave which are then summed will create a sine wave that changes phase from -π/2 to π/2. At low frequencies this test will tend to pan the sound between the two ears. However, as the frequency increases the phase difference between the signals has less effect.
For example, at 500 Hz the sounds lateralises very noticeably. At 1000 Hz only a very slight source movement is - 13 - Chapter 2 perceivable and at 1500 Hz, although a slight change in timbre can be • noted, the source does not change position. Time Difference – For this test a broad band random noise source was used so that the sound contained many transients. The source was also pulsed on and off (see Figure 2.3) so that as the time delay between the two ears changed the pulsed source would not move significantly while it was sounding. The time delay was achieved using two fractional delay lines, using ‘g1’ and a rectified ‘g2’ scaled to give a delay between the ears varying from –0.8 ms to 0.8 ms (+/- 35 samples at 44.1 kHz), which roughly represents a source deflection of –900 to 900 from straight ahead. Slight localisation differences seem to be present up to a higher frequency than with phase differences, but most of this cue’s usefulness seems to disappear after around 1000 Hz. It is clear that the phase and time differences between the two ears of the listener are related, but they should be considered as two separate cues to the position of a sound source. For example if we take a 1 kHz sine wave, the period is equal to 0.001 seconds. If this sound is delayed by 0.00025 seconds, the resulting phase shift will be 900. However, if the sine wave is delayed by 0.00125 seconds the phase shift seen will be 4500. As the ears are not able to detect absolute phase shift they must compare the two ears’ signals, which will still give a phase shift of 900 as shown in Figure 2.4. It is also apparent from Figure 2.4 that if a sound of a different frequency is used, the same time delay will give a different phase difference between the ears. As frequency increases the phase change due to path differences between the ears becomes greater, but once the phase difference between the two ears is more than 1800 then the brain can no longer decide which signal is lagging and the cue becomes ambiguous (Gulick, 1989). - 14 - Chapter 2 1 0 -1 0 50 100 150 200 250 300 0 50 100 150 200 250 300 0 50 100 150 200 Sample Number (fs=44100Hz) 250 300 1 0 -1 1 0 -1 Figure 2.4 Relative phase shift for a 1 kHz sine wave delayed by 0.00025 and 0.00125 seconds The difference between time and phase cues is significant, as they will need to be utilised by the ear/brain system for different localisation situations. If we take the situation where the listener is trying to localise a continuous sine wave tone, the time of arrival cues seen in Figure 2.4 will be not be present and only phase and amplitude cues can be used (it should also be noted that a pure sine wave tone can be a difficult source to locate anyway). Alternatively, if the listener is trying to localise a repeating ‘clicking’ sound, then the time of arrival cues due to source position will be present. Also, it has been found that, even for higher frequency sounds, time/phase cues can still be utilised with regards to the envelope of the sound arriving at the head, as shown in Figure 2.5. - 15 - Chapter 2 Figure 2.5 An 8 kHz tone with a low frequency attack envelope Using a combination of the cues described above, a good indication of the angle of incidence of an incoming sound can be constructed, but the sound will be perceived as inside the head with the illusion of sounds coming from behind the listener being more difficult to achieve. The reason for this is the so-called ‘Cone of Confusion’ (Begault, 2000). 
Any sound that is coming from a cone of directions (shown as grey circles in Figure 2.6) will have the same level, phase and time differences associated with it making the actual position of the source potentially ambiguous. Figure 2.6 Cone of Confusion – Sources with same I.L.D. and I.T.D. are shown as grey circles. - 16 - Chapter 2 So how does the ear/brain system cope with this problem? There are two other mechanisms that help to resolve the position of a sound source. They are: • • Head movement. Angular dependent filtering. Head movement can be utilised by the ear/brain system to help strengthen auditory cues. For example if a source is at 450 to the left (where 00 represents straight ahead), then turning the head towards the left would decrease the I.L.D. and I.T.D. between the ears and turning the head to the right would increase the I.L.D. and I.T.D. between the ears. If the source were located behind the listener the opposite would be true, giving the ear/brain system an indication of whether the source is in the front or the back hemi-sphere. In a similar fashion, up/down differentiation can also be resolved with a tilting movement of the head. This is a very important cue in the resolution of front/back reversals perfectly demonstrated by an experiment carried out by Spikofski et al. (2001). In this experiment a subject listens to sounds recorded using a fixed dummy head with small microphones placed in its ears. Although reported lateralisation was generally good, many front back reversals are present for some listeners. The same experiment is then conducted with a head tracker placed on the listeners head which controls the angle that the dummy head is facing (that is, the recording dummy head mirrors the movements of the listener in real-time). In this situation virtually no front/back reversals are perceived by the listener. Optimising binaural presentations by utilising the head turning parameter is well documented, however, its consideration in the optimisation of speaker based systems has not been attempted, but will be investigated in this project. Angular dependant filtering is another cue used by the ear/brain system, and is the only angular direction cue that can be utilised monaurally, that is, sound localisation can be achieved by using just one ear (Gulick, 1989). The filtering results from the body and features of the listener, the most prominent of which is the effect of the pinnae, the cartilage and skin surrounding the opening to the ear canal, as shown in Figure 2.7. - 17 - Chapter 2 Figure 2.7 The Pinna The pinna acts as a very complex filtering device, imprinting a unique phase and frequency response onto pressure waves impinging on the head, depending on the angular direction of this pressure wave. This implies that sound sources made up of certain bands are more likely to be heard as emanating from a particular location due to the natural peaks and troughs that are apparent in the HRTF data due to pinna filtering, and this has been shown in experiments using narrow-band sound sources. For example, Zwicker & Fastl (1999) found that narrow band sources of certain frequencies are located at certain positions on the median plane, irrespective of the position of the sound source as indicated in Table 2.1. Table 2.1 Narrow band source Perceived position (in centre frequency the median plane) 300Hz, 3kHz Front 8kHz Above 1kHz, 10kHz Behind Table indicating a narrow band source’s perceived position in the median plane, irrespective of actual source position. 
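Before moving on to the pinna filtering measurements themselves, the head movement cue described above can be illustrated numerically. The short Matlab sketch below is illustrative only: it uses the crude 'sine law' approximation of the I.T.D. (with an assumed effective ear spacing of 0.18 m) rather than measured HRTF data, and shows that a front source at 45° and its mirrored rear position share the same I.T.D. until the head is turned, at which point the two I.T.D.s change in opposite directions.

% Sketch: how a head turn resolves front/back confusion.
% Crude 'sine law' I.T.D. model, ITD = (d/c)*sin(azimuth) - assumed values.
c   = 342;                                    % speed of sound (m/s)
d   = 0.18;                                   % effective ear spacing (m)
itd = @(az) (d/c) * sin(az*pi/180) * 1000;    % I.T.D. in ms

front = 45;       % source 45 degrees to the left, in front of the listener
rear  = 135;      % the mirrored position behind the listener

fprintf('Head still : front %.3f ms, rear %.3f ms\n', itd(front), itd(rear));

turn = 10;        % listener turns their head 10 degrees towards the source
fprintf('Head turned: front %.3f ms, rear %.3f ms\n', ...
        itd(front - turn), itd(rear - turn));
% The front source's I.T.D. falls while the rear source's rises, so the
% direction of the change tells the ear/brain system which hemisphere
% the source lies in.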
The example filters shown in Figure 2.8 (taken from HRTF data measured at the MIT media lab by Gardner & Martin (1994)) shows the phase/magnitude response at the right ear due to a source at 00,450 and 900 to the right of the listener. Interestingly, if the complex filtering from a moving source is heard from a stationary sound source using both ears (e.g. if an in-ear recording is replayed over speakers), the listener will perceive timbral changes in the heard material. - 18 - Chapter 2 Figure 2.8 Frequency and phase response at the right ear when subjected to an impulse at 00,450 and 900 to the right of the listener. Using the points discussed above, a number of simple assumptions can be made about the human auditory system. • Amplitude differences between the ears will only be present, and therefore can only be utilised, in sounds greater than some frequency • (that is, when the sound no longer diffracts around the head). Phase cues can only be totally unambiguous if the sound is delayed by less than half the corresponding wavelength of the sound’s frequency (i.e. low frequencies), but may still be utilised together with other cues (such as I.L.D.) up to a delay corresponding to a full wavelength (a • phase change of 3600) (Gulick, W.L. et al., 1989). Time cues can only be useful when transients are apparent in the sound source, e.g. at the beginning of a sound. 2.2.2 Analysis of the Lateralisation Parameters In order to quantify what frequency ranges the lateralisation parameters are valid for, an example ‘head’ is now used. This head was measured at the M.I.T. media lab in the U.S.A. and the impulse response measurements for a great many source positions were taken in an anechoic room. The resulting impulse responses are measures of the Head Related Transfer Function - 19 - Chapter 2 (which result in Head Related Impulse Responses, but are still generally known as HRTFs) due to the dummy head. As the tests were carried out in an anechoic chamber, they are a very good measure of how we lateralise sound sources, that is, the minimum of auditory cues are present as no information regarding the space in which the recordings are made is apparent. Figure 2.9 shows a plot representing the amplitude difference (z-axis) measured between the two ears for frequencies between 0 Hz and 20 kHz (xaxis) and source angles between 0 and 1800 (y-axis). The red colouring indicates that there is no amplitude difference between the ears, and is most apparent at low frequencies, which is expected as the head does not obstruct the sound wave for these, longer, wavelengths. The amplitude differences in the signals arriving at the ears can be seen to occur at around 700 Hz and then can be seen to increase after this point. This graph shows a significant difference between modelling the head as a sphere (as in Figure 2.2) and measuring the non-spherical dummy head with amplitude peaks and troughs becoming very evident. Figure 2.9 The relationship between source incidence angle, frequency and amplitude difference between the two ears. - 20 - Chapter 2 Figure 2.10 shows a very similar graph, but this time, representing the phase difference between the two ears. The colour scaling now goes from –1800 to 1800 (although the scale on this graph is in radians, from -3.142 to 3.142). A clear pattern can be observed with the limit of unambiguous phase differences between the ears following a crescent pattern with no phase differences occurring when sounds are directly in front of or behind the listener. 
The largest phase difference between the ears is to be found from a source at an angle of 900 to the listener where unambiguous phase differences occur up to approximately 800 Hz. The anomalies apparent in this figure (negative phase difference) could be due to one of two effects: • • Pinna, head and torso filtering. Errors in the measured HRTF data. Of the two possible effects, the second is most likely, as the compact set of HRTFs were used (see Gardner & Martin (1994)). The compact set of HRTFs has been processed in such a way as to cut down their size and inverse filtered in a crude manner. Given these limitations, a good trend in terms of the phase difference between the two ears is still evident. Figure 2.10 Relationship between source incidence angle, frequency and the phase difference between the two ears. - 21 - Chapter 2 Figure 2.11 shows the time of arrival difference between the two ears, and also indicates why interaural time difference and interaural phase difference should be considered as two separate auditory cues. Usable time differences are apparent for every frequency of sound as long as the source is at an offcentre position, and this is the only lateralisation cue for which this is the case. This graph also shows that filtering due to the pinna, head and torso create differing time delays which are dependent upon the frequency of the incoming sound. If some form of time delay filtering were not present (i.e. no head/torso or pinna filtering), the time difference for each source angle of incidence would be constant across the audio spectrum. Figure 2.11 Relationship between source incidence angle, frequency and the time difference (in samples) between the two ears. The three graphs shown in Figure 2.9, Figure 2.10 and Figure 2.11 usefully provide an insight into possible reasons for a number of psychoacoustic phenomena. If we consider the minimum audible angle (M.A.A.) for sounds of differing frequencies, and source azimuths (where the M.A.A. is taken as the angle a source has to be displaced by, until a perceived change in location is noted), it can be seen that the source’s M.A.A. gets larger the more off-centre the source’s original position (see Figure 2.12 and Gulick (1989)). This is - 22 - Chapter 2 coupled with the M.A.A. increasing for all source positions between the frequencies of 1 kHz and 3 kHz. The question arises; can the M.A.A. effect be explained using the three H.R.T.F. analysis figures given above? Firstly, why would the minimum audible angle be greater the more off-centre the sound source for low frequencies? If the phase difference graph is observed, then it can be seen that the gradient of the change of phase difference with respect to head movement is greatest when a source is directly behind or directly in front of the listener. That is, if the head is rotated 10, then a source directly in front of the listener will create a greater phase change between the two listening conditions when compared to a source that is at an azimuth of 900 implying an increased resolution to the front (and rear) of the listener. 14 Minimum Audible Angle (degrees) 12 0 Degrees 30 Degrees 60 Degrees 10 8 6 4 2 0 Figure 2.12 500 1000 Frequency (Hz) 5000 10000 Minimum audible angle between successive tones as a function of frequency and position of source (data taken from Gulick (1989)). It should also be noted that the M.A.A. worsens between 1 kHz and 3 kHz. 
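The interaural comparisons plotted in Figures 2.9 to 2.11 can be reproduced from any measured HRTF set, which is useful when asking questions such as this one. The fragment below is a rough sketch of how this might be done in Matlab for a single source position, assuming the left and right ear impulse responses have already been loaded into the vectors hL and hR (for example from the MIT data set of Gardner & Martin (1994); the file handling is not shown). The level and phase differences come from the ratio of the two frequency responses, and a broadband time difference estimate from their cross-correlation.

% Interaural level, phase and time differences from one HRIR pair.
% hL and hR are assumed to be the measured left and right ear impulse
% responses (column vectors) for one source position, sampled at fs Hz.
fs  = 44100;
N   = 1024;                            % FFT length
HL  = fft(hL, N);
HR  = fft(hR, N);
f   = (0:N/2-1) * fs / N;              % frequency axis (first half only)

ILD = 20*log10(abs(HL(1:N/2)) ./ abs(HR(1:N/2)));   % level difference (dB)
IPD = angle(HL(1:N/2) ./ HR(1:N/2));                % phase difference, wrapped to +/- pi

% Broadband I.T.D. estimate: lag of the cross-correlation peak.
[xc, lags] = xcorr(hL, hR);
[mx, idx]  = max(abs(xc));
ITDms      = lags(idx) / fs * 1000;    % positive lag => left-ear signal is delayed

plot(f, ILD); xlabel('Frequency (Hz)'); ylabel('I.L.D. (dB)');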
If the interaural amplitude is studied, it can be seen that the difference between the ears starts to become pronounced after approximately 1 kHz and does not become more obvious until higher frequencies. Also, 1 kHz is around the frequency where unambiguous phase cues start to disappear (and more so as the angle of incidence of the source increases). It is this cross-over period between the brain using level and phase cues where the M.A.A. is at its - 23 - Chapter 2 largest. Another interesting result, that can also be seen from Figure 2.12, is that phase cues (used primarily at low frequencies) perform better, on average, than higher frequency cues (pinna filtering and level differences) and it is often mentioned that low frequency, temporal, cues are the more robust cues (for example, Wightman, F.L. and Kistler, D.J., 1992 and Huopaniemi, J. et al, 1999). 2.3 Sound Localisation The term localisation differs from lateralisation in that not only is source direction angle arrived at, but a listener can gain information on the type of location a sound is emanating from and the distance from the source to the listener. Also, information on the size of a sound source as well as which way it may be facing can be gleaned just by listening for a short time. 2.3.1 Room Localisation When walking into an acoustic space for the first time, the brain quickly makes a number of assumptions about the listening environment. It does this using the sound of the room (using any sounds present) and the reaction of the listener inside this room. One example of this is when walking into a cathedral. In this situation one of the first sounds possibly heard will be your own footsteps, and this will soon give the impression that the listener is in a large, enclosed space. This is also the reason that people susceptible to claustrophobia are ill advised to enter an anechoic chamber, as the lack of any reverberation in the room can be very disconcerting, and bring on a claustrophobic reaction. Interestingly, listening to sound sources in an anechoic chamber will often give the impression that the sound source is almost ‘inside the head’ (much like listening to conventional sound sources through headphones). The human brain is not used to listening to sounds without a corresponding location (even large open expanses have sound reflections from the floor), and the only time this will happen is if the source is very close to the head, somebody whispering in your ear, for example, and so the brain decides that any sound without a location is likely to be very close. - 24 - Chapter 2 If we are listening to a sound source in a real location, a large number of reflections may also reach the ears. The first sound that is heard will be the direct sound, as this has the shortest path length (assuming nothing obstructs the source). Then, the first order reflections will be heard. Figure 2.13 shows a simplified example of this (in two dimensions). Here it can clearly be seen that the direct sound has the shortest path length, which implies that this signal has the properties listed below: • The direct sound will be the loudest signal from the source to reach the listener (both due to the extra path length and the fact that some of the • • reflected source’s energy will be absorbed by the reflective surface). The direct sound will be the first signal to reach the ears of the listener. The direct sound may be the only signal that will be encoded (by the head of the listener) in the correct direction. 
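These path lengths, and the delays and relative levels that follow from them, can be sketched numerically with a first-order image-source calculation. The Matlab fragment below is a minimal, horizontal-only sketch: each wall mirrors the source to produce an 'image', and the extra path length of each image gives the arrival delay and (using distance attenuation only, ignoring wall absorption) an approximate level relative to the direct sound. The room size and the source and listener positions used here are assumed purely for illustration and are not those of Figure 2.13.

% Horizontal-only, first-order image-source sketch for a square room.
c   = 342;                    % speed of sound (m/s)
L   = 4;                      % room side length (m) - assumed
src = [2.0, 3.6];             % source position (m)  - assumed
lst = [2.2, 3.2];             % listener position (m) - assumed

% First-order images: the source mirrored in each of the four walls.
images = [ -src(1),        src(2);        % wall at x = 0
           2*L - src(1),   src(2);        % wall at x = L
            src(1),       -src(2);        % wall at y = 0
            src(1),       2*L - src(2) ]; % wall at y = L

direct = norm(src - lst);
fprintf('Direct sound : %.2f m, %.2f ms\n', direct, direct/c*1000);

for k = 1:4
    r = norm(images(k,:) - lst);          % reflected path length
    fprintf('Reflection %d: %.2f m, %.2f ms, about %.1f dB below the direct sound\n', ...
            k, r, r/c*1000, 20*log10(r/direct));
end
% Wall absorption is ignored here - each reflection would be further
% attenuated by the absorption coefficient of the surface it strikes.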
Figure 2.13 Simple example of a source listened to in a room. Direct sound, four 1st order reflections and one 2nd order reflection shown (horizontal only).

In the example shown above (Figure 2.13) a simple square room is shown along with four of the 1st order sound reflections (there are two missing, one from the floor and one from the ceiling) and one 2nd order reflection. These signal paths will also be attenuated due to absorption associated with the walls and the air. Surfaces in a room, and the air itself, possess an absorption coefficient, a numerical grade of acoustic absorption, although a more accurate measure is the frequency dependent absorption coefficient. As reflections in the room build up to higher and higher orders, a diffuse sound field is created, where the individual echoes are more difficult to analyse. Figure 2.14 shows an example impulse response of an actual room. The room has a reasonably short reverberation time as the walls are acoustically treated with foam panels. The graph shows ¼ of a second in time (11025 samples at a 44.1 kHz sampling rate).

Figure 2.14 Impulse response of an acoustically treated listening room, showing the direct sound, the early, discrete reflections and the diffuse tail.

As mentioned at the beginning of this section, the response of a room gives listeners significant insight into the type of environment that they are in. However, Figure 2.14 shows a very complicated response. So how does the brain process this? An extremely important psychoacoustic phenomenon that the ear/brain system uses in this type of situation is the precedence effect (Begault, 2000). The precedence effect is where the brain gives precedence to the sound arriving at the listener first, with the direction of this first sound taken as the angular direction indicator. This sounds very simple, but as we have two ears, the initial sound arrives at the ears twice and, therefore, has two arrival times associated with it. Figure 2.15 shows the equivalent reverberant impulse responses that arrive at both ears. The source used in this graph is at 30° to the left of the listener, very close to the rear wall, and about 1 metre away from the left wall. It can clearly be seen that the source's direct sound arrives at the left ear first, followed, around 11 samples later (0.25 ms at 44.1 kHz), by the right ear. As the ear/brain system uses this time difference to help lateralise the incoming sound, the precedence effect does not function for such short time differences. Under laboratory tests it has been noted that if the same signal is played into each ear of a pair of headphones, but one channel is delayed slightly (Begault, 2000):

• For a delay between 0 and 0.6 ms the source will move from the centre towards the undelayed side of the listener's head.
• Between approximately 0.7 and 35 ms the source will remain at the undelayed side of the listener's head, that is, the precedence effect employs the first arrival to determine the lateralisation.
However, although the source position will not change, the perceived tone, and width of the source will tend to alter as the delay between the left and right ears is increased (note that this implies an effect analogous to comb filtering which occurs during the processing of the sounds • arriving at the two ears by the brain of the listener). Finally, increasing the time delay still further will create the illusion of two separate sources one to the left of the listener and one to the right. The delayed source is perceived as an echo. - 27 - Chapter 2 Left Ear 0.03 0.02 0.01 Amplitude 0 -0.01 -0.02 -0.03 -0.04 -0.05 0 200 400 600 800 1000 Sample Number 1200 1400 1600 1800 2000 1200 1400 1600 1800 2000 Right Ear 0.02 0.01 Amplitude 0 -0.01 -0.02 -0.03 -0.04 Figure 2.15 0 200 400 600 800 1000 Sample Number Binaural impulse response from a source at 300 to the left of the listener. Dotted lines indicate some discrete reflections arriving at left ear. The above points help to explain why the ear/brain system uses the precedence effect. If a source has many early reflections (i.e. the source is in a reverberant room) the ear/brain system needs a way of discriminating between the direct sound and the room’s response to that sound (reflections and diffuse field). The precedence effect is the result of this phenomenon. If we take a source in a room (as given in figure 2.13), assuming the room is a 4 m by 4 m, square room and the initial source is 0.45m away from the listener (that is, the listener and source positions are as in Figure 2.13). The direct sound will take 1.3ms to reach the listener (taking the speed of sound in air as 342 ms-1). The source is at an azimuth of approximately 630 from straight ahead which will lead to a time difference between the ears of around 0.5 ms (using the approximate binaural distance equation from Gulick (1989)). The nearest reflection has a path length of around 3.2 m from the source to the listener which equates to a delay time of 9.4 ms. Because of the precedence effect, the first time delay between the ears will be utilised in the lateralisation of the sound source, and the first discrete echo will not be heard as an echo, but it will not change the perceived position of the sound source either, and - 28 - Chapter 2 will just change the width or timbre of the source. It is this type of processing in the ear/brain system that gives us vital information about the type of space we are situated in. However, as the above points suggest, it may be at the expense of localisation accuracy, with the precedence effect breaking down if the echo is louder than the direct sound, which normally only occurs if the source is out of sight, but a reflection path off a wall is the loudest sound to reach the listener. 2.3.2 Height and Distance Perception Although lateralisation has been discussed, no explanation has yet been given to resolution of sources that appear above or below the listener. As the ears of the listener are both on the same plane, horizontally, the sound reaching each ear will not contain any path differences due to elevation (although, obviously, if a sound is elevated and off-centre, the path differences for the lateral position of the sound will be present), and as there are no path differences the only static cue that can be utilised for an elevated cue is the comb filtering introduced by head and pinna. Figure 2.16 shows a 3-axis graph representing a source straight in front of the listener changing elevation angle from –400 to 900. 
Perhaps the most notable feature of this plot is the pronounced trough that originates at around 7 kHz for an elevation of –400, which goes through a smooth transition to around 11 kHz at an elevation of 600. It is most probably these pinna filtering cues (combined with head movements) that are used to resolve sources that are above and below the listener (Zwicker & Fastl, 1999). Interestingly, it has also been shown in Zwicker & Fastl (1999) that narrow, band-limited sources heard by a listener can have a ‘natural’ direction. For example, an 8 kHz centre frequency is perceived as coming from a location above the head of the subject, whereas a 1 kHz centre frequency is perceived as coming from a location behind the listener. - 29 - Chapter 2 Figure 2.16 Relationship between source elevation angle, frequency and the amplitude at an ear of a listener (source is at an azimuth of 00). In order to assess the apparent distance of a source to the listener, a number of auditory cues are used. The first and most obvious cue is that of amplitude. That is, a source that is near by will be louder than a source that is further away. The relationship between a point source’s amplitude and distance, in the free field, is known as the inverse square law as for each doubling of distance, the amplitude of the source will reduce by a quarter (1/[22]). This is, of course, the simplest case, only holding true for a point source in the free field. Sources are rarely a perfect point source, and rarely heard in the perfect free field (i.e. anechoic circumstances) so, in reality, the amplitude reduction is normally less than the inverse square law suggests. In addition to the pure amplitude changes, distance dependant filtering can be observed, due to air absorption (Savioja, 1999). This will result in a more lowpass filtered signal, the further away the source. The direct to reverberant ratio of the sound will change depending on the source’s distance to the listener with a source close to the listener exhibiting a large amount of direct sound when compared to the reverberation, but a sound further away will have a similar amount of reverberation, but a lower level of direct sound - 30 - Chapter 2 (Begault, 2000). There are two reasons for this. Firstly, the diffuse part of the room’s response (i.e. the part not made up of direct sound or first order reflections) is made up from the sound bouncing off many surfaces, and as such, will be present all through the room. This means that the level of this part of the reverberation is reasonably constant throughout the room. Also, as the source moves away from the listener, the distance ratio between the path length of the direct sound and the early reflections becomes closer to one. This means that the first reflections will arrive closer (in time), and have an amplitude that is more similar to the level of the direct sound. This is shown in Figure 2.17. Figure 2.17 A graph showing the direct sound and early reflections of two sources in a room. Evidence suggests that the reverberation cue is one of the more robust cues in the simulation of distance and has been shown to create the illusion of a sound source outside of the head under headphone listening conditions (McKeag & McGrath, 1997). Of all the cues available to differentiate source distances, the least apparent is that the source’s incidence angle from the listener’s ears will change as the source is moved away from a listener (Gulick, 1989). 
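Before returning to this incidence-angle effect (illustrated in Figure 2.18, below), the first three distance cues can be gathered into a very rough rendering sketch. The Matlab fragment below is illustrative only: the direct gain simply falls with distance relative to a 1 m reference, the single-pole low-pass filter is a crude stand-in for frequency dependent air absorption (its cut-off law is assumed, not taken from Savioja (1999)), and the direct-to-reverberant ratio is controlled by leaving the reverberant level fixed while only the direct gain changes.

% Rough sketch of three distance cues for a mono test signal x at fs Hz.
% All constants are assumed for illustration only.
fs   = 44100;
x    = randn(fs, 1) * 0.1;                 % one second of test noise
dist = 8;                                  % source distance in metres

% 1. Level: direct sound scaled relative to a source at 1 m.
gDirect = 1 / max(dist, 1);

% 2. Air absorption: crude one-pole low-pass whose cut-off falls with
%    distance (purely illustrative values).
fc     = 16000 / max(dist, 1);
alpha  = exp(-2*pi*fc/fs);
direct = filter(1 - alpha, [1 -alpha], x) * gDirect;

% 3. Direct-to-reverberant ratio: the diffuse level stays roughly
%    constant with distance, so only the direct gain has changed.
N      = round(0.5 * fs);
revIR  = randn(N, 1) .* exp(-(0:N-1)' / (0.1*fs));   % decaying noise 'reverb'
reverb = conv(x, revIR) * 0.02;                      % fixed reverberant level
y      = [direct; zeros(length(reverb) - length(direct), 1)] + reverb;
% soundsc(y, fs) can be used to audition the result.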
Figure 2.18 shows two source examples, one source very close to the listener, and one source at - 31 - Chapter 2 infinity. The close source has a greater binaural distance associated with it when compared to the far source. This means that as sources move offcentre, the binaural distance for a far source will not increase as quickly as the binaural distance for a near source (that is, the maximum binaural time difference is less for a far source). Near Source Figure 2.18 Far Source A near and far source impinging on the head. 2.4 Summary In summary, the ear/brain system uses a number of different cues when trying to make sense of the sounds that we hear. These consist of the low level cues that are a result of the position and shape of the ears, such as: • • • Interaural level differences. Interaural phase and time differences. Head/torso and pinna filtering. These cues are used by the ear/brain system to help determine the angular direction of a sound, but are also combined and processed using higher order cognitive functions in order to help make sense of such things as the environment that the sounds have occurred in. It is these higher order functions that give us the sense of the environment that we are in, assigning more information to the object than a directional characteristic alone. Such attributes as distance perception are formed in this way, but other attributes can also be attached in a similar manner, such as the size of an object, or an estimation as to whether the sounding object is facing us, or not (in the case of a person talking, for example). - 32 - Chapter 2 If a successful surround sound system is to be developed then it is apparent that not only should the low-level cues be satisfied, but they should also be as coherent with one another as possible so that the higher order cognitive functions of the ear/brain system can also be satisfied in a useful and meaningful way. - 33 - Chapter 3 Chapter 3 - Surround Sound Systems 3.1 Introduction In this chapter past and current surround sound algorithms and techniques will be discussed starting with a historical account of the first systems, proposed by Bell Labs and Alan Blumlein, how Blumlein’s early system was used as a loose basis for stereo, and then on to the theory and rationale behind the systems that are used presently. The early systems are of importance as most surround sound systems in use today base themselves on the techniques and principles of this early work. In the context of this research, one main system will be decided upon as warranting further research in order to fulfil the research problem detailed in Chapter 1, with the following criteria needing to be met: • • • A hierarchical carrier format must be decided upon. This carrier must be able to be decoded for multi-speaker systems with different speaker arrangements. This decode must be able to provide the listener with the relevant auditory cues which will translate well into a binaural representation. As the above system is to be converted into a binaural and transaural representation, these systems will also be discussed. 3.2 Historic Review of Surround Sound Techniques and Theory Although standard stereo equipment works with two channels, early work was not necessarily fixed to that number, with the stereo arrangement familiar to us today not becoming a standard until the 1950s. Bell labs original work was predicated on many more speakers than this initially (Rumsey & McCormick, 1994) and is the first system described in this section. 
3.2.1 Bell Labs’ Early Spaced Microphone Technique The early aims of the first directional sound reproduction techniques tried at Bell Labs was that of trying to reproduce the sound wave front from a source - 34 - Chapter 3 on a stage (Rumsey & McCormick, 1994). A sound source was placed on a stage in a room; this was then picked up by a large number of closely spaced microphones in a row, in front of the source. These signals were then transmitted to an equal number of similarly spaced loudspeakers (as shown in Figure 3.1). Source Figure 3.1 Graphical depiction of early Bell Labs experiments. Infinite number of microphones and speakers model. The result was an accurate virtual image that did not depend on the position of the listener (within limits) as the wave front approaching the speakers is reproduced well, much like wave-field synthesis (to be discussed later in this chapter). Bell Labs then tried to see if they could recreate the same idea using a smaller number of speakers (Figure 3.2), but this did not perform as accurately (Steinberg, J. & Snow, W., 1934). The main problem with such a setup is that once the many speakers are removed, the three sources (as in the example shown in Figure 3.2) do not reconstruct the wave front correctly. Let us consider the three speaker example shown in Figure 3.2. If the source is recorded by three microphones, as shown, the middle microphone will receive the signal first, followed then by the microphone on the right, and lastly captured by the microphone on the left. These three signals are reproduced by the three loudspeakers. If the listener is placed directly in front of the middle loudspeaker, then the signal from the middle speaker will reach them first, followed by the right and left loudspeakers together. However, as the signal from the source was delayed in reaching the left and right - 35 - Chapter 3 microphones, the delay from each of the left and right speakers is increased even more. Now, if the combined spacing between the microphones and speakers equates to a spacing greater than the diameter of the head, then the time delays reproduced at the ears of the listener will be greater than the maximum interaural time difference of a real source. This will then result in either the precedence effect taking over (i.e. the source will emanate from the centre loudspeaker) or, worse still, echoes will be perceived. This is due to a phenomenon known as ‘spatial aliasing’ and will be described in more detail in section 3.3.2. The spacing of the microphones was necessary as directional microphones had not been invented at this point in time, and only pressure sensitive, omnidirectional microphones were available. Source Figure 3.2 Early Bell Labs experiment. Limited number of microphones and speakers model. 3.2.2 Blumlein’s Binaural Reproduction System While carrying out research into the work of Alan Blumlein, it was soon discovered that there seems to be some confusion, in the audio industry, about certain aspects of his inventions. This seems mainly due to the fact that the names of the various techniques he pioneered have been changed, or misquoted, from the names that he originally gave. Alan Blumlein delivered a patent specification in 1931 (Blumlein, 1931) that both recognised the problems with the Bell Labs approach and defined a method for converting spaced microphone feeds to a signal suitable for loudspeaker reproduction. Blumlein called his invention Binaural Reproduction. 
This recording technique comprised of two omni-directional microphones spaced at a distance similar - 36 - Chapter 3 to that found between the ears, with a round panel baffle in between them. This technique was known to work well for headphone listening, but did not perform as accurately when replayed on loudspeakers. Blumlein realised that for loudspeaker reproduction, phase differences at the speakers (i.e. in the spaced microphone recording) did not reproduce phase differences at the listener’s ears. This was due to the unavoidable crosstalk between the two speakers and the two ears of the listener, as shown in Figure 3.3. x Figure 3.3 x Standard “stereo triangle” with the speakers at +/-300 to the listener (x denotes the crosstalk path). Blumlein had discovered that in order to reproduce phase differences at the ears of a listener, level differences needed to be presented by the speakers. His invention included the description of a ‘Shuffling’ circuit, which is a device that converts the phase differences, present in spaced microphone recordings, to amplitude differences at low frequencies (as at higher frequencies the amplitude differences would already be present due to the sound shadow presented by the disk between the two microphones). If we consider the stereo pair of loudspeakers shown in Figure 3.3, it can be seen that there are two paths from each speaker to each ear of the listener. If the sound that is recorded from the Blumlein stereo pair of microphones is to the left of centre, then the left channel’s signal will be greater in amplitude than the right channel’s signal. Four signals will then be transmitted to the ears: 1. The left speaker to the left ear. - 37 - Chapter 3 2. The left speaker to the right ear. 3. The right speaker to the right ear. 4. The right speaker to the left ear. If we take the case of a low frequency sound (where the interaural phase difference is the major cue), as the paths from the speaker to the contralateral ear is longer than from the speaker to the ipsilateral ear, the signal will appear delayed in time (but not changed in amplitude, due to the wave diffracting around the head, see Chapter 2). The resulting signals that arrive at each ear are shown in Figure 3.4. Figure 3.4 Low frequency simulation of a source recorded in Blumlein Stereo and replayed over a pair of loudspeakers. The source is to the left of centre. It can be clearly seen that low frequency phase cues can be encoded into a stereo signal using just amplitude differences and once the head starts to become a physical obstacle for the reproduced signals (at higher frequencies), a level difference between the ears will also become apparent. It may seem strange that Blumlein used a spaced microphone array to model what seems to be a coincident, amplitude weighted, microphone technique, but only omnidirectional microphones were available at this time. However, less than a year later a directional, ribbon microphone appeared that had a figure of eight polar response. This microphone was better suited to Blumlein’s Binaural Reproduction technique. - 38 - Chapter 3 Figure 3.5 Polar pickup patterns for Blumlein Stereo technique Blumlein’s coincident microphone technique involved the use of two coincident microphones with figure of eight pickup patterns (Blumlein, 1931) (as shown in Figure 3.5) and has a number of advantages over the spaced microphone set-up shown in Figure 3.2. 
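The low frequency behaviour shown in Figure 3.4 can be checked with a short numerical sketch. The Matlab fragment below is a simplification: it ignores head shadowing (reasonable at low frequencies), uses assumed geometry (speakers at ±30°, 2 m away, ears 18 cm apart on a straight line), and drives the two speakers of Figure 3.3 with amplitude-only differences. Summing the four speaker-to-ear paths shows that the resulting ear signals differ mainly in phase, not level, even though only level differences were applied at the speakers.

% Low frequency sketch of Blumlein's argument: amplitude differences at
% the speakers produce phase differences at the ears.  Head shadowing is
% ignored; the geometry and gains are assumed values.
c  = 342;  f = 200;  fs = 44100;
t  = (0:fs-1)'/fs;

spk = 2 * [sind(-30) cosd(-30); sind(30) cosd(30)];  % left & right speakers (x,y), 2 m away
ear = [-0.09 0; 0.09 0];                             % left & right ear positions (m)
g   = [0.8 0.4];                                     % amplitude-only 'panning', left louder

earSig = zeros(fs, 2);
for e = 1:2                                          % sum both speaker paths at each ear
    for s = 1:2
        d = norm(spk(s,:) - ear(e,:));               % path length, speaker s to ear e
        earSig(:,e) = earSig(:,e) + g(s) * sin(2*pi*f*(t - d/c)) / d;
    end
end

resp = exp(-2i*pi*f*t).' * earSig;                   % complex 200 Hz component at each ear
fprintf('Level difference at the ears: %.2f dB\n', 20*log10(abs(resp(1))/abs(resp(2))));
fprintf('Phase difference at the ears: %.1f degrees\n', (angle(resp(1))-angle(resp(2)))*180/pi);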
Firstly, this system is mono compatible, whereas spaced microphone techniques are generally not (if not shuffled). If we again consider the microphone arrangement given in Figure 3.2 then each of the microphones receives the same signal, but changed in delay and amplitude. As there are delays involved, adding up the different channels will produce comb-filtering effects (as different frequencies will cancel out and reinforce each other depending on their wavelengths). However, this will not be the case using Blumlein’s binaural sound as the two microphones will pick up the same signal, differing only in amplitude. A mono signal can be constructed by adding the left and right signals together resulting in a forward facing figure of eight response. The Blumlein approach also has the added advantage that the actual signals that are presented from each loudspeaker can be altered after the recording process. For example, the apparent width of the sound stage can be altered using various mixtures of the sum and difference signals (see spatial equalisation, later in this section). Also, Blumlein based his work on what the ear would hear, and - 39 - Chapter 3 described how a stereo image, made up of amplitude differences alone, could create low frequency phase cues at the ears of a listener (Blumlein, 1931). Blumlein did foresee one problem with his two microphone arrangement, however. This was that the amplitude and phase cues for mid and low frequencies, respectively, would not be in agreement (Blumlein, 1931; Glasgal, 2003a). It was possible to solve this problem using the fact that the signals fed to each speaker could be altered after recording using the sum and difference signals. This technique is now known as spatial equalisation (Gerzon, 1994), and consisted of changing the low frequency signals that fed the left and right speaker by boosting the difference signal and cutting the sum signal by the same amount (usually around 4dB). This has the effect of altering the pickup pattern for the recorded material in a manner shown in Figure 3.6. This technique is still used today, and is a basis for parts of the Lexicon Logic 7™ (Surround Sound Mailing List Archive, 2001) and Ambisonic systems (Gerzon, 1974), the principles of which will be discussed in detail later in this chapter. Figure 3.6 Graph showing the pick up patterns of the left speaker’s feed after spatial equalisation. Blumlein’s binaural reproduction technique is one of the few that truly separates the encoding of the signal from the decoding, which allows for the various post recording steps that can be carried out in a clearly defined, - 40 - Chapter 3 mathematically elegant way. Blumlein was soon employed by the military to work on radar, amongst other things. It may be because of this that Blumlein’s work was not openly recognised for a number of years (Alexander, 1997), but his principles were later used in the formulation of a three dimensional sound system (see Ambisonics, later in this chapter). 3.2.3 Stereo Spaced Microphone Techniques Although the Blumlein Stereo technique has many advantages as a recording format when used for reproduction over loudspeakers, there is another school of thought on this matter. 
This is that such ‘summation localisation theories’ cannot hope to accurately reproduce recorded material as no onset time delay is introduced into the equation, and if this is the case, then although steady state (continuous) signals can be reproduced faithfully, the onset of sounds cannot be reproduced with strong enough cues present to successfully fool the ear/brain system. To this end, a number of spaced microphone techniques were developed that circumvented some of the problems associated with Bell Labs wave front reconstruction technique described above. It must be noted, however, that Blumlein did use spaced microphone techniques to record sound as he was well aware that, for headphone listening, this produced the best results. However, in order to replay these recordings over speakers, to achieve externalisation, a Blumlein shuffler was used, that converted the signals, at low frequencies, to consist of only amplitude differences. If we recall from the Bell Labs system, anomalies occurred because of the potentially large spacing between the microphones that were picking up the sound sources. A more logical approach is a near-coincident microphone technique that will limit the time of arrival errors so that the maximum time difference experienced by a listener will not be perceived as an echo. The ORTF method uses a pair of spaced directional microphones usually spaced by around 17 cm (roughly equal to the diameter of a human head) and at an angle of separation of 1100 (as shown in Figure 3.7). This means that the largest possible time difference between the two channels is comparable with the largest time of arrival difference experienced by a real listener. Directional - 41 - Chapter 3 microphones are used to simulate the shadowing effect of the head. This arrangement is a trade off between spaced and coincident microphone techniques as it has the increased spaciousness of spaced microphones (due to the increased de-correlation of the two signals) but also has reasonably good mono compatibility due to the close proximity of the microphone capsules. 1100 17cm Figure 3.7 ORTF near-coincident microphone technique. Another widely used technique is the Decca Tree (Rumsey and McCormick, 1994). This is a group of three microphones matrixed together to create two loudspeaker feeds. An example of the Decca Tree arrangement is shown in Figure 3.8. In this arrangement the centre microphone feed is sent to both channels, the left microphone feed is sent to the left channel and the right microphone is sent to the right channel. In this way, the differences between the two channels outputs are lessened, giving a more stable central image, and alleviating the ‘hole in the middle’ type effect of a spaced omni technique (the sound always seeming to originate from a specific speaker, as in the Bell Labs set-up). - 42 - 1.5m Chapter 3 2m Figure 3.8 Typical Decca Tree microphone arrangement (using omni-directional capsules). 3.2.4 Pan-potted Stereo The systems that have been discussed thus far have been able to record events for multiple speaker playback, but a system was needed that could be used to artificially place sources in the desired location to create the illusion of a recorded situation. Due to the simplicity of Blumlein stereo, as opposed to spaced microphone techniques, creating a system where individual sources could be artificially positioned was based on amplitude panning (Rumsey and McCormick, 1994). So, a simulation of the Blumlein coincident microphone system was needed. 
As the coincident microphones were figure of eight responses the gains needed to artificially pan a sound from the left speaker to the right speaker are given in equation (3.1). The SPos offset parameter is basically to ‘steer’ the virtual figure-of-eight responses so that a signal at one speaker position will have no gain at the opposite speaker, i.e. a virtual source at the speaker position is an actual source at the speaker position. LeftGain = sin(θ + SPos) RightGain = cos(θ + SPos) (3.1) where: SPos is the absolute angular position of the speaker. θ is the desired source position (from SPos0 to –SPos0). - 43 - Chapter 3 Figure 3.9 A stereo panning law based on Blumlein stereo. This is, however, really a simplification of Blumlein’s stereo technique as his spatial equalisation circuit is generally not used in amplitude stereo panning techniques. Simple amplitude (or pair-wise panning) has now been used for many years, but does suffer from a few problems. It has been shown that the maximum speaker separation that can be successfully utilised is +/- 300 and that sideimaging is very hard to achieve using this method (Glasgal, 2003b). Both of these facts are not necessarily detrimental to simple two-speaker stereo reproduction, but will present a larger problem with surround sound techniques as this would mean a minimum of six equally spaced speakers placed around the speaker would need to be used (based on only the angular spacing assumption). In summary, there are basically two schools of thought when it comes to the recording of live situations for replay over a stereo speaker array (pan-potted, stereo, material is almost always amplitude panned, although artificial reverberation devices often mimic a spaced microphone array rather than a coincident setup). There are those that abide by spaced microphone techniques, reasoning that the time onset cues are very important to the - 44 - Chapter 3 ear/brain system (i.e. the precedence effect) and these are impossible to recreate using a coincident microphone arrangement. On the other side there are those who prefer the mathematical simplicity of coincident microphone arrangements, believing that the potential phase/time misalignment of the signals originating from the speakers in spaced microphone techniques to be detrimental to both the timbre and accuracy of the recorded material. Of course, both are correct to a certain degree and both coincident and spaced techniques can produce very pleasing results. However, the main problem with spaced microphone techniques is that, because potentially unknown time differences will be present between the two channels, the practical reprocessing of new signal feeds becomes much more difficult, while not an issue for two-speaker stereo, will become an issue for larger arrays of speakers. 3.2.5 Enhanced Stereo As can be deduced from both Blumlein and Bell Labs early work, stereo sound (which, incidentally, neither Blumlein or Bell Labs referred to their work as ‘Stereo’ sound) was never limited, theoretically, to just two speakers, as their work was mainly geared towards film sound reproduction that needed to encompass large audiences. Three speakers was a good minimum for such a situation as it was soon found that angular distortion was not too detrimental to the experience, except when it came to dialogue (Blumlein’s original idea of the dialogue following the actors was not widely taken up). 
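Before moving on, the amplitude panning law of equation (3.1) can be illustrated with a few lines of Matlab. The sketch below assumes a speaker pair at ±45°, the spacing for which the law places an exact null at the opposite speaker; Figure 3.9 may be plotted for a different speaker spacing.

% Sketch of the pair-wise amplitude panning law of equation (3.1),
% assuming a speaker pair at +/-45 degrees.
SPos  = 45;                             % speaker position (degrees) - assumed
theta = linspace(-SPos, SPos, 181);     % desired source positions

leftGain  = sind(theta + SPos);
rightGain = cosd(theta + SPos);

plot(theta, leftGain, theta, rightGain);
xlabel('Source position (degrees)'); ylabel('Speaker gain');
legend('Left speaker', 'Right speaker');
% leftGain.^2 + rightGain.^2 = 1 for every source position, i.e. the
% law is constant power, and each speaker's gain falls to zero when the
% source is panned fully to the opposite speaker.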
Simple amplitude (or pair-wise) panning has now been used for many years, but does suffer from a few problems. It has been shown that the maximum speaker separation that can be successfully utilised is ±30°, and that side-imaging is very hard to achieve using this method (Glasgal, 2003b). Neither of these facts is necessarily detrimental to simple two-speaker stereo reproduction, but they present a larger problem for surround sound techniques, as they imply that a minimum of six equally spaced speakers placed around the listener would be needed (based on the angular spacing assumption alone).

In summary, there are basically two schools of thought when it comes to the recording of live situations for replay over a stereo speaker array (pan-potted stereo material is almost always amplitude panned, although artificial reverberation devices often mimic a spaced microphone array rather than a coincident set-up). There are those that abide by spaced microphone techniques, reasoning that the time onset cues are very important to the ear/brain system (i.e. the precedence effect) and that these are impossible to recreate using a coincident microphone arrangement. On the other side there are those who prefer the mathematical simplicity of coincident microphone arrangements, believing the potential phase/time misalignment of the signals originating from the speakers in spaced microphone techniques to be detrimental to both the timbre and the accuracy of the recorded material. Of course, both are correct to a certain degree, and both coincident and spaced techniques can produce very pleasing results. However, the main problem with spaced microphone techniques is that, because potentially unknown time differences will be present between the two channels, the practical reprocessing of new signal feeds becomes much more difficult; while not an issue for two-speaker stereo, this will become an issue for larger arrays of speakers.

3.2.5 Enhanced Stereo

As can be deduced from both Blumlein's and Bell Labs' early work, stereo sound (which, incidentally, neither Blumlein nor Bell Labs referred to as 'stereo' sound) was never limited, theoretically, to just two speakers, as their work was mainly geared towards film sound reproduction that needed to encompass large audiences. Three speakers was a good minimum for such a situation, as it was soon found that angular distortion was not too detrimental to the experience, except when it came to dialogue (Blumlein's original idea of the dialogue following the actors was not widely taken up). Dialogue needed to always sound as if it was coming from the screen and not from the speaker nearest to the listener, which could happen due to the precedence effect. To this end the centre speaker was useful both for fixing dialogue to the centre of the sound stage and for increasing the useful listening area of the room. If a source is panned between two speakers, then a mixture of the time difference and the level difference between the ears will be used to calculate where the sound source is originating from. So, if the listener is in the centre of the two speakers, the time (phase) cues will be constructed from the level differences between the speakers. However, as the listener moves off-centre, the time delay from the two speakers will change the perceived direction of the sound source. This time difference can be counteracted by the amplitude differences between the two speakers, but angular distortion will always occur, and once the listener is much closer to one speaker than the other, all but the hardest panned material will tend to emanate from the closer of the two loudspeakers. Hence, having a centre speaker not only fixed dialogue to the screen, but also lessened the maximum time difference that could be experienced between two speakers at any one time.

3.2.6 Dolby Stereo

Much of the motivation for early surround sound implementations was the cinema, and early multi-channel playback was attempted as early as 1939 in the Disney film Fantasia (Kay et al., 1998). However, although a magnetic multi-channel standard had been available since the 1950s (Dolby Labs, 2002), it was not as robust or long lasting as the mono optical track that was used at the time. Dolby was to change this in 1975, mainly due to the use of their noise reduction techniques that had revolutionised the professional recording industry since the 1960s. The optical system in use at that time had a number of problems associated with it. The standard for the mono track's frequency response was developed in the 1930s which, although making the soundtrack replayable in almost any cinema in the world, reduced the bandwidth to that of a telephone. This response, called the Academy characteristic (Dolby Labs, 2002), also meant that soundtracks were recorded with so much high frequency pre-emphasis that considerable distortion was present in the audio. Dolby's research found that most of these problems were due to the low signal to noise ratio of the optical transmission medium, and in the late 1960s they looked at using their type A noise reduction system in order to improve the response of the sound. Although this worked very well, the noise reduction was not embraced as enthusiastically as it had been by the professional audio industry, and Dolby decided that if it was to make serious ground in the film industry it was the number of channels available, and not solely the sound quality, that would gain success.

In 1975 Dolby made public their film sound breakthrough. Using the same optical technology as was already in place, a new four-channel stereo system was introduced (Dolby Labs, 2002). It worked by storing just two channels of audio which represented the left and right speaker feeds. Then, the sum of these two channels represented the centre channel, and the difference between these two signals represented the surround feed. These principles were updated slightly due to the nature of the storage mechanism and replay situations.
1. Due to the potential phase misalignment and other analogue imperfections in the replay medium, high frequency sounds intended for the centre front speaker could leak back into the surround speakers. For this reason, the surround channel was band limited to around 7 kHz.
2. The surround speakers found in cinemas were often closer to the listener than the front speakers. To make sure that the precedence effect did not pull much of the imaging to the back and sides, the surround feeds were delayed.
3. The surround feed was phase shifted by ±90° prior to being added to the left and right channels. This meant that any material added to the surround channel would be summed, equally out of phase, with the left and right channels (as opposed to in phase with one and out of phase with the other).

A simplified block diagram of the Dolby encode/decode process is shown in Figure 3.10.

Figure 3.10 Simplified block diagram of the Dolby Stereo encode/decode process.

This matrix surround sound technique had a number of points in its favour:
1. It could be distributed using just two channels of audio.
2. It was still an optical, and therefore cheap and robust, recording method.
3. The stereo track was mono compatible.
4. A new curve characteristic was used which, when coupled with Dolby noise reduction, greatly improved the fidelity of cinema sound.

For these reasons, the film industry took to the new Dolby Stereo format.

3.2.7 Quadraphonics

While Dolby was concentrating on film sound reproduction, surround sound techniques were being developed for a wider audience (in the home), and the first of these systems was termed Quadraphonics. Quadraphonics worked on the principle that if the listener wanted to be surrounded by sound then all that would be needed was an extension of the stereo panning law described above, but moving between four loudspeakers. The loudspeakers were set up in a square (usually) and sounds could theoretically be pair-wise panned to any azimuth around the listener. However, it was soon shown that ±45° was too wide a panning angle at the front and back, and that side images could not be formed satisfactorily using pair-wise panning techniques (Gerzon, 1974b & 1985). This, coupled with a number of incompatible formats, the extra expense needed for more speakers/amplifiers and the poor performance of early Quadraphonic matrix decoders, meant that Quadraphonics was not a commercial success.

3.3 Review of Present Surround Sound Techniques

This section describes systems that are still generating work and interest within the surround sound community (not necessarily any newer than some systems mentioned in section 3.2). Systems in use today can be separated into two distinct categories:
1. Systems that define a speaker layout and/or carrier medium but with no reference to how signals are captured and/or recorded for the system. Examples include:
   o Dolby Digital - AC-3 (Dolby Labs, 2004)
   o DTS (Kramer, N.D.)
   o Meridian Lossless Packaging (De Lancie, 1998)
2. Systems that define how material is captured and/or panned for replay over a specified speaker layout.
Examples include:
   o Ambisonics
   o Wavefield Synthesis
   o Ambiophonics

This thesis will concentrate on the second of these categories, systems that define how material is captured and replayed, as the first type of system merely defines a standard to which the second category of system can be applied (for example, DTS and Dolby Digital are both lossy, perceptual codecs used to efficiently store six discrete channels to be played over a standard ITU 5.1 speaker array).

3.3.1 Ambisonics

3.3.1.1 Theory

Ambisonics is a system pioneered mainly by Michael Gerzon and is based on the spherical harmonic decomposition of a sound field (Gerzon, 1974). In order to understand this last statement, the fundamentals of Ambisonics are reviewed below. A definition of what makes a decoder Ambisonic can be found in Gerzon & Barton (1992) and their equivalent U.S. patent regarding Ambisonic decoders for irregular arrays (Gerzon & Barton, 1998), and states (slightly adapted to remove equations):

A decoder or reproduction system is defined to be Ambisonic if, for a centrally seated listening position, it is designed such that:
• The decoded velocity and energy vector angles agree and are substantially unchanged with frequency.
• At low frequencies (below around 400 Hz) the velocity vector magnitude is equal to 1 for all reproduced azimuths.
• At mid/high frequencies (between around 700 Hz and 4 kHz) the energy vector magnitude is substantially maximised across as large a part of the 360° sound stage as possible.

To understand these statements, the underlying concepts of Ambisonics will be explained, leading into a description of the velocity and energy vectors and their relevance to multi-speaker surround sound systems.

Ambisonics is a logical extension of Blumlein's binaural reproduction system (at least, after its conception). Probably one of the most forward looking features of the Blumlein technique is that, when using the two figure-of-eight capsules positioned perpendicular to each other, any other figure-of-eight response can be created (it was this fact that was utilised in Blumlein's spatial equalisation technique). For example, if we take the two figure-of-eight microphones shown in Figure 3.5, then any figure-of-eight microphone response can be constructed using the equations shown in Equation (3.2). Some example microphone responses have been plotted in Figure 3.11.

Sum = (L + R)/√2
Dif = (L − R)/√2
Figure8 = (cos(θ) × Sum) + (sin(θ) × Dif)        (3.2)

where:
θ is the desired response angle.
L is the left facing figure-of-eight microphone.
R is the right facing figure-of-eight microphone.
Figure8 is the reconstructed figure-of-eight microphone.

Figure 3.11 Plot of microphone responses derived from two figure-of-eight microphones.

This approach is very similar to Gerzon's in that the encoding (recording) side is independent of the decoding (reproduction) process. That is, Blumlein stereo could be replayed over one, two or more speakers. Where Gerzon's Ambisonics improves upon this idea is as follows:
• Ambisonics can be used to recreate a full three dimensional sound field (i.e. height information can also be extracted from the Ambisonics system).
• The decoded polar pattern can be changed, that is, you are not fixed to using a figure-of-eight response.

As an example, 1st order Ambisonics can represent a sound field using four signals (collectively known as B-Format).
The W signal is an omni-directional pressure signal that represents the zeroth order component of the sound field, and X, Y and Z are figure-of-eight microphones used to record the particle velocity in each of the three dimensions. Graphical representations of these four B-Format microphone signal responses are given in Figure 3.12.

Figure 3.12 The four microphone pickup patterns needed to record first order Ambisonics (note, red represents in-phase, and blue represents out-of-phase pickup).

Ambisonics is a hierarchical format, so that although four channels are needed for full three-dimensional reproduction, only three channels are needed if the final replay system is a horizontal-only system. The mathematical equations representing the four microphone responses shown in Figure 3.12 are given in equation (3.3). These equations can also be used to encode a sound source, and represent the gains applied to the sound for each channel of the B-format signal.

W = 1/√2
X = cos(θ) × cos(α)
Y = sin(θ) × cos(α)
Z = sin(α)        (3.3)

where:
α = elevation angle of the source.
θ = azimuth angle of the source.

In order to replay a B-Format signal, virtual microphone responses are calculated and fed to each speaker. That is, using the B-format signals, any 1st order microphone response can be obtained pointing in any direction. As mentioned before, this is very much like the theory behind Blumlein stereo, except that the virtual microphone response can be any first order pattern, from omni to figure of eight (and not just a figure of eight). This is possible using the simple equation shown in equation (3.4) (Farina et al., 2001).

gw = √2
gx = cos(θ)cos(α)
gy = sin(θ)cos(α)
gz = sin(α)
S = 0.5 × [(2 − d)gwW + d(gxX + gyY + gzZ)]        (3.4)

where:
W, X, Y & Z are the B-format signals given in equation (3.3).
S = speaker output.
θ = speaker azimuth.
α = speaker elevation.
d = directivity factor (0 to 2).

This gives us the flexibility to alter the polar pattern for each speaker in a decoder. Example patterns are shown in Figure 3.13. To clarify the Ambisonic encode/decode process, let us encode a mono source at an azimuth of 35° and an elevation of 0° and replay this over a six speaker, hexagonal rig.

Figure 3.13 Graphical representation of the variable polar patterns available using first order Ambisonics (in two dimensions, in this case).

From equation (3.3) the B-format (W, X, Y and Z) signals will consist of the amplitude weighted signals shown in equation (3.5).

W = 0.7071 x mono
X = cos(35°)cos(0°) x mono = 0.8192 x mono
Y = sin(35°)cos(0°) x mono = 0.5736 x mono
Z = sin(0°) x mono = 0 x mono        (3.5)

where:
mono is the sound source to be panned.
W, X, Y & Z are the resulting B-Format signals after mono has had the directionally dependent amplitude weightings applied.

Equation (3.4) can now be used to decode this B-format signal. In this case a cardioid response will be used for each speaker's decoded feed, which equates to a directivity factor of 1 (see Figure 3.13). Equation (3.6) shows an example speaker feed for a speaker located at 150° azimuth and 0° elevation.

S = 0.5 x [(1.414 x W) + (-0.866 x X) + (0.5 x Y) + (0 x Z)]        (3.6)

where:
W, X & Y are the encoded B-Format signals.
S = resulting speaker feed.

The polar pattern used for the decoder can be decided either by personal preference, that is, by some form of empirically derived setting, or by a theoretical calculation which obtains the optimum decoding scheme.
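To make the worked example above concrete, the sketch below follows equations (3.3) and (3.4) for the 35° source and a hexagonal rig. The function names, and the assumption that the hexagon has speakers at ±30°, ±90° and ±150°, are illustrative choices rather than values fixed by the text; with that layout, the feed computed for the 150° speaker reproduces equation (3.6).

```python
import numpy as np

def bformat_encode(theta_deg, alpha_deg=0.0):
    """Equation (3.3): first order B-format gains for a source at (azimuth, elevation)."""
    t, a = np.radians(theta_deg), np.radians(alpha_deg)
    w = 1.0 / np.sqrt(2.0)
    return np.array([w, np.cos(t) * np.cos(a), np.sin(t) * np.cos(a), np.sin(a)])  # W, X, Y, Z

def decode_speaker(b, spk_theta_deg, spk_alpha_deg=0.0, d=1.0):
    """Equation (3.4): virtual microphone of directivity d pointed at the speaker."""
    t, a = np.radians(spk_theta_deg), np.radians(spk_alpha_deg)
    gw, gx, gy, gz = np.sqrt(2.0), np.cos(t) * np.cos(a), np.sin(t) * np.cos(a), np.sin(a)
    w, x, y, z = b
    return 0.5 * ((2.0 - d) * gw * w + d * (gx * x + gy * y + gz * z))

if __name__ == "__main__":
    b = bformat_encode(35.0)                       # source at 35 degrees azimuth, 0 elevation
    print("W X Y Z =", np.round(b, 4))             # ~[0.7071, 0.8192, 0.5736, 0.0], as in (3.5)
    hexagon = [30.0, 90.0, 150.0, 210.0, 270.0, 330.0]
    feeds = [decode_speaker(b, az, d=1.0) for az in hexagon]   # cardioid decode (d = 1)
    print("speaker feeds:", np.round(feeds, 4))    # the 150 degree feed matches equation (3.6)
```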
This leads us back to the original statement of what makes a system Ambisonic. Although the B-format input signal is the simplest to use for the Ambisonic system, the term Ambisonics is actually more associated with how a multi-channel decode can be obtained that maximises the accuracy of the reproduced sound field. The three statements given at the beginning of this section mention the energy and velocity vectors associated with a multi-speaker presentation, and it is using these that an Ambisonic decoder can be designed.

3.3.1.2 Psychoacoustic Decoder Design Using the Energy and Velocity Vectors

Although Gerzon defined what makes a system Ambisonic, a number of different decoding types have been suggested, both by Gerzon himself and by others (see Malham, 1998 and Farino & Uglotti, 1998). However, the theory behind Ambisonics is, as already mentioned, similar to Blumlein's original idea that, in order to design a psychoacoustically correct reproduction system, the two lateralisation parameters must be optimised with respect to a centrally seated listener (Gerzon, 1974).

Originally, Gerzon's work concentrated on regularly spaced arrays in two and three dimensions (such as square and cuboid arrays) where the virtual microphone responses chosen for the decoders were based on the system being quantified using the principles of energy and velocity vectors calculated at the centre of the array to be designed. These two vectors have been shown to estimate the perceived localisation and quality of a virtual source when reproduced using multiple speakers (Gerzon, 1992c). The equations used to calculate the energy and velocity vectors are shown in Equation (3.7), with the vector lengths representing a measure of the 'quality' of localisation (a vector length of one indicating a good localisation effect), and the vector angle representing the direction that the sound is perceived to originate from.

P = Σ gi                E = Σ gi²
Vx = Σ gi cos(θi) / P   Ex = Σ gi² cos(θi) / E
Vy = Σ gi sin(θi) / P   Ey = Σ gi² sin(θi) / E        (3.7)

where (all sums are taken over the n speakers):
gi represents the gain of the ith speaker (assumed real for simplicity).
n is the number of speakers.
θi is the angular position of the ith speaker.

These equations use the gains of the speakers in the array when decoding a virtual source from many directions around the unit circle (each speaker's gain can be calculated using the B-Format encoding equations given in Equation (3.3) combined with the decoding equation given in Equation (3.4)). For regular arrays, as long as the virtual microphone responses used to feed the speakers were the same for all, the following points can be observed:
• The reproduced angle would always be the same as the source's encoded angle.
• The energy (E) and pressure (P) values (which indicate the perceived volume of a reproduced source) would always be the same for any reproduced angle.

This meant that when optimising a decoder designed to feed a regular array of speakers:
• Only the length of the velocity and energy vectors had to be optimised (made as close to 1 as possible).
• This could be achieved by simply changing the pattern control (d) in equation (3.4) differently for low (<700 Hz) and high (>700 Hz) frequencies.

As an example, Figure 3.14 shows the velocity and energy vector plot of an eight speaker horizontal Ambisonic array using virtual cardioid responses for each speaker feed.
Figure 3.14 Velocity and energy vector plot of an eight-speaker array using virtual cardioids (low and high frequency directivity of d = 1).

In order to maximise the performance of this decoder according to Gerzon's methods, the low frequency (velocity) vector length should be 1, and the high frequency (energy) vector length should be as close to 1 as possible (it is impossible to realise a virtual source with an energy vector length of one, as more than one speaker is reproducing it). This can be achieved by using a low frequency directivity of d = 1.33 and a high frequency directivity of d = 1.15. This produces the virtual microphone patterns shown in Figure 3.15 (showing the low frequency pattern for a speaker at 0° and the high frequency pattern for a speaker at 180° in order to make each pattern easier to observe) and has a corresponding velocity and energy vector plot as shown in Figure 3.16.

Figure 3.15 Virtual microphone responses that maximise the energy and velocity vector responses for an eight speaker rig (shown at 0° and 180° for clarity).

Figure 3.16 Velocity and energy vector plot of an eight speaker Ambisonic decode using the low and high frequency polar patterns shown in Figure 3.15 (d = 1.33 at low frequencies, d = 1.15 at high frequencies).

As can be seen in Equation (3.4), a change of polar pattern in the decoding equation will result in two gain offsets; one applied to the W signal, and another applied to the X, Y and Z signals. This could be realised, algorithmically, by the use of shelving filters boosting and cutting the W, X, Y and Z signals by the desired amount prior to decoding, which simplified the design of, what was at the time, an analogue decoder.
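The optimisation just described can be checked numerically. The sketch below (function names are illustrative, not from the text) builds the speaker gains from equations (3.3) and (3.4) for a regular eight-speaker rig and evaluates equation (3.7), confirming that d = 1.33 brings the velocity vector length very close to 1 (it is exactly 1 at d = 4/3), while d ≈ 1.15 gives the largest energy vector length of the three settings.

```python
import numpy as np

def vectors(gains, spk_az):
    """Equation (3.7): Gerzon velocity and energy vector lengths from real speaker gains."""
    P, E = np.sum(gains), np.sum(gains ** 2)
    vx, vy = np.sum(gains * np.cos(spk_az)) / P, np.sum(gains * np.sin(spk_az)) / P
    ex, ey = np.sum(gains ** 2 * np.cos(spk_az)) / E, np.sum(gains ** 2 * np.sin(spk_az)) / E
    return np.hypot(vx, vy), np.hypot(ex, ey)   # on a regular rig the angles equal the source angle

def gains_for_source(src_az, spk_az, d):
    """First order encode (3.3) followed by virtual-microphone decode (3.4), horizontal only."""
    w = 1.0 / np.sqrt(2.0)
    return 0.5 * ((2.0 - d) * np.sqrt(2.0) * w
                  + d * (np.cos(spk_az) * np.cos(src_az) + np.sin(spk_az) * np.sin(src_az)))

if __name__ == "__main__":
    spk = np.radians(np.arange(8) * 45.0)       # regular eight-speaker rig
    src = np.radians(10.0)                      # any source angle gives the same lengths here
    for d in (1.0, 1.33, 1.15):
        rv, re = vectors(gains_for_source(src, spk, d), spk)
        print(f"d={d:4.2f}  |velocity|={rv:.3f}  |energy|={re:.3f}")
```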
It soon became apparent that, due to both the cinema and proposals for high definition television, the standard speaker layout for use in the home was not going to be a regular array. Gerzon had always had difficulty in solving the velocity and energy vector equations for irregular arrays, because irregular arrays generally need optimising not only for the vector lengths, but also for the decoded source angles and the perceived volume of the decoder (for example, more speakers in the front hemisphere, when compared to the rear, would cause sources to be louder when in that hemisphere). This meant that a set of non-linear simultaneous equations needed to be solved. Also, the shelving filter technique used for regular decoders could not be used for irregular decoders, as it was not just the polar pattern of the virtual microphones that needed to be altered. To this end a paper was published in 1992 (Gerzon & Barton, 1992) describing how a cross-over filter technique could be used along with two decoder designs, one for the low frequencies and one for the high frequencies, in order to solve the irregular speaker problem.

In the Gerzon & Barton (1992) paper a number of irregular Ambisonic decoders were designed; however, although many five speaker decoder examples were given, none were as irregular as the layout the ITU finally specified. For example, the front and rear speakers of the ITU layout are spaced at ±30° from straight ahead and ±70° from directly behind the listener, respectively, but the decoders Gerzon designed always had front and rear spacings that were similar to each other (e.g. ±35° front and ±45° rear), and although much work has been carried out on Ambisonics, a psychoacoustically correct 'Vienna style' decoder (named after the AES conference in Vienna where the Gerzon & Barton paper was presented) has not yet been calculated. It must also be noted that Gerzon's method for solving these equations was, by his own admission, "very tedious and messy" (Gerzon & Barton, 1992), and it can be observed, by visualising the velocity and energy vector responses in a similar manner to Figure 3.16, that this paper does not solve the equations optimally. This is due to the splitting of the encoding and the decoding by Gerzon. An example of a decoder optimised by Gerzon & Barton is shown in Figure 3.17.

Figure 3.17 Energy and velocity vector analysis of an irregular speaker decode optimised by Gerzon & Barton (1992), showing the speakers, the sound pressure level, and the velocity and energy vectors for reproduced angles of 0°, 12.25°, 22.5°, 45°, 90° and 135°.

It can be clearly seen in Figure 3.17 that the high frequency decode (the green line, representing the energy vector) has reproduced angles that do not match up with the low frequency velocity vector response. This is because the Gerzon & Barton paper suggests that, although the vector length and reproduced angle parameters should be optimised simultaneously for the high frequency energy vector, a forward dominance adjustment (a transformation of the B-format input signal) should then be carried out to ensure that the perceived volume of the high frequency decoder is not biased towards the back of the speaker array. This, inevitably, causes the reproduced angles to be shifted forward.

3.3.1.3 B-Format Encoding

The encoding equations (3.3) are basically a simulation of a B-format microphone (such as the SoundField microphone, SoundField Ltd., n.d.), which has a four-channel response as shown in Figure 3.12. However, recording coincidentally in three dimensions proves to be extremely difficult. Coincident microphone techniques in two dimensions (see section 3.2.2, Blumlein, page 36) are possible, where the microphones can be made coincident in the X–Y plane but not in the Z axis (although this still causes some mis-alignment problems); however, in three dimensions this is not desirable, as recording needs to be equally accurate in all three dimensions. This problem was solved by Gerzon and Craven (Craven & Gerzon, 1977) by the use of four sub-cardioid microphone capsules mounted in a tetrahedral arrangement. This arrangement is shown in Figure 3.18.

Figure 3.18 Four microphone capsules in a tetrahedral arrangement.

The capsules are not exactly coincident, but they are equally non-coincident in each axis' direction, which is important as this simplifies the correction of the non-coincident response. However, to aid in the explanation of the principles of operation of this microphone, the capsule responses will, for now, be assumed to be exactly coincident and of cardioid response.
As shown in Figure 3.18, each of the four microphone capsules faces in a different direction:

Table 3.1 SoundField Microphone Capsule Orientation
Capsule    Azimuth    Elevation
A          45°        35.3°
B          135°       -35.3°
C          -45°       -35.3°
D          -135°      35.3°

As each of the capsules has a cardioid pattern (in this example), all sound that the capsules pick up will be in phase. Simple manipulations can be performed on these four capsule signals (known collectively as A-format) so as to construct the four pick-up patterns of B-format, as shown in equation (3.8). A graphical representation of the four cardioid capsule responses, and the four first order components derived from these, is shown in Figure 3.19.

W = 0.5 × (A + B + C + D)
X = (A + C) − (B + D)
Y = (A + B) − (C + D)
Z = (A + D) − (B + C)        (3.8)

Figure 3.19 B-Format spherical harmonics derived from the four cardioid capsules of an A-format microphone (assuming perfect coincidence). Red represents in-phase and blue represents out-of-phase pickup.

As is evident from Figure 3.19, four perfectly coincident cardioid microphone capsules arranged as described above can perfectly recreate a first order, B-format, signal.
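As a quick numerical check of equation (3.8), the sketch below models the four capsules as ideal, perfectly coincident cardioids (the idealisation assumed above; the capsule gain model and function names are illustrative) and converts their outputs for a single frontal source to B-format, where the Y and Z components correctly cancel to zero.

```python
import numpy as np

def a_to_b(a, b, c, d):
    """Equation (3.8): B-format from the four capsule signals, with capsule
    orientations as in Table 3.1 (A front-left-up, B back-left-down,
    C front-right-down, D back-right-up)."""
    W = 0.5 * (a + b + c + d)
    X = (a + c) - (b + d)
    Y = (a + b) - (c + d)
    Z = (a + d) - (b + c)
    return W, X, Y, Z

if __name__ == "__main__":
    # Ideal cardioid gain, 0.5 * (1 + cos(angle between capsule axis and source)),
    # for a source straight ahead (azimuth 0, elevation 0).
    caps_az = np.radians([45.0, 135.0, -45.0, -135.0])
    caps_el = np.radians([35.3, -35.3, -35.3, 35.3])
    axes = np.column_stack([np.cos(caps_az) * np.cos(caps_el),
                            np.sin(caps_az) * np.cos(caps_el),
                            np.sin(caps_el)])
    src = np.array([1.0, 0.0, 0.0])                 # unit vector pointing to the front
    gains = 0.5 * (1.0 + axes @ src)
    print("A-format gains:", np.round(gains, 3))
    print("B-format (W, X, Y, Z):", np.round(a_to_b(*gains), 3))  # Y and Z ~0 for a frontal source
```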
However, as mentioned earlier, the four capsules providing the A-format signals are not perfectly coincident. This has the effect of misaligning the capsules in time/phase (they are so close that they do not significantly affect the amplitude response of the capsules), which results in colouration (filtering) of the resulting B-format signals. As all of the capsules are equally non-coincident, any colouration will be the same for each order, i.e. the 0th order component will be filtered in one way, and the 1st order components will be filtered in another way. However, using cardioid microphone pickup patterns causes the frequency response of the B-format signals to fluctuate too much, and so for the actual implementation of the microphone, sub-cardioid polar patterns were used (as shown in Figure 3.20).

To illustrate the frequency response characteristics of an Ambisonic microphone, it is simpler to assume that the microphone only works horizontally. Each of the four sub-cardioid capsules has no elevation angle, only an azimuth as described earlier. The equations that construct W, X, and Y will still be the same (3.8), but the Z component will not be constructed. Figure 3.20 shows a number of representations of a sound being recorded from four different directions, 0°, 15°, 30° and 45°, and indicates what amplitude each capsule will record, what timing mismatches will be present (although note that the sample scaling of this figure is over-sampled many times), and finally a frequency response for the W and X signals. It can be seen that the two channels not only have different frequency responses, but also that these responses change as the source moves around the microphone. It must be remembered that the overall amplitude of the X channel will change due to the fact that the X channel has a figure-of-eight response. Figure 3.20 shows a clear problem with having the capsules spaced in this way: the frequency response of the B-format signals changes as the source moves around the microphone. The smaller the spacing, the less of a problem it becomes (as the changes move up in frequency due to the shortening of the wavelengths when compared to the spacing of the capsules), and Figure 3.20 is based on the approximate spacing that is part of the SoundField MKV microphone (Farrah, 1979a).

Figure 3.20 Simulated frequency responses of a two-dimensional, multi-capsule A-format to B-format processing using a capsule spacing radius of 1.2 cm.

These responses can be corrected using filtering techniques, but only the average response will be correct, with the sound changing timbrally as it is moved around the microphone. Although the frequency response deviations sound like a large problem, they are not noticed and are combined with other errors in the signal chain, such as microphone capsule imperfections and loudspeaker responses. Also, Farrah (1979b) claims that similar coincident stereo techniques have a far greater error than the SoundField microphone anyway – "Closeness of the array allows compensations to be applied to produce B-format signal components effectively coincident up to about 10 kHz. This contrasts vividly with conventional stereo microphones where capsule spacing restricts coincident signals up to about 1.5 kHz". What is being referred to here is the frequency at which the filtering becomes non-constant. If the graphs of the omni-directional signal response are observed, it can be seen that its frequency response remains constant up to around 15 kHz, and it is the spacing of the capsules that defines this frequency: the closer the capsules, the higher the frequency at which non-uniformity is observed.

The SoundField microphone has many advantages over other multi-channel microphone techniques, with the main advantage being the obvious one in that it is just one microphone, and therefore needs no lining up with other microphones. Also, any combination of coincident first order microphones can be extracted from the B-format signals, which implies that the B-format signal itself can be manipulated, and this is indeed true. Manipulations including rotation, tumble and tilt are possible (Malham, 1998), along with being able to zoom (Malham, 1998) into a part of the sound field, which alters the balance along any axis. Equations for these manipulations are given in (3.9).

X – Zoom:
W′ = W + (1/√2) · d · X
X′ = X + √2 · d · W
Y′ = √(1 − d²) · Y
Z′ = √(1 − d²) · Z

Rotation about Z:
W′ = W
X′ = X · cos(θ) + Y · sin(θ)
Y′ = Y · cos(θ) − X · sin(θ)
Z′ = Z

Rotation about X:
W′ = W
X′ = X
Y′ = Y · cos(θ) − Z · sin(θ)
Z′ = Z · cos(θ) + Y · sin(θ)        (3.9)

where:
d is the dominance parameter (from –1 to 1).
θ is the angle of rotation.

A graphical representation of the effect that the zoom, or dominance, control has on the horizontal B-format polar patterns is shown in Figure 3.21.

Figure 3.21 Effect of the B-format zoom parameter on the W, X, and Y signals (for d = -0.5, d = 0 and d = 0.5).

As is evident from Figure 3.21 and Equation (3.9), the dominance parameter works by contaminating the W signal with the X signal and vice versa, which means that any speaker feeds taking in X and W will have these signals exaggerated if both are in phase, or cancelled out if both are out of phase with each other. This, coupled with the attenuation of the Y and Z channels, means that any derived speaker feeds/virtual microphone patterns will be biased towards the X axis. Dominance in the Y and Z directions can also be achieved in the same way.
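A small sketch of these manipulations is given below. The radical signs in the zoom equations above are reconstructed, so this should be read as an illustration under that assumption rather than a definitive implementation; it encodes a source hard left, zooms it towards the front and rotates the field about Z.

```python
import numpy as np

def zoom_x(W, X, Y, Z, d):
    """Equation (3.9), X-axis zoom/dominance, with -1 <= d <= 1."""
    return (W + d * X / np.sqrt(2.0),
            X + np.sqrt(2.0) * d * W,
            np.sqrt(1.0 - d * d) * Y,
            np.sqrt(1.0 - d * d) * Z)

def rotate_z(W, X, Y, Z, theta):
    """Equation (3.9), rotation of the sound field about the vertical (Z) axis."""
    return (W,
            X * np.cos(theta) + Y * np.sin(theta),
            Y * np.cos(theta) - X * np.sin(theta),
            Z)

if __name__ == "__main__":
    # Source encoded at 90 degrees (hard left) using equation (3.3).
    t = np.radians(90.0)
    W, X, Y, Z = 1.0 / np.sqrt(2.0), np.cos(t), np.sin(t), 0.0

    Wz, Xz, Yz, Zz = zoom_x(W, X, Y, Z, d=0.5)
    # The zoomed source decodes to an azimuth nearer the front than 90 degrees.
    print("zoomed azimuth (deg):", round(np.degrees(np.arctan2(Yz, Xz)), 1))

    # A rotation by theta moves a source from azimuth phi to (phi - theta),
    # so theta = -30 degrees moves this source from 90 to 120 degrees.
    print("rotated B-format:", np.round(rotate_z(W, X, Y, Z, np.radians(-30.0)), 3))
```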
3.3.1.4 Higher Order Ambisonics

Ambisonics is a very flexible system, with its only main drawback being that only a first order microphone system is commercially available (however, it must be noted that all commercially available microphones have a first order polar pattern at present). However, as the name first order suggests, higher order signals can be used in the Ambisonics system, and the theory needed to record higher order circular harmonics has been discussed in a paper by Mark Poletti (Poletti, 2000). A 2nd order system has nine channels for full periphony (as opposed to the four channels of 1st order) and five channels for horizontal-only recording and reproduction (as opposed to three channels for 1st order). The equations for the nine 2nd order channels are given in (3.10) (Furse, n.d.).

W = 1/√2
X = cos(θ) × cos(α)
Y = sin(θ) × cos(α)
Z = sin(α)
R = 1.5 × sin²(α) − 0.5
S = cos(θ) × sin(2α)
T = sin(θ) × sin(2α)
U = cos(2θ) × cos²(α)
V = sin(2θ) × cos²(α)        (3.10)

where:
α = elevation angle of the source.
θ = azimuth angle of the source.

For horizontal-only work α is fixed at zero, which makes the Z, S and T channels zero (and R a constant), meaning that only W, X, Y, U and V are used. To demonstrate the difference in polar patterns (horizontally) between 1st, 2nd, 3rd and 4th order systems (using equal weightings of each order), see Figure 3.22.

Figure 3.22 Decoded polar patterns of 1st, 2nd, 3rd & 4th order systems (using a virtual cardioid pattern as the 1st order reference and equal weightings of each order). Calculated using a formula based on equation (3.4), with an azimuth of 180°, an elevation of 0° and a directivity factor (d) of 1.
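Before moving on, the sketch below evaluates the encoding equations (3.10) for a horizontal source (the function name is illustrative), showing which of the nine channels actually carry directional information when the elevation is fixed at zero.

```python
import numpy as np

def encode_second_order(theta_deg, alpha_deg=0.0):
    """Equation (3.10): the nine 2nd order Ambisonic channel gains for one source."""
    t, a = np.radians(theta_deg), np.radians(alpha_deg)
    W = 1.0 / np.sqrt(2.0)
    X, Y, Z = np.cos(t) * np.cos(a), np.sin(t) * np.cos(a), np.sin(a)
    R = 1.5 * np.sin(a) ** 2 - 0.5
    S, T = np.cos(t) * np.sin(2 * a), np.sin(t) * np.sin(2 * a)
    U, V = np.cos(2 * t) * np.cos(a) ** 2, np.sin(2 * t) * np.cos(a) ** 2
    return dict(W=W, X=X, Y=Y, Z=Z, R=R, S=S, T=T, U=U, V=V)

if __name__ == "__main__":
    ch = encode_second_order(180.0, 0.0)   # horizontal source directly behind the listener
    # With zero elevation, Z, S and T are zero and R collapses to its constant offset,
    # so only W, X, Y, U and V are needed for horizontal-only reproduction.
    print({k: round(v, 4) for k, v in ch.items()})
```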
Higher order polar patterns, when decoded, do not imply that fewer speakers are working at the same time; they are just working in a different way to reconstruct the original sound field. Figure 3.23 shows the decoded levels for an infinite number of speakers placed on the unit circle. The virtual source is placed at 180° and the virtual decoder polar pattern is set to that shown in Figure 3.22. The multiple lobes can clearly be seen at 180° for the second order decode and at approximately 130° and 250° for the third order decode. Note that the peak at the source position is not necessarily the same for each Ambisonic order (the responses were scaled in Figure 3.22, but this is a decoder issue), but the sum of all the decoder feeds (divided by the number of speakers) is equal to 1 for each order. This means that the measured pressure value at the middle of the speaker array will be consistent.

Figure 3.23 An infinite-speaker decoding of a 1st, 2nd, 3rd & 4th order Ambisonic source at 180°. The decoder's virtual microphone pattern for each order is shown in Figure 3.22.

One point not mentioned so far is that there is a minimum number of speakers needed to successfully reproduce each Ambisonic order, which is always greater than the number of transmission channels available for the decoder (Gerzon, 1985). This problem can be compared with the aliasing problem in digital audio, that is, enough 'samples' must be used in the reproduction array in order to reproduce the curves shown in Figure 3.23. For example, if we take a 1st and a 2nd order signal and reproduce them over four speakers (knowing that a 2nd order signal will need at least six speakers to be reproduced correctly), then the amplitudes of the signals at the four speakers are as shown in Figure 3.24. It can clearly be seen that speakers two and four (at 90° and 270° respectively) have no output, whereas speaker three (positioned at 180°) has an amplitude of 1, coupled with the opposite speaker (at 0°) having an output amplitude of 1/3.

Figure 3.24 Graph of the speaker outputs for a 1st and 2nd order signal, using four speakers (the last point is a repeat of the first, i.e. 0°/360°) and a source position of 180°.

This will result in the image pulling towards one speaker when the source position is near that direction. This is also shown in the research by Gerzon (1985) and will cause the decoding to favour the directions at the speaker locations. This is detrimental to the reproduced sound field, as one of the resounding features of Ambisonics is that all directions are given a constant error, making the speakers 'disappear', which is one reason why Ambisonics can give such a natural sounding reproduction.

Recent work by Craven (2003) has now described a panning law (as described in the paper, which is analogous to an Ambisonic decoder) for irregular speaker arrays using 4th order circular harmonics. This uses the velocity and energy vector theories mentioned above to optimise the decoder for the ITU irregular 5-speaker array. What is interesting about this decoder is that, although 4th order circular harmonics are used, the polar patterns used for the virtual microphone signals are not strictly 4th order (as shown in Figure 3.22) but are 'contaminated' with 2nd, 3rd and 4th order components in order to steer the virtual microphone polar patterns so that the performance of the decoder is maximised (which means having a higher order front and lower order rear decode, dependent on speaker density). The velocity and energy vector analysis of the 4th order decoder used by Craven (2003) can be found in Figure 3.25 and the corresponding virtual microphone patterns can be seen in Figure 3.26.

Figure 3.25 Energy and velocity vector analysis of a 4th order Ambisonic decoder for use with the ITU irregular speaker array, as proposed by Craven (2003).

Figure 3.26 Virtual microphone patterns used for the irregular Ambisonic decoder shown in Figure 3.25.

It must also be noted that a number of researchers have now started to work on much higher orders of Ambisonics (for example, 18th order), and it is at these orders that Ambisonics does, indeed, tend towards a system similar to wavefield synthesis (see Sontacchi & Holdrich, 2003 and Daniel et al., 2003). Although these much higher order systems will not be utilised in this report, the underlying principles remain the same.

3.3.1.5 Summary

Ambisonics is an ideal system to work with for a number of reasons:
• It has both a well defined storage format and simple synthesis equations, making it useful for both recording/mixing and real-time synthesis.
• The encoding is separated from the decoding, resulting in a system where decoders can be designed for different speaker arrays.
• The design of a decoder is based on approximations to what a centrally seated listener will receive, in terms of phase and level differences between the ears, at low and high frequencies.
This makes it an ideal choice for a system that can be converted to binaural and transaural reproduction. However, a number of issues are apparent:
• The optimisation of a frequency dependent 1st order decoder for use with the ITU 5 speaker array has not been achieved, with the technique of solving the non-linear simultaneous equations representing the velocity and energy vectors being both laborious and leading to non-ideal results. This process will only become more complicated when more speakers are added (Gerzon & Barton, 1992 and Gerzon & Barton, 1998).
• The energy and velocity vectors are low order approximations to the actual head related signals arriving at the ears of the listener.
• The analysis and design of Ambisonic decoders could, potentially, be improved through the use of head related data directly.

3.3.2 Wavefield Synthesis

3.3.2.1 Theory

Although this research concentrates on the Ambisonic form of speaker surround sound, it is not necessarily because it is the most realistic in its listening experience. One of the most accurate forms of surround sound (from a multiple-listener point of view) is termed Wavefield Synthesis. In its simplest form Wavefield Synthesis is the system first tried by Bell Labs mentioned at the beginning of this chapter (Rumsey and McCormick, 1994); however, the theory and underlying principles of Wavefield Synthesis have since been studied, the mathematical transfer functions calculated, and a theoretical understanding of the necessary signal processing involved in such a system developed. The result is that individual sources can be synthesised, simulating both angular placement and distance (with distance being the cue that is, perhaps, hardest to recreate using other multi-speaker reproduction systems). Wavefield Synthesis is different from most other multi-speaker surround sound systems in a number of ways:
• It is a volume solution, that is, there is no 'sweet spot', with an equal reproduction quality experienced over a wide listening area.
• Distance simulation is very well suited to Wavefield Synthesis. This is a difficult cue to simulate using other forms of multi-channel sound.
• The resulting acoustic waves, rather than the source itself, are synthesised.

Wavefield Synthesis (and the Bell Labs version before it) is based on Huygens' principle¹. Put simply, this states that any wave front can be recreated by using any number of point sources that lie on the original wave. This implies that to recreate a plane wave (i.e. a source at an infinite distance from the listener) a line array of speakers must be used, but to create a spherical wave (more like the waves heard in real life) an arc of speakers must be used. However, where Wavefield Synthesis' innovation lies is that the necessary transfer functions have been calculated, and a line array of speakers can synthesise both of these situations using a mixture of time delays and amplitude scaling (a transfer function).

¹ The principle that any point on a wave front of light may be regarded as the source of secondary waves, and that the surface that is tangent to the secondary waves can be used to determine the future position of the wave front.
It is often thought that Ambisonics is spherical Wavefield Synthesis on a lesser scale, and Bamford (1995) has analysed it in this way (that is, as a volume solution, looking at how well the sound waves are reconstructed); however, this is not strictly the case, as no time differences are recorded (assuming perfectly coincident microphone capsules), or necessarily needed, and so it is more accurate to think of Ambisonics as an amplitude panning scheme (albeit one based on more solid foundations than simple pair-wise schemes). This also suggests that the results from Bamford (1995), which state that first order Ambisonics is only 'correct' up to 216 Hz (in a sweet spot 25 cm wide), may be a simplification (and under-estimation) of the system's performance. In other words, this is a measure of Ambisonics' wavefield synthesis performance. Clearly, if Ambisonics only had a useable (spatially speaking) frequency range of up to 216 Hz, and a sweet spot 25 cm wide, it would not be very useful for surround sound.

So what is the limiting factor for Wavefield Synthesis? Due to the finite number of points used to recreate a sound wave, the system is limited by its 'spatial aliasing frequency' (Berkhout et al., 1992). The equation for this (although note that this is for a plane wave) is given in Equation (3.11) (Verheijen et al., 1995).

fNyq = c / (2Δx sin(θ))        (3.11)

where:
fNyq = limiting Nyquist frequency.
Δx = speaker spacing.
θ = angle of radiation.
c = speed of sound in air (≈342 ms⁻¹).

It must be noted that although Wavefield Synthesis has a limiting frequency, this is its spatial aliasing limit. That is, the system can reproduce sounds of full bandwidth; however, accurate reproduction can only be correctly achieved (theoretically) below this frequency (which is, incidentally, the reason Bell Labs' early simplification of their original multi-microphone, multi-speaker array did not work as hoped when the number of speakers was reduced). It can also be seen that the limiting frequency is inversely proportional to the sine of the angle of radiation. To understand the reasons behind this, an example is shown in Figure 3.27.

Figure 3.27 The effect that the angle of radiation has on the synthesis of a plane wave using Wavefield Synthesis (the time delay between adjacent speakers is Δt = Δx·sin(θ)/c).

Once the angle of radiation is changed to an off-centre value (i.e. non-zero), the amount of time delay that is needed to correctly simulate the plane wave is increased, proportional to the distance between the speakers multiplied by the sine of the angle, θ. Once this time delay becomes more than half the wavelength of the source, the superposition of the wave fronts creates artefacts that manifest themselves as interference patterns (Verheijen et al., 1995). Filtering the transfer functions used to recreate the wave front (or using more directional loudspeakers (Verheijen et al., 1995)) counteracts this.
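The sketch below evaluates equation (3.11) for a few radiation angles; the 0.125 m speaker spacing is an illustrative value and not taken from the text, but the calculation makes clear how quickly the aliasing limit falls as the wave front is steered off-axis.

```python
import numpy as np

def spatial_aliasing_frequency(spacing_m, radiation_angle_deg, c=342.0):
    """Equation (3.11): spatial aliasing limit for a Wavefield Synthesis plane wave."""
    s = np.sin(np.radians(radiation_angle_deg))
    return np.inf if s == 0 else c / (2.0 * spacing_m * s)

if __name__ == "__main__":
    for angle in (10.0, 30.0, 60.0, 90.0):
        f = spatial_aliasing_frequency(0.125, angle)   # 0.125 m spacing (illustrative)
        print(f"radiation angle {angle:5.1f} deg  ->  f_alias = {f / 1000:.2f} kHz")
```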
3.3.2.2 Summary

Wavefield Synthesis is reported as being one of the most accurate forms of multi-channel sound available, but it does have some problems that make it an undesirable solution for this project:
• A huge number of transducers is needed to recreate horizontal surround sound (for example, the University of Erlangen-Nuremberg's experimental setup uses 24 speakers (University of Erlangen-Nuremberg, N.D.) arranged as three sides of a square).
• The reproduction of three-dimensional sound is not yet possible using Wavefield Synthesis.
• Recording a sound field for reproduction using Wavefield Synthesis is difficult due to the high rejection needed for each direction. Synthesised material works much better (Verheijen et al., 1995).
• A large number of storage channels and a large amount of processing power are needed to provide the loudspeakers with appropriate signals.

Also, there is not, as yet, a standard protocol for the storage and distribution of such material, although this is being worked on as part of the MPEG Carusso Project (Ircam, 2002). This lack of a storage standard is not an issue, of course, for applications that calculate their acoustical source information on the fly, such as virtual reality systems.

3.3.3 Vector Based Amplitude Panning

3.3.3.1 Theory

Vector based amplitude panning (or V.B.A.P.) is an amplitude panning law for two or three dimensional speaker rigs, and was developed by Ville Pulkki. Once the speaker positions are known, the V.B.A.P. algorithm can be used to decode to the speaker rig using pair-wise (two dimensions) or triple-wise (three dimensions) panning techniques. An example of the two dimensional algorithm is shown in Figure 3.28 (Pulkki, 1997).

Figure 3.28 Graphical representation of the V.B.A.P. algorithm (the source direction is resolved into the two weighted speaker direction vectors, g1·l1 and g2·l2).

As can be seen in Figure 3.28, horizontal V.B.A.P. divides the source direction into its two component gains, in the directions of the loudspeakers, which are then used as the gains for the amount of the source that is supplied to each of the speakers. It must be noted, however, that sources are limited to existing on the path between speakers by normalising the gain coefficients g1 and g2. To extend the system to three dimensions, triple-wise panning is used. An example decode of a source travelling from an angle of 0° to an angle of 120° is shown in Figure 3.29, along with the four un-normalised speaker gains. This system can work very well, mainly because the largest possible localisation error cannot be any more than one speaker away from where the source should be. However, as can be observed from Figure 3.29, a speaker detent effect will be noticed when a source position is in the same direction as a speaker, as only that speaker will be replaying sound. This will create a more stable, and psychoacoustically correct, virtual source (as it is now a real source), which means that the individual speakers will be heard, with sources potentially jumping from speaker to speaker if the spacing between the speakers is too great.

Figure 3.29 Simulation of a V.B.A.P. decode for source positions of 0°, 30°, 60°, 90° and 120°. Red squares – speakers, blue pentagram – source, red lines – speaker gains.
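The two-dimensional case can be written in a few lines: the source direction vector is expressed as a weighted sum of the two loudspeaker unit vectors and the resulting gains are normalised, following Pulkki (1997). The function name and the ±30° speaker positions in the demonstration are illustrative assumptions.

```python
import numpy as np

def vbap_2d(source_deg, spk1_deg, spk2_deg):
    """Two-dimensional V.B.A.P.: solve for the gains that express the source
    direction as a combination of the two loudspeaker unit vectors, then
    normalise them (so g1**2 + g2**2 = 1)."""
    p = np.array([np.cos(np.radians(source_deg)), np.sin(np.radians(source_deg))])
    L = np.column_stack([[np.cos(np.radians(spk1_deg)), np.sin(np.radians(spk1_deg))],
                         [np.cos(np.radians(spk2_deg)), np.sin(np.radians(spk2_deg))]])
    g = np.linalg.solve(L, p)          # un-normalised gains g1, g2
    return g / np.linalg.norm(g)

if __name__ == "__main__":
    # Source panned between speakers at +30 and -30 degrees.
    for az in (30.0, 15.0, 0.0, -15.0, -30.0):
        g1, g2 = vbap_2d(az, 30.0, -30.0)
        # At a speaker position only that speaker carries signal (the detent effect).
        print(f"source {az:6.1f} deg   g(+30)={g1:.3f}   g(-30)={g2:.3f}")
```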
3.3.3.2 Summary

VBAP is based around the simple pair-wise panning of standard stereo, although using the VBAP technique it can easily be extended to a triple-wise, with-height system. To this end, a VBAP system comprising a low number of speakers will suffer the same problems as other pair-wise panned systems (see Quadraphonics, section 3.2.7). However, as the number of speakers is increased, the accuracy of the system will improve, although side images will always suffer when compared to frontal images due to pair-wise panning techniques failing for speakers placed to the side of a listener (although the error will, again, lessen with increased speaker density). For this project, however, VBAP is unsuitable as:
• VBAP has no storage format – all panning information is calculated when the material is replayed, as information regarding the speaker layout must be known.
• Any pre-decoded material cannot have additional speaker feeds calculated according to the rules of VBAP.
• The decoded material is not optimised for a centrally seated listener, making the system sub-optimal if conversion to headphone or transaural systems is required.

3.3.4 Two Channel, Binaural, Surround Sound

Although all of the surround sound systems discussed so far have used more than two channels (many more, in some cases), it is possible to use only two channels. Such a system is termed binaural reproduction. As we only have two ears, it seems reasonable that only two channels of audio are necessary to successfully fool the ear/brain system into thinking that it is experiencing a realistic, immersive, three dimensional sound experience. All of the speaker reproduction systems discussed so far have a number of marked limitations:
• System performance is normally proportional to the number of speakers used. The more speakers, the better the result.
• The sound from each speaker will reach both ears, making it a more involved task to control exactly what is being perceived by the listener.
• The final system is usually a compromise due to the above limitations.

Binaural sound circumvents these limitations with the use of headphones. As there is a one to one mapping of the ears to the transducers, it is very easy to provide the ears with the signals necessary to provide convincing surround sound. Binaural sound reproduction works on the simple principle that if the ears are supplied with the same acoustical pressure that would have been present in real life due to a real source, then the ear/brain system will be fooled into perceiving that a real source is actually there. As discussed in Chapter 2, there are a number of auditory cues that the ear/brain system uses to localise a sound source, a number of which can be simulated using a head related transfer function (HRTF). An example pair of HRTFs is shown in Figure 3.30, taken from a KEMAR dummy head in an anechoic chamber by Gardner & Martin (1994). The source was at an angle of 45° from the centre of the head, and at a distance of 1 m.

Figure 3.30 Pair of HRTFs taken from a KEMAR dummy head for a source at an angle of 45° to the left and a distance of 1 metre from the centre of the head. Green – left ear, blue – right ear.

The three lateralisation cues can be clearly seen in this figure. These are:
• Amplitude differences – the amplitude is highest at the nearer ear.
• Time differences – the farther ear's signal is delayed compared to the closer ear's (seen in both the time domain plot and the phase response plot, by observing the larger [negative] gradient).
• Pinna and head filtering – the sound has two different physical paths to travel to the ears, due to the pinna and the head, resulting in frequency dependent filtering (seen in the frequency response plot).

It is the head related transfer function that forms the basis on which binaural sound reproduction is founded, although through the use of anechoic HRTF data alone, only simple lateralisation is possible. This will be discussed shortly. There are two ways in which to create a binaural reproduction: it can be recorded using in-ear microphones, or it can be synthesised using HRTF data.
As far as the recording side of binaural sound is concerned, the theory is as simple as placing a pair of microphones into the ears of the recordist (or a dummy head). The parts of the outer ear that filter the incoming sound wave are the pinna and the ear canal. If the recorded material is taken from a subject with an open ear canal (i.e. microphones placed in the ear of the subject), then the recording will possess the ear canal resonance, which lies at about 3 kHz (a 3 cm closed pipe has a fundamental resonant frequency of 2850 Hz). Then, when the listener replays the recording over headphones, the recording will be subjected to another ear canal resonance, meaning that the musical content will be perceived as having a large resonance at around 3 kHz. This, therefore, must be corrected with the use of equalisation, although the blocking of the ear canal of the recordist prior to recording is another solution (Kleiner, 1978). The actual positioning of the microphones within the outer ear of the subject also has an effect on the system, where the most robust positioning of the microphone is usually found to be inside the ear canal (Ryan & Furlong, 1995) (although the blocking of the ear canal is not really a desirable solution to the last problem).

There are two other difficulties in using recorded binaural material, and they are pinna individualism and head movements. As discussed in Chapter 2, everyone's pinnae are different, which in turn means that the complex filtering patterns that the pinnae apply to the incoming sound waves are also different. The binaural recording process means that the listener will be experiencing the sound field by listening through somebody else's ears. The results of this will be discussed later in this section.

When it comes to synthesising a binaural sound field, HRTF data is used. As the HRTF is a measure of the response of the ear due to a source, it will suffer from the same difficulties mentioned for the recorded material. However, some differences are apparent. The HRTF data used to synthesise sources is normally recorded in an anechoic chamber (Gardner and Martin, 1994) as this gives the greatest flexibility in source position synthesis: it is possible to add reverberation, but very difficult to take it away again. Also, HRTFs are usually recorded in pairs at a set distance from the centre of the head (say, one metre), but this is not necessarily the most versatile solution. As a demonstration of this, consider the situation shown in Figure 3.31.

Figure 3.31 Example of a binaural synthesis problem: HRTFs measured from directions one metre from the centre of the head, used to synthesise a source at a different distance from the listener's ears.

If distance is to be simulated correctly, then recording and storing the HRTFs in pairs centred on the head actually complicates the situation. This is because the pair of HRTFs will have an amplitude difference, time difference, and pinna filtering that is due not only to the angle of incidence of the source, but also to its distance, as discussed in Chapter 2. This means that if a source is to be synthesised at a distance different from the one that was measured, then the point at which the source's direction intersects the measured distance needs to be obtained. Extra delay also needs to be added to the HRTF filters, with a different value added to the left and right HRTFs. This adds extra, avoidable, calculations to the synthesis model, and is undesirable in real-time applications.
To combat this problem it is far better that the HRTFs be recorded taking each ear as the centre point for the measurements, as this means that only the angle from the source to each of the listener's ears needs to be calculated, which is simpler than the scheme detailed above (although extra delay does still need to be added for each response separately). Once the problem of angle of incidence has been resolved (with one of the two methods suggested above), one of the main advantages of binaural theory can come into play, and that is the simulation of distance cues. However, obtaining sources that are localisable outside of the head (i.e. not just resulting in source lateralisation) is not usually possible using anechoic simulation of the source (McKeag & McGrath, 1997). This, in some respects, is to be expected, as one of the psychological effects of being in an anechoic chamber is that sources tend to be perceived much closer than they actually are. One of the mechanisms that the brain utilises in the perception of source distance is the direct to reverberant ratio of sounds (see Chapter 2). Sounds that are very close to the head have very little (if any) reverberation perceived with them, so if a sound is heard in an anechoic chamber then the brain may assume that this source is close to us because of this. However, when listening to synthesised binaural sources it is unlikely that true, or even any, distance information will be perceived. This is due, mainly, to the reasons given below:
• In nearly all listening situations the ear/brain system uses small head rotations to resolve the position of a source within the cone of confusion.
• The shape and, therefore, filtering of the sound due to the pinna of the recording subject will be different from that of the listener.

A number of people (including Moller et al., 1996) suggest that individualised HRTFs are needed for the accurate reproduction of binaural sound, while others suggest that head tracking is the most important aspect of the localisation process (Inanaga et al., 1995). However, it can be seen that neither of these is necessarily needed, and depth perception can be achieved by creating multiple, coherent auditory cues for the listener (McKeag & McGrath, 1997). Again, depending on the application, there are two methods of achieving this. Firstly, for the simulation of sources that are in a fixed position, the HRTFs can be measured in a real room, thereby recording the room's actual response to a source, in this position, at the two ears of a subject. This, when convolved with the source material, will create the illusion of a source outside the head of the listener (McKeag & McGrath, 1997). Secondly, if dynamic source movement is needed, such as in 3D gaming and virtual reality applications, then a model of the room in which the source is placed must be realised separately from the source, and then all of the images synthesised using anechoic HRTF data. The binaural synthesis of material in this way can lead to a very convincing surround sound experience using a limited number of channels, which is probably why all 3D computer gaming cards use this form of modelling.
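At its core, this kind of binaural synthesis is a pair of convolutions, one per ear. The sketch below shows the basic operation; the toy impulse responses are placeholders standing in for a measured HRIR pair (such as the KEMAR set of Gardner & Martin, 1994), and the delay and gain values are illustrative only.

```python
import numpy as np

def binaural_synth(mono, hrir_left, hrir_right):
    """Anechoic binaural synthesis: convolve a mono source with a left/right HRIR pair."""
    return np.convolve(mono, hrir_left), np.convolve(mono, hrir_right)

if __name__ == "__main__":
    fs = 44100
    # Placeholder HRIRs for a source to the left: the far (right) ear is delayed
    # by ~0.6 ms and attenuated relative to the near (left) ear.
    hrir_l = np.zeros(128); hrir_l[0] = 1.0
    hrir_r = np.zeros(128); hrir_r[int(0.0006 * fs)] = 0.5
    mono = np.random.randn(fs)                     # one second of noise as a test source
    left, right = binaural_synth(mono, hrir_l, hrir_r)
    print(left.shape, right.shape)                 # two ear signals ready for headphone replay
```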
This technique was pioneered by Lake DSP (see, for example, McKeag & McGrath (1996), and McKeag & McGrath (1997) as an example of their later work), and has more recently been used by others (for example, Leitner et al., 2000 and Noisternig et al., 2003) as a method of simulating both discrete speaker feeds and, in the case of Ambisonics, realising an Ambisonic decoder efficiently as three or four HRTF filters (see Chapters 4 and 5 for more details on this). Interestingly, although three of the four papers mentioned above discuss Ambisonics to binaural conversion, none use psychoacoustically optimised decoders as discussed in section 3.3.1.2. This will result in sub-optimal lateralisation parameters being reproduced at the listener's ears, as shown in the non-optimised decoders discussed in section 5.2.

3.3.5 Transaural Surround Sound

Transaural surround sound techniques were first proposed in the 1960s by Atal, Hill and Schroeder (Atal, 1966) and, although based on a relatively simple and understandable principle, were difficult to realise at that time. Transaural sound is a process by which binaural reproduction can be realised over loudspeakers. Loudspeaker reproduction differs from headphone reproduction in that the sound from one loudspeaker reaches both ears (a fact that is the basis of Blumlein's stereo reproduction technique, see earlier in this chapter), whereas binaural reproduction over headphones relies on the fact that the signal from one transducer reaches only one ear, that is, there is no crosstalk between the ears of the listener.

The Transaural system is easier to explain if the following problem is considered: if a pulse is emitted from one of a pair of loudspeakers, what must happen for that pulse to appear at only one ear of the listener? This situation is shown in Figure 3.32, but is simplified by taking each ear as a microphone in a free field (i.e. no filtering of the sound will be present due to the head of the listener). Each of the two speakers is equidistant from the centre of the two microphones, and the pair subtends an angle of 60 degrees (±30°).

Figure 3.32: Graphical representation of the crosstalk cancellation problem (two loudspeakers and two microphones, Mic1 and Mic2).

It can be noted that Mic1 receives the pulse first, closely followed by Mic2, which receives the same pulse except that its amplitude has been attenuated and it arrives later in time due to the extra distance travelled. In order to cancel the sound arriving at Mic2, the left loudspeaker can be made to emit a sound with the same amplitude as the signal arriving at Mic2, but inverted (180° out of phase), as shown in Figure 3.33. This signal now cancels out the first sound picked up by Mic2 (see the microphones' response to each speaker's output in Figure 3.33), but then the crosstalk produces another signal, again amplitude reduced, at Mic1. So another, amplitude reduced and phase inverted, signal is produced from the right loudspeaker to counteract the Mic1 crosstalk signal, and so on. As the amplitude of these pulses is always diminishing, a realisable and stable filter results, as shown in Figure 3.34. Also shown in Figure 3.34 is the block diagram for a typical implementation of a crosstalk cancellation system; note that this system will crosstalk cancel for both speakers, that is, the Left input signal will only appear at Mic2 and the Right input signal will only appear at Mic1. These two filters can be realised using a pair of I.I.R. filters (Infinite Impulse Response filters, typically using a feedforward/back loop and attenuating gain factors).
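As a minimal sketch of the free-field case described above (and of the ever-diminishing pulse trains of Figure 3.34), the following Matlab-style fragment builds the two cancellation responses as FIR pulse trains, assuming that the far-ear path is simply an attenuation a and a delay of d samples relative to the near-ear path (both values, and all names, are illustrative):

    % Free-field crosstalk cancellation pulse trains (cf. Figure 3.34).
    a = 0.9;  d = 10;  N = 512;                 % assumed attenuation, delay, length
    h1 = zeros(N,1);  h2 = zeros(N,1);
    for k = 0:floor((N-1)/(2*d))
        h1(1 + 2*k*d) = a^(2*k);                % same-side speaker: positive pulses
        if 1 + (2*k+1)*d <= N
            h2(1 + (2*k+1)*d) = -a^(2*k+1);     % far speaker: inverted, delayed pulses
        end
    end
    % Feeding an input through h1 (to the near speaker) and h2 (to the far
    % speaker) forces the summed signal at the far microphone towards zero.

The same pair of responses could be produced exactly by the two-filter I.I.R. structure mentioned above; the FIR form is shown here only because it makes the decaying, alternating pulse train easy to see.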
However, this structure is not used in practice, as the response of the listener's head is not taken into account and so this form of crosstalk cancellation will be sub-optimal.

Figure 3.33: Simulation of Figure 3.32 using the left loudspeaker to cancel the first sound arriving at Mic2.

Figure 3.34: Example of free-field crosstalk cancellation filters (the two dipole filter responses, H1 and H2, plotted against time in samples) and an example implementation block diagram.

Although this particular filtering model would never be used in practice, it will be used here to demonstrate the type of frequency response changes that occur due to the crosstalk cancellation filtering process. In theory, of course, the sounds heard at the two microphone positions will be as desired, but off-centre listening (and also, to some extent, listening in the sweet spot in a non-anechoic room) will result in a response similar to that shown in Figure 3.35. Although this seems slightly irrelevant for crosstalk cancellation filters designed with HRTF data, it does show some of the extreme filtering that can occur due to the system inversion process.

Figure 3.35: Frequency response of the free-field crosstalk cancellation filters.

The process described above as filter inversion is, in fact, slightly more complicated than this. Although the example above (crosstalk cancellation in the free field) is a good starting point for gaining an understanding of the processes involved in crosstalk cancellation algorithms, the equation has not yet been defined. If we again look at the problem shown in Figure 3.36, it can be seen that, for a symmetrical setup, only two transfer functions are present: c1, the response of a microphone to the near speaker, and c2, the response of a microphone to the far speaker.

Figure 3.36: The crosstalk cancellation problem, with the speaker signals (v1, v2) and transfer functions (c1, c2) shown.

The relationship between the signals emanating from the speakers and what arrives at the two microphones is given in Equation (3.12).

\( \begin{bmatrix} \mathrm{Mic}_1 \\ \mathrm{Mic}_2 \end{bmatrix} = \begin{bmatrix} c_1 & c_2 \\ c_2 & c_1 \end{bmatrix} \begin{bmatrix} v_1 \\ v_2 \end{bmatrix} \)   (3.12)

Therefore, if we wish to present to the system the signals that we wish to receive at Mic1 and Mic2, then the inverse of the transfer function matrix needs to be applied to the two signals prior to transmission (Nelson et al., 1997) (which is what is happening in the system described in Figure 3.34), as shown in Equation (3.13). The simplification to two filters, h1 and h2, can be made because the crosstalk cancellation means that the signal at Mic2 will be forced to zero and the signal at Mic1 will be the desired signal at unity gain.

\( \begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = \frac{1}{c_1^2 - c_2^2} \begin{bmatrix} c_1 & -c_2 \\ -c_2 & c_1 \end{bmatrix} \begin{bmatrix} \mathrm{Mic}_1 \\ \mathrm{Mic}_2 \end{bmatrix} \)

\( v_1 = \frac{c_1}{c_1^2 - c_2^2}\,\mathrm{Mic}_1 + \frac{-c_2}{c_1^2 - c_2^2}\,\mathrm{Mic}_2 \qquad v_2 = \frac{-c_2}{c_1^2 - c_2^2}\,\mathrm{Mic}_1 + \frac{c_1}{c_1^2 - c_2^2}\,\mathrm{Mic}_2 \)

\( \Rightarrow \quad h_1 = \frac{c_1}{c_1^2 - c_2^2} \qquad h_2 = \frac{-c_2}{c_1^2 - c_2^2} \)   (3.13)

where: v1 & v2 are the speaker signals shown in Figure 3.36, c1 & c2 are the transfer functions from Figure 3.36, and h1 & h2 are the transfer functions used in Figure 3.34.
The final filters are shown in Equation (3.14) (the multiplication of both the numerator and the denominator by c1² + c2² is also shown, for compatibility with the frequency dependent inversion procedure). The calculation is carried out in the frequency domain, adapted from Farina et al. (2001), as inverting this system in the time domain can take a long time, even on fast computers. As an example, the calculation of these filters in the frequency domain, using Matlab® and a filter size of 1024 points, takes less than a second; however, using time domain signals, with the simple multiplications and divisions turning into convolutions and de-convolutions, means that the same algorithm can take around half an hour to complete.

\( h_1 = c_1 \times \left(\frac{c_1^2 + c_2^2}{c_1^4 - c_2^4}\right) \qquad h_2 = -c_2 \times \left(\frac{c_1^2 + c_2^2}{c_1^4 - c_2^4}\right) \)   (3.14)

where: c1 & c2 are the transfer functions from Figure 3.36, and h1 & h2 are the transfer functions used in Figure 3.34.

It must also be noted that Equation (3.14) shows the inversion procedure for the symmetrical case (that is, the diagonals of the transfer function matrix are identical), and is not the general solution for this problem. Now that the mathematical equation has been defined, any transfer function can be used for c1 and c2 and a non-free field situation simulated. For example, if two speakers were spaced at ±30°, as in a normal stereo triangle, then the corresponding crosstalk cancellation filters will be as shown in Figure 3.37.

Figure 3.37: Transfer functions c1 and c2 for a speaker pair placed at ±30°, and their corresponding crosstalk cancelling filters.

As can be seen in the right hand graph of Figure 3.37, the crosstalk cancellation filters actually have samples that are valued greater than one (which denotes potential clipping in many audio applications); however, in this case, they will not clip themselves (so long as storing these filters is not a problem). Nevertheless, when they are applied to a signal, much amplification will arise. The frequency responses of the two crosstalk cancellation filters are given in Figure 3.38.

Figure 3.38: Frequency response of the two speaker to ear transfer functions (c1 & c2) and the two crosstalk cancellation filters (h1 & h2) given in Figure 3.37.

It can clearly be seen that any dip in the response of the original transfer functions, c1 and c2, creates an almost corresponding boost in the inverse response (this sounds obvious, but h1 and h2 are not the inverse of c1 and c2 directly). In this case, the response is particularly troublesome at around 8 kHz, and at very low and very high frequencies. This is due partly to the ears' response (pinna etc.), the speaker response, and the anti-aliasing filters used in the recording of the HRTF responses, respectively. To alleviate this problem a technique known as 'frequency dependent regularisation' has been developed (Kirkby et al., 1999). As the peaks in the crosstalk cancellation filters are due to the filter inversion at a particular frequency, making the inversion 'sub-optimal' at these frequencies will flatten out the response at these points. The crosstalk cancellation equations using frequency dependent regularisation are given in Equation (3.15) (all transfer functions have been converted into the frequency domain).

\( h_1 = c_1 \times \left(\frac{c_1^2 + c_2^2}{c_1^4 - c_2^4 + \varepsilon}\right) \qquad h_2 = -c_2 \times \left(\frac{c_1^2 + c_2^2}{c_1^4 - c_2^4 + \varepsilon}\right) \)   (3.15)

where: c1 & c2 are the transfer functions from Figure 3.36, h1 & h2 are the transfer functions used in Figure 3.34, and ε is the frequency dependent regularisation parameter (0 – full inversion, 1 – no inversion).
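As a minimal Matlab-style sketch of Equations (3.14) and (3.15) (variable names are illustrative, and the cyclic shift used at the end to make the filters causal is an implementation choice rather than part of the equations), the regularised filters can be computed in the frequency domain as follows:

    % Regularised crosstalk cancellation filters (Equations 3.14 and 3.15).
    % c1, c2: near- and far-ear impulse responses (column vectors);
    % eps_reg: regularisation profile per frequency bin (0 = full inversion,
    % 1 = no inversion). All names and values are assumptions.
    N   = 1024;
    C1  = fft(c1(:), N);
    C2  = fft(c2(:), N);
    den = C1.^4 - C2.^4 + eps_reg(:);        % regularised denominator
    H1  =  C1 .* (C1.^2 + C2.^2) ./ den;     % same-side filter
    H2  = -C2 .* (C1.^2 + C2.^2) ./ den;     % cross-side filter
    h1  = circshift(real(ifft(H1)), N/2);    % back to time domain, made causal
    h2  = circshift(real(ifft(H2)), N/2);

Setting eps_reg to zero everywhere gives the full inversion of Equation (3.14); raising it over selected frequency ranges gives the tailored, clipping-safe behaviour described next.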
Figure 3.39 shows the effect on the frequency response of the two crosstalk cancellation filters of using a regularisation parameter of 1 above 18 kHz. If the responses of c1 and c2 are observed (from Figure 3.38) it can be seen that having a regularisation parameter of 1 actually causes the resulting crosstalk cancellation filters to be the convolution of c1 and c2, which is why the high frequency roll-off is actually steeper in h1 and h2 than in c1 and c2.

Figure 3.39: The regularisation parameter (left figure) and its effect on the frequency response of the crosstalk cancellation filters h1 & h2 (right figure).

Using this regularisation parameter, the response of the system can be tailored so that clipping is avoided, at the expense of sub-optimal cancellation at these frequencies. Figure 3.40 shows the crosstalk cancellation of a pulse emitted from the left channel, both with and without regularisation applied; it shows the corresponding speaker feeds after the crosstalk cancellation filters have been applied, together with a simulation of the signals received by a listener.

Figure 3.40: Simulation of crosstalk cancellation using a unit pulse from the left channel, both with and without frequency dependent regularisation applied (as in Figure 3.39).

Assuming that any value greater than one will cause clipping of the signal, it can be clearly seen that when regularisation is applied to the crosstalk cancellation filters the system outputs much lower signals while still maintaining almost the same signal level at the ears of the listener (it must be noted that in this simulation the same HRTF data was used for both the simulation and the calculation of the crosstalk cancellation filters, and this will not be true in a real-life situation).

Apart from the frequency dependent regularisation parameter introduced above, much of the theory behind Transaural sound reproduction has not changed since its invention in 1962 (Atal, 1966). However, spacing the speakers as a standard stereo pair meant that the sweet spot (the area where crosstalk cancellation occurs) was small and very susceptible to errors due to head movement. To combat this, researchers at Southampton University discovered that this problem and, to a certain extent, that of excessive signal colouration, could be alleviated by moving the speakers closer together, to span around 10°. If a small speaker span is used then the area of successful crosstalk cancellation becomes larger, as a line of crosstalk cancellation is created. This means that the position of the listener with respect to the distance from the loudspeakers is not so important, making the system more robust. Also, to demonstrate the signal colouration changes we will again consider the system shown in Figure 3.36. As the angular separation of the speakers becomes smaller, the more similar the transfer functions between each ear and the speakers become (particularly at low frequencies) and, hence, the greater the amplitude of the cancellation filters at these frequencies. This means that the angular separation of the speakers is limited by the amount of boost that must be applied to the low frequencies of the system (assuming regularisation is not used). An example of filters taking into account the HRTF of the listener is shown in Figure 3.42.
This, to some extent, shows the ‘swings and roundabouts’ situation that can occur when dealing with the speaker placement of a Transaural system. Moving the speakers closer together makes for a more robust system, and moves much of the sound colouration into a higher frequency range, but creates a wider range of bass boost, which speakers generally find more difficult to recreate. Optimisation of this technique to alleviate some of these problems will be discussed in Chapter 5. - 92 - Chapter 3 Figure 3.41 Example of the effect of changing the angular separation of a pair of speakers used for crosstalk cancellation. - 93 - Chapter 3 Figure 3.42 Example of the effect of changing the angular separation of the speakers using HRTF data. 3.3.6 Ambiophonics The methods for recreating surround sound described above cover the current state of the art; however, there are now a number of emerging techniques that combine the virtues of more than one of these techniques in order to improve upon the usefulness of any one of these theories. Such a system is Ambiophonics (Glasgal, 2001). Ambiophonics differs from most of the systems described above as it does not attempt to be a general solution; that is, it is only designed for the listening of recorded material in a concert hall. It tries to recreate the ‘I am there’ situation. Ambiophonics is really a hybrid of binaural/transaural reproduction coupled with a more psychoacoustically correct reverb algorithm, so as to fool the ear/brain system into thinking that it is immersed within a real hall. However, this is also, to a certain extent, the remit for the Ambisonics system, so what are the main differences? The main difference is that Ambisonics uses a generic panning law so as to give equal priority (or localisation quality) to every direction, whereas Ambiophonics always assumes that the stage is in front of the listener and the ambience will be all around the listener. Therefore Ambisonics is a much more general surround sound solution, whereas Ambiophonics is limited in this way. However, due to this limiting factor a number of issues can be addressed. The front stage signal is recorded using (ideally) a pinna-less dummy head microphone (however, any stereo recording method will work, to some extent (Glasgal, 2001)). Also, it is a good idea to limit the amount of rear/side reflections that reach these microphones (which is normally done for stereo recordings, anyway, in order to avoid a - 94 - Chapter 3 recording that is too reverberant (Glasgal, 2003c)). Limiting the rear and side reflections picked up by this stereo recording is necessary due to the fact that these signals will be generated using convolution during the decoding stage. This stereo signal can then be replayed using a crosstalk cancellation system such as the system described in section 3.3.5. The surrounding ambience is then created and distributed using a number of speakers surrounding the listener. The main innovation here is that each speaker represents an early reflection direction. This means that, as these early reflections are being emitted from an actual source (rather than a panned position), all of the psychoacoustic cues associated with the angular directional aspect of these reflections will be absolutely correct, including the pinna cues, which are almost impossible to replicate using any other system (except Wavefield Synthesis). A typical layout for such a system is shown in Figure 3.43. Figure 3.43 Example Ambiophonics layout. 
As the crosstalk cancelled pair of speakers (typically set at ±5°, which means that multiple listeners seated in a line can experience the system) is reproducing the frontal hemisphere of the concert hall, fewer speakers are needed in front of the listener. The surround speakers are then fed with the stereo signal convolved with a stereo pair of impulse responses which contain no direct sound, a number of discrete reflections (one or more) and a diffuse tail that is uncorrelated compared to the other speakers. The speakers need not be in an exact position as no exact inter-speaker imagery is to be taken into account; in fact, repositioning the speakers until the most desirable response is found is a good technique for the creation of the best sounding concert hall. Using the Ambiophonics technique many of the cues needed for the localisation of sound and the perception of a real space are met, with particular attention paid to the accuracy of the reverberation. That is not to say that the system must sound exactly like a real hall, but that the auditory cues present in the reverberation of the material are psychoacoustically very accurate and will sound like a realistic hall.

3.4 Summary

In this chapter, a number of techniques for the recording and reproduction of spatial sound have been investigated and discussed. It must be noted that the most popular panning algorithm, as far as the ITU 5 speaker layout is concerned, is a version of the V.B.A.P. algorithm, or pair-wise panned system. This method can work very well for frontal sources. However, at the sides of the listener, it has been shown (Gerzon, 1985) that pair-wise panning does not work correctly, with the ear/brain system finding it very difficult to decode such a system. This causes 'holes' in the recreated sound field, which is not too detrimental for film material, the medium this layout was designed for (as most material will come from the front, with occasional effects or ambience using the rear speakers). Also, it is not a particularly well defined system in that there is no agreed technique for the recording of pair-wise panned material, and recording for the ITU 5 speaker layout is quite often based upon extended Decca Tree arrangements (Theile, 2001) for a number of reasons:

• The decorrelation of low frequency components is thought to be very important in the perception of spaciousness in a sound field. Spacing the microphones that feed the array almost guarantees this decorrelation.
• The precedence effect can only be simulated using spaced microphone techniques. This is not to say that coincident microphone techniques do not encode phase information (see Chapter 3); they just cannot represent time of arrival differences correctly, as the microphone picks up sound from one point in space (theoretically).

However, these techniques do not lend themselves well to different speaker arrangements (that is, they are not hierarchically based formats), and now, as the media and technology for multi-channel sound reproduction become more readily available, the industry is starting to realise that it does not want to re-record/remix an album every time a new speaker layout is presented to it. For this reason this research focuses on the Ambisonics system, which is the only hierarchical system defined at this moment in time (although MPEG-4 is now being specified to address this, to some extent (MIT Media Lab, 2000)).
If the Ambisonics hierarchical system is used as a carrier format (in its 1st, 2nd or higher order variants) then the system can be decoded for any multi-speaker system. However, a number of limitations are currently present when using this system:

• Although Gerzon and Barton (1992) suggested a number of optimisation equations for use with irregular speaker arrangements, the equations are difficult to solve, and so no further research seems to have been published in this area giving optimal coefficients for use with the standard ITU five speaker layout.
• Although a method of converting Ambisonics and five speaker ITU surround sound to binaural reproduction has been suggested by McKeag & McGrath (1996 & 1997 respectively), no work has been carried out on the optimisation of these multi-speaker systems in order to reproduce the correct psychoacoustic cues at the ears of the listener. This has been shown to be a trivial optimisation for a regular speaker array, but will rely on the work mentioned in the point above for the optimal auralisation of material if distributed on a medium carrying the standard 5.1 channels as specified by the ITU standard.
• Only a handful of software utilities for the encoding and decoding of Ambisonic material is available (McGriffy, 2002), and no psychoacoustically correct decoding software for irregular arrays exists.

These current limitations will be addressed in the following chapters of this thesis.

Chapter 4 - Development of a Hierarchical Surround Sound Format

4.1 Introduction

Although many surround sound decoding techniques are available, a number of problems are evident. For the majority of multi-speaker presentations, the material is composed specifically for a particular speaker layout, and Binaural/Transaural systems suffer from this same, inherent, problem. This does not, of course, create a problem initially, but as soon as the speaker layout becomes obsolete, or a Binaural or Transaural production needs to be replayed on a multi-speaker platform, a complete reworking of the sound piece is needed. For these reasons, this chapter will concentrate on the description of a hierarchical surround sound format, based on an amalgamation of currently available systems, in order to maximise the number of replay situations that the system is capable of satisfying. The benefits of this system are:

• The created piece will be much more portable in that, as long as a decoder is available, many different speaker layouts can be used.
• The recordings will become more future-proof as, if a speaker layout changes, just a re-decode is needed, rather than a whole remix of the piece.
• The composition/recording/monitoring of the piece will become more flexible as headphones, or just a few speakers, can be used. This will result in less space being needed, which is particularly useful for on-location recordings, or small studios, where space may be limited.

4.2 Description of System

Such a system can be described diagrammatically as shown in Figure 4.1.

Figure 4.1: Ideal surround sound encoding/decoding scheme (recorded/panned signals pass through an encoding block to an n-channel carrier, which can be manipulated (rotations etc.) and then decoded to an n-speaker array, a 2-speaker transaural decoder, or a 2-channel binaural decoder).

As can be seen in Figure 4.1, this ideal surround sound system should conform to the following criteria in order to maximise its flexibility and usefulness:
• A hierarchical carrier signal should be used. That is, a carrier system should be able to be understated (channels ignored, reducing localisation accuracy) or overstated (extra channels added later, increasing localisation accuracy).
• The encoded signal should be able to be manipulated after encoding, i.e. rotations about the x, y and z axes etc.
• The encoded signal should be able to be easily replayed over multiple listening situations including:
  o A number of different speaker arrangements, as almost no-one can place their speakers in the ITU or future speaker positions.
  o Over headphones.
  o Over a standard stereo pair (and other placement widths) of speakers.
• Efficient means of transferring from the carrier to one of the above systems.

If we take the current 'state of the art' surround standard as an example, and try to apply the above criteria to it, a number of shortcomings can be observed. In Dolby Digital 5.1, the carrier signal is six discrete channels, each one representing a speaker signal directly. Each speaker is assumed to be at the locations specified in the ITU standard, as shown in Figure 4.2.

Figure 4.2: Standard speaker layout as specified in the ITU standard (C, L, R, SL and SR, with 60° between L and R, 80° between L/R and SL/SR, and 140° between SL and SR).

To listen to this system over headphones is not a difficult task and has been achieved by a number of companies (Mackerson et al., 1999; McKeag & McGrath, 1997). It is achieved by binaurally simulating the speakers using HRTF data, and replaying the resulting two channels over headphones. As discussed in Chapter 3, the binaural reproduction of surround sound material needs to contain some form of psychoacoustically tangible reverb if a realistic, out-of-head experience is to be delivered. When auralising 5.1 surround, two approaches can be taken. The first approach assumes that the 5.1 surround system is trying to simulate an acoustic space, where each speaker can be rendered using a pair of anechoic HRTFs, normally between 128 and 1024 samples in length. This approach will rely on the 5.1 decode to supply the ear/brain system with the appropriate reverberation, and is the most computationally efficient solution. However, the qualities and amount of the reverberation used on each recording may be psychoacoustically confusing and, therefore, not convincing enough to promote the out-of-head imaging possible with the binaural approach. The better approach (and the one used by Lake (McKeag & McGrath, 1997) and Studer (Mackerson et al., 1999)) is where the speakers are simulated in a 'good' listening room, that is, each speaker will have its own reverb associated with it, on top of anything that is already recorded within the surround sound material. This can be done in one of two ways:

• Simulate the individual speakers using a pair of head related transfer functions per speaker, and then simulate the listening room using a binaural reverb algorithm (perhaps using discrete first order room reflections, again with a pair of HRTFs per reflection, followed by a short, diffuse tail).
• Simulate the individual speakers and room together using a much longer pair of head related transfer functions per speaker.

The decision of which of the two approaches to use is really a question of the processing power available. The difference in efficiency between the two methods can be quite high depending on the implementation used. Ideally the second method would be used, as this would provide a closer match to a real environment, thereby maximising the performance of the binaural decode.
This method has been shown to work very well, especially when carried out with head-tracking (Mackerson et al., 1999), although a good interpolation algorithm is then needed to stop the creation of clicks and pops due to the changing filter structures (in fact, the development and implementation of interpolation algorithms can be the most time consuming part of such a piece of professional audio hardware). Once the binaural version has been created it is then a relatively easy task to convert this recording for a 2 speaker, transaural reproduction by using a 2 x 2 matrix of correctly designed crosstalk cancellation filters.

However, what if the (real) speakers were not placed in the correct, ITU specified, positions in the listening room? Calculating new speaker feeds for a system that is defined by discrete channels is not necessarily an easy task (Gerzon, 1992a) when the encoding system cannot necessarily be assumed to be simple pair-wise panning. A better technique would be to use Ambisonic B-format, or similar, to drive the system, or at least to use a standard B-format decoding algorithm to derive the 6 discrete channels on a DVD and then, if desired, work out the B-format signals from these speaker feeds. Using a hierarchical carrier, such as B-format, would result in the advantages given at the start of this section.

For example, if we were to take horizontal only B-format as the carrier signal, then decoding this B-format carrier for the various different presentation methods can be carried out as shown in Equation (4.1) (it should be noted that this is a sub-optimal decoder, but this will be discussed in Chapter 5).

\( S_n = \left(\sqrt{2} \times W\right) + \left(\cos(\theta_n)\cos(\phi_n) \times X\right) + \left(\sin(\theta_n)\cos(\phi_n) \times Y\right) + \left(\sin(\phi_n) \times Z\right) \)   (4.1)

where Sn is the signal sent to the nth speaker positioned at azimuth θn and elevation φn.

This simple decoding would produce the virtual microphone configuration shown in Figure 4.3.

Figure 4.3: Virtual microphone configuration for simple Ambisonic decoding.

4.3 B-Format to Binaural Reproduction

All multi-speaker formats can be converted to a binaural signal, but B-format to binaural conversion can be achieved very efficiently due to its hierarchical nature. The system can be summarised as shown in Figure 4.4.

Figure 4.4: Horizontal B-format to binaural conversion process (W, X and Y feed an Ambisonic decoder and HRTF simulation, producing left ear and right ear signals).

As the system takes in 3 channels of audio and outputs two channels of audio, the actual Ambisonic decoding process can be contained within a pair of HRTFs representing each of W, X and Y. This means that any number of speakers can be simulated using just six HRTFs (three pairs). The equations describing this process for an eight speaker array are given in Equation (4.2).

\( W^{hrtf} = \sqrt{2} \times \sum_{k=1}^{8} S_k^{hrtf} \)
\( X^{hrtf} = \sum_{k=1}^{8} \left(\cos(\theta_k)\cos(\phi_k) \times S_k^{hrtf}\right) \)
\( Y^{hrtf} = \sum_{k=1}^{8} \left(\sin(\theta_k)\cos(\phi_k) \times S_k^{hrtf}\right) \)
\( Z^{hrtf} = \sum_{k=1}^{8} \left(\sin(\phi_k) \times S_k^{hrtf}\right) \)   (4.2)

Where: θk = speaker azimuth, φk = speaker elevation (0 for horizontal only), and Skhrtf = the pair of HRTFs measured at speaker position k.

The signals then required to be fed to each ear are given in Equation (4.3).

\( Left = \left(W \otimes W_L^{hrtf}\right) + \left(X \otimes X_L^{hrtf}\right) + \left(Y \otimes Y_L^{hrtf}\right) \)
\( Right = \left(W \otimes W_R^{hrtf}\right) + \left(X \otimes X_R^{hrtf}\right) + \left(Y \otimes Y_R^{hrtf}\right) \)   (4.3)
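A minimal Matlab-style sketch of Equations (4.2) and (4.3) is given below. It assumes a horizontal eight speaker rig (so the elevation terms equal one), that hrtf_L and hrtf_R hold one measured HRTF per speaker as columns, that spk_az holds the eight speaker azimuths in radians, and that W, X and Y are the B-format signals as column vectors; all names are illustrative rather than taken from the thesis software:

    % Build the three pairs of 'B-format HRTFs' (Equation 4.2).
    Wh_L = sqrt(2) * sum(hrtf_L, 2);    Wh_R = sqrt(2) * sum(hrtf_R, 2);
    Xh_L = hrtf_L * cos(spk_az(:));     Xh_R = hrtf_R * cos(spk_az(:));
    Yh_L = hrtf_L * sin(spk_az(:));     Yh_R = hrtf_R * sin(spk_az(:));

    % Two ear signals from six convolutions (Equation 4.3).
    left  = conv(W, Wh_L) + conv(X, Xh_L) + conv(Y, Yh_L);
    right = conv(W, Wh_R) + conv(X, Xh_R) + conv(Y, Yh_R);

However many speakers are simulated, the per-sample cost stays fixed at these six (or, as shown next, three) convolutions, which is the efficiency gain referred to above.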
Another optimisation that can be applied is that of assuming a left/right symmetrical room. For example, if the B-format HRTFs shown in Figure 4.5 are studied, it can be seen that both the left and right W HRTFs are the same, the left and right X HRTFs are the same, and the left and right Y HRTFs are the same, but phase inverted. So, in this symmetrical case, only three HRTFs are needed to simulate a multi-speaker Ambisonic system, with the new Left and Right ear feeds given in Equation (4.4).

\( Left = \left(W \otimes W^{hrtf}\right) + \left(X \otimes X^{hrtf}\right) + \left(Y \otimes Y^{hrtf}\right) \)
\( Right = \left(W \otimes W^{hrtf}\right) + \left(X \otimes X^{hrtf}\right) - \left(Y \otimes Y^{hrtf}\right) \)   (4.4)

Figure 4.5: Example W, X and Y HRTFs assuming a symmetrical room.

As can be seen from Equation (4.4), a symmetrical room will result in a total of three convolutions to be computed, as opposed to six for an unsymmetrical room, resulting in a 50% processing time saving (and, incidentally, this compares very favourably to the ten convolutions needed to auralise a standard five speaker system when not driven by B-format). Once the material has been 'binauralised', a two speaker Transaural presentation can then be created with the use of standard crosstalk cancellation filters.

For a four speaker configuration two options are available:

• If the speakers are arranged in a near square formation as shown in Figure 4.6, then the B-format signal can be decoded Ambisonically to feed these four speakers.
• If the speakers are arranged so that they are placed close together (e.g. either side of a computer monitor) as shown in Figure 4.7, then a double crosstalk cancellation system would be best suited.

Both options can be utilised for most four speaker configurations; these two figures (Figure 4.6 and Figure 4.7) just show the ideal setup for each system. The system chosen would be dependent upon the listening situation and the processing power available. A four speaker crosstalk cancellation system has the advantage over a two speaker crosstalk cancellation system in that both front and rear hemispheres can be reproduced, creating a more accurate, enveloping sound with much less noticeable front/back ambiguity, particularly if the speakers are arranged in a manner similar to Figure 4.7. This system, however, although delivering much better results than frontal crosstalk cancellation alone, is, potentially, the most processor intensive of all of the reproduction methods described in this report (although it will be shown, in Chapter 6, that this is not always the case). It can be seen from the block diagram shown in Figure 4.8 that this method of reproduction will require twice as many FIR filters as frontal crosstalk cancellation alone.

Figure 4.6: Ideal, 4-speaker, Ambisonic layout.

Figure 4.7: Ideal double crosstalk cancellation speaker layout.

Figure 4.8: Double crosstalk cancellation system (W, X and Y feed front and rear Ambisonic decoder/HRTF simulations (3 FIRs each), each followed by a crosstalk cancellation stage (4 FIRs each), feeding the front left/right and rear left/right speakers).

The dual crosstalk cancelling system described by Figure 4.8, or the two speaker crosstalk cancellation system, can be made more efficient by changing the length of a number of the FIR filters when converting the B-format carrier to the binaural signal since, as was mentioned above, non-anechoic HRTFs were utilised in order to help sound externalisation.
When replaying binaural material over a crosstalk cancellation system, this is not necessary, as the sound will normally be perceived at a distance equal to the distance of the speakers. This can be observed by playing unprocessed, stereo material over a crosstalk cancelled system. In such a situation the sounds are perceived as coming from a hemisphere around the front of the listener, as shown in Figure 4.9. Therefore, longer HRTFs that include some form of room response are not needed during the B-format to binaural conversion stage (as out of head localisation is already present), reducing the size of the HRTFs from over 8192 points to less than 1024, as shown in Figure 4.10, and making B-format to Transaural conversion in real-time a viable option for most modern processors.

Figure 4.9: Perceived localisation hemisphere (from panned full left to panned full right) when replaying stereophonic material over a crosstalk cancelled speaker pair.

The four-speaker transaural system is particularly well suited to this type of speaker simulation system, as standard binaural material (that is, recorded as two channels) cannot successfully be replayed on a four speaker Transaural system. It is obvious that once a binaural recording has been made, it can be played back over both the front and rear pairs of a four speaker, crosstalk cancellation system, but it is then up to the listener's ear/brain system to decide which sounds are coming from the front or the back, as the same signal must be replayed from both crosstalk cancelling pairs, unless a 'four ear' dummy head recording is used. This gives many conflicting cues due to the imperfect manner in which a Transaural system's crosstalk cancellation occurs. However, using the system mentioned above, total separation of the front and rear hemispheres' audio is possible, resulting in a much less ambiguous listening situation, where the best possible use of each pair of speakers can be realised.

Figure 4.10: Example of anechoic and non-anechoic (reverberant) HRTFs at a position of 30° from the listener (sampled at 44.1 kHz).

All of the above equations assume that the carrier signal for this hierarchical system is first order B-format. However, as DVD players already expect to see six channels, this is not the best use of the already available outputs. Ideally, a 2nd order Ambisonic carrier would be used.

Figure 4.11: Spherical harmonics up to the 2nd order (W, X, Y, Z, R, S, T, U and V).

Second order Ambisonics, as mentioned in Chapter 3, would consist of nine channels to fully represent the three dimensional sound field: the four channels of 1st order B-format, plus another five channels representing the sound field's 2nd order components (as shown in Figure 4.11). The use of these extra harmonics increases the directionality of the virtual pickup patterns that can be constructed by combining the signals in various proportions. Figure 4.12 shows the difference between a 1st and 2nd order virtual polar pattern. At the present time, the ITU standard specifies 6 full bandwidth audio channels (note that even the .1 channel is actually stored as full bandwidth on the DVD Audio and Super Audio CD disks), and so a standard that uses a maximum of six channels would be preferable.
Figure 4.12: 2D polar graph showing an example of a 1st and 2nd order virtual pickup pattern (0° point source decoded to a 360 speaker array).

The most logical way of achieving this is by specifying the horizontal plane to 2nd order resolution and the vertical plane to 1st order, resulting in a total of 6 channels (W, X, Y, Z, U & V), where most people with a horizontal five speaker, or less, system would utilise channels W, X and Y. Systems with height capability would use the Z channel, and users with a higher number of speakers on the horizontal plane would also use the U and V signals. This six channel system has the advantage that the best possible resolution can be achieved on the horizontal plane (i.e. 2nd order). While the equations for tumbling and tilting the sound field will now only be fully utilisable when using the first order signals, rotating will still function, as only the horizontal Ambisonic channels are altered.
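As a small illustration of this last point, rotating the six-channel carrier about the vertical axis only requires the horizontal components to be recombined. The following Matlab-style sketch assumes that U and V are the cos 2θ and sin 2θ weighted second order components, that a is the rotation angle in radians, and that the sign convention matches the required direction of rotation (all names are illustrative):

    % Rotate a W, X, Y, Z, U, V carrier about the vertical axis by angle a.
    % First order components rotate by a, second order components by 2a;
    % W and Z are unaffected by a rotation about this axis.
    Xr = X*cos(a)   - Y*sin(a);
    Yr = X*sin(a)   + Y*cos(a);
    Ur = U*cos(2*a) - V*sin(2*a);
    Vr = U*sin(2*a) + V*cos(2*a);
    Wr = W;  Zr = Z;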
4.4 Conclusions

With the use of three existing systems, a system has been proposed that overcomes the weaknesses of the individual systems in isolation. This system has the benefit of future-proofing in terms of speaker layout and can be decoded to headphones or two or more speakers whilst still retaining spatial information. Basic algorithms for the conversion processes have been described and will be analysed, discussed and optimised in Chapter 5.

Chapter 5 - Surround Sound Optimisation Techniques

5.1 Introduction

In this chapter a number of optimisation methods will be discussed and demonstrated so as to maximise the performance of the hierarchical system discussed in Chapter 4. A large part of this research was based upon the use of HRTF data collected by Gardner & Martin (1994), which was used in order to help quantify and optimise the various decoding stages that are present in the proposed hierarchical system. The research was carried out in a number of stages, which also correspond to the layout of this chapter, as detailed below:

• Investigation into the use of HRTF data in the analysis of multi-channel sound reproduction algorithms.
• Optimisation of the Ambisonics decoding signal processing techniques.
• Optimisation of the binaural decoding signal processing techniques.
• Optimisation of the Transaural decoding signal processing techniques.

To this end, the first part of this investigation, documented in section 5.2, was to carry out a listening test, using the Multi-Channel Research Lab designed and installed as part of this research (Schillebeeckx et al., 2001), to try and measure the potential strengths and weaknesses of the proposed HRTF analysis technique. As the listening tests were executed before the research into the Ambisonic optimisation methods was carried out, sub-optimal Ambisonic decodes were used in these tests. Also, as work had only just begun on the Transaural processing techniques, and due to the extremely sub-optimal performance of the filters designed at that point, this work is not included.

Section 5.3 represents the bulk of this chapter, and concentrates on the optimisation of the Ambisonics system, as this is the base system from which the binaural and transaural representations will be derived. Although it would be preferable to always derive the binaural/transaural feeds from the original B-format (or higher order) carrier, due to the standards used in current consumer and professional audio equipment (i.e. 5, 6 or 7 channel presentation for a 5, 6 or 7 speaker, irregular array) it is necessary to realise optimised Ambisonic decoders for irregular arrays, not only to maximise the performance of the speaker decode, but also to make sure that the correct psychoacoustic cues are presented to a listener after this irregular decode is converted to a binaural or transaural reproduction.

The original optimisation, as proposed by Gerzon & Barton (1992), is an extension of the original Ambisonic energy and velocity vector theory used to optimise regular decoders (Gerzon, 1977a), but with the added suggestion of using one decoder for low frequencies and another for high frequencies. However, although Gerzon and Barton (1992) did solve these equations for a number of irregular speaker arrays, none of the arrays were similar to the ITU standard array that was finally proposed. No decoders optimised in this way have ever been produced for the ITU standard speaker array since that time, as was evident in the recent Project Verdi listening tests (Multi Media Projekt Verdi, 2002). The equations, a set of non-linear simultaneous equations, were difficult to solve, and only got more difficult when more speakers were added (Gerzon & Barton, 1992). For this reason one of the main aims of this work was to devise a system so that Ambisonic decoders for irregular speaker arrays could be easily designed via some form of automated system. After this was successfully implemented, the analysis method suggested in earlier work (see Wiggins et al., 2001) was used as the basis of a new optimisation criterion for irregular Ambisonic decoders. As no method of differentiating between decoders optimised using the energy/velocity vector model currently exists (there are multiple solutions), this new method could then be used to differentiate between already designed velocity/energy vector decoders.

Section 5.4 documents the work carried out on both Binaural and Transaural reproduction techniques. The work on binaural reproduction is used as an introduction to inverse filtering techniques, which are then applied to the Transaural reproduction system in order to improve its performance using the freely available HRTF data from MIT Media Lab (Gardner & Martin, 1994).

5.2 The Analysis of Multi-channel Sound Reproduction Algorithms Using HRTF Data

5.2.1 The Analysis of Surround Sound Systems

Much research has been carried out into the performance of multi-channel sound reproduction algorithms, both subjectively and objectively. Much of the quantitative data available on the subject has been calculated by mathematically simulating acoustical waves emitting from a number of fixed sources (speakers) (Bamford, 1995) or by using mathematical functions that give an indication of the signals reaching the listener (Gerzon, 1992b). The resulting sound field can then be observed. In this section of Chapter 5, a new method of analysis will be described using Head Related Transfer Functions as a reference for the localisation cues needed to successfully localise a sound in space. This method will then be compared to results obtained from a listening test carried out at the University of Derby's Multi-Channel Sound Research Laboratory.

5.2.2 Analysis Using HRTF Data

The underlying theory behind this method of analysis is that of simple comparison.
If a real source travels through 360° around the head (horizontally) and the sound pressure level at both ears is recorded, then the three widely accepted psychoacoustic localisation cues (Gulick et al., 1989; Rossing, 1990) can be observed. These consist of: the time difference between the sounds arriving at each ear due to different path lengths; the level difference between the sounds arriving at each ear due to different path lengths and body shadowing; and pinna filtering, a combination of complex level and time differences due to the listener's own pinna and body. The most accurate way to analyse and/or reproduce these cues is with the use of Head Related Transfer Functions.

For the purpose of this analysis technique, the binaural synthesis of virtual sound sources is taken as the reference system, as the impulse responses used for this system are of real sources in real locations. The HRTF set used does not necessarily need to be optimal for all listeners (which can be an issue for binaural listening) so long as all of the various localisation cues can be easily identified. This is the case because this form of analysis compares the difference between real and virtual sources and, as all systems will be synthesised using the same set of HRTFs, their performance when compared to another set of HRTFs should not be of great importance. Once the system has been synthesised using HRTFs, impulse responses can be calculated for virtual sources from any angle so long as the panning laws for the system to be tested are known. Once these impulse responses have been created the three parameters used for localisation can be viewed and compared, with estimations made as to how well a particular system is able to produce accurate virtual images. Advantages of this technique include:

• All forms of multi-channel sound can potentially be analysed meaningfully using this technique.
• Direct comparisons can be made between very different multi-channel systems as long as the HRTFs used to analyse the systems are the same.
• Systems can be auditioned over headphones.

5.2.3 Listening Tests

In order to have a set of results to use as a comparison for this form of analysis, a listening test was carried out. The listening test comprised a set of ten tests for five different forms of surround sound:

• 1st Order Ambisonics over 8 speakers (horizontal only)
• 2nd Order Ambisonics over 8 speakers (horizontal only)
• 1st Order Ambisonics over a standard 5 speaker layout
• Amplitude panned over a standard 5 speaker layout
• Transaural reproduction using two speakers at ±5°

The tests were carried out in the University of Derby's Multi-Channel Sound Research Laboratory with the speakers arranged as shown in Figure 5.1.
A flexible framework was devised using Matlab and Simulink (The Mathworks, 2003) so that listening test variables could be changed with minimal effort, with the added bonus that the framework would be reusable for future tests. A Simulink ‘template’ file was created for each of the five systems that could take variables from the Matlab workspace, such as input signal, overall gain and panning angle, as shown in Figure 5.2. Then a GUI was created where all of the variables could be entered and the individual tests run. A screen shot of the final GUI is shown in Figure 5.3. - 115 - Chapter 5 Figure 5.2 Screen shot of two Simulink models used in the listening tests. Figure 5.3 Screen shot of listening test GUI. The overall gain parameter was included so each of the different systems could be configured to have a similar subjective gain, with the angle of the virtual source specified in degrees. The only exception to this was the 5.0 Amplitude panned system where the speaker feeds were calculated off line using the Mixtreme soundcards internal mixing feature. The extra parameter (tick box) in the Stereo Dipole (transaural) section was used to indicate which side of the listener the virtual source would be placed as the HRTF set used (Gardner & Martin, 1994) only had impulse responses for the right hemisphere and must be reversed in order to simulate sounds originating from the left (indicated by a tick). - 116 - Chapter 5 After consulting papers documenting listening tests of various multi-channel sound systems, it was found that noise (band-limited and wide-band) was often used as a testing source (see Moller et al., 1999, Kahana et al., 1997, Nielsen, 1991 Orduna et al., 1995 and Zacharov et al, 1999, as typical examples). The noise signals used in this test were band limited and pulsed, three pulses per signal, with each pulse lasting two seconds with one second of silence between each pulse. The pulsed noise was chosen as it was more easily localised in the listening room when compared to steady state noise. Each signal was band limited according to one of the three localisation frequency ranges taken from two texts (Gulick et al., 1989; Rossing, 1990). These frequencies are not to be taken as absolutes, just a starting point for this line of research. A plot of the frequency ranges for each of the three signals is shown in Figure 5.4. Figure 5.4 Filters used for listening test signals. Twenty eight test subjects were used, most of whom had never taken part in a listening test before. The test subjects were all enrolled on the 3rd year of the University’s Music Technology and Audio System Design course, and so knew the theory behind some surround sound systems, but had little or no listening experience of the systems at this point. Each listener was asked to try to move their head as little as possible while listening (i.e. don’t face the source), and to indicate the direction of the source by writing the angle, in degrees, on an answer paper provided. It must be noted that the head of the listeners were not fixed and so small head movements would have been - 117 - Chapter 5 available to the listeners as a potential localisation cue (as it would be when listening anyway). Listeners could ask to hear a signal again if they needed to, and the operator only started the next signal after an answer had been recorded. 
The listeners were given a sheet of paper to help them with angle locations, with all of the speaker positions marked in a similar fashion to Figure 5.5 (although the sheet presented to the test subjects was labelled in 5° intervals with a tick size of 1°, not 15° intervals with a tick size of 3° as shown in Figure 5.5).

Figure 5.5: Figure indicating the layout of the listening room (marked from 0° to 345° in 15° intervals) given to the testees as a guide to estimating source position.

5.2.4 HRTF Simulation

As described in section 5.1, three of the five systems will be analysed using the HRTF method described above:

• 1st Order Ambisonics
• 2nd Order Ambisonics
• 1st Order Ambisonics over 5 speakers

The listening test results for the amplitude panned 5 speaker system are also included. The set of HRTFs used for this analysis was the MIT Media Lab set of HRTFs, specifically the compact set (Gardner & Martin, 1994). As mentioned earlier, it is not necessarily important that these are not the best HRTF set available, just that all of the localisation cues are easily identifiable. All systems can be simulated binaurally, but Ambisonics is a slightly special case as it is a matrixed system comprising the steps shown in Figure 5.6.

Figure 5.6: The Ambisonic to binaural conversion process (W, X and Y feed an Ambisonic decoder and HRTF simulation, producing left ear and right ear signals).

Because the system takes in three channels which are decoded to eight speaker feeds, which are then decoded again to two channels, the intermediate decoding to eight speakers can be incorporated into the HRTFs calculated for W, X and Y, meaning that only six individual HRTFs are needed for any speaker arrangement, Equation (5.1). If the head is assumed to be symmetrical (which it is in the MIT set of compact HRTFs) then even fewer HRTFs are needed, as Wleft and Wright will be the same (Ambisonics' omnidirectional component), Xleft and Xright will be the same (Ambisonics' front/back component) and Yleft will be phase inverted with respect to Yright. This means a complete 1st order Ambisonic system comprising any number of speakers can be simulated using just three HRTF filters, as shown in Equation (5.1).

\( W^{hrtf} = \sqrt{2} \times \sum_{k=1}^{8} S_k^{hrtf} \)
\( X^{hrtf} = \sum_{k=1}^{8} \left(\cos(\theta_k)\cos(\phi_k) \times S_k^{hrtf}\right) \)
\( Y^{hrtf} = \sum_{k=1}^{8} \left(\sin(\theta_k)\cos(\phi_k) \times S_k^{hrtf}\right) \)   (5.1)

Where: θk = speaker azimuth, φk = speaker elevation (0 for horizontal only), and Skhrtf = the pair of HRTFs measured at speaker position k.

Once the HRTFs for W, X and Y are known, a virtual source can be simulated by using the first order Ambisonic encoding equations shown in Equation (5.2) (Malham, 1998).

\( W = \frac{1}{\sqrt{2}} \times x(n) \)
\( X = \cos(\theta)\cos(\phi) \times x(n) \)
\( Y = \sin(\theta)\cos(\phi) \times x(n) \)   (5.2)

Where x(n) is the signal to be placed in virtual space, at azimuth θ and elevation φ.

Using two sets of the W, X and Y HRTFs (one for the eight and one for the five speaker 1st order Ambisonics) and one set of W, X, Y, U and V HRTFs (Bamford, 1995; Furse, n.d.) for the 2nd order Ambisonics, sources were simulated from 0° to 360° in 5° intervals. The 5° interval was dictated by the HRTF set used since, although the speaker systems could now be simulated for any source angle, the real sources (used for comparison) could only be simulated at 5° intervals (without the need for interpolation). An example pair of HRTFs for a real and a virtual source is shown in Figure 5.7.
Figure 5.7: Example left and right HRTFs for a real and a virtual source (1st order Ambisonics) at 45° clockwise from centre front.

5.2.5 Impulse Response Analysis

As mentioned in Section 5.2.2, three localisation cues were analysed: interaural level difference, interaural time difference, and pinna filtering effects. The impulse responses contain all three of these cues together, meaning that, although a clear filter delay and level difference can be seen by inspection, the pinna filtering will make both the time and level differences frequency dependent. These three cues were extracted from the HRTF data using the following methods:

• Interaural Amplitude Difference – mean amplitude difference between the two ears, taken from an FFT of the impulse responses.
• Interaural Time Difference – mean time difference between the two ears, taken from the group delay of the impulse responses.
• Pinna filtering – actual time and amplitude values, taken from the group delay and an FFT of the impulse responses.

Once the various psychoacoustic cues had been separated, comparisons were made between the cues present in a multi-speaker decode and the cues of an actual source (i.e. the individual HRTFs), and estimations of where the sounds may appear to come from can be made using each of the localisation parameters in turn. As the analysis is carried out in the frequency domain, band limiting the results (to coincide with the source material used in the listening tests) is simply a case of ignoring any data that is outside the range to be tested.
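As a rough illustration of the first two measures listed above (one possible reading of them, with all variable names and the analysis band chosen purely for the example), the following Matlab-style sketch extracts a mean interaural amplitude difference and a mean interaural time difference from an impulse response pair hL and hR, so that a virtual-source pair can be compared against the real-source pair measured at the same angle:

    % Mean interaural cues from an impulse response pair (hL, hR).
    N      = 1024;  fs = 44100;
    hL = hL(:);  hR = hR(:);
    HL     = fft(hL, N);     HR = fft(hR, N);
    f      = (0:N/2-1).' * fs / N;                    % frequency axis (Hz)
    band   = f >= 500 & f <= 2000;                    % assumed analysis band
    ampdif = 20*log10(abs(HL(1:N/2)) ./ abs(HR(1:N/2)));
    ild    = mean(ampdif(band));                      % mean level difference (dB)
    gdL    = -diff(unwrap(angle(HL(1:N/2)))) ./ (2*pi*diff(f));   % group delay (s)
    gdR    = -diff(unwrap(angle(HR(1:N/2)))) ./ (2*pi*diff(f));
    itd    = mean(gdL(band(2:end)) - gdR(band(2:end)));          % mean time difference (s)

Band limiting the measures to match the test signals is then, as stated above, simply a matter of changing which frequency bins are included in the averages.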
It must also be noted that, due to the frequency ranges originally chosen, the interaural level differences at low frequencies are comparable to the interaural level differences at mid frequencies. Had a lower cut-off frequency been chosen (as shown later in this chapter) this would not have been the case, which suggests that the original frequency ranges were not ideal.

Figure 5.8  The average amplitude and time differences between the ears for the low, mid and high frequency ranges (plotted against source angle in degrees, for an actual source, 5.0 Ambisonics, eight speaker Ambisonics and 2nd order Ambisonics).

Figure 5.9  The difference in pinna amplitude filtering of a real source and of 1st and 2nd order Ambisonics (eight speaker) when compared to a real source.

One attribute that has not yet been touched on when discussing multi-speaker systems, and which is one of the major consequences of the phantom imaging scenario, is pinna cue errors. When an image is created with more than one speaker, although it is possible to create a correct level and phase difference at the ears of a listener for a panned source, it is far more difficult to create correct pinna cues, due to the direction dependent filtering that the pinnae apply to real sound sources. Instead, the pinna cues from the speakers creating the phantom image will be summed and weighted depending on the speakers' contributions. As everyone's pinnae are different, it is impossible to correct for this in a generic way (and even from an individual's response point of view, only one listener orientation could be corrected for, i.e. facing straight ahead). The pinna filtering can be clearly seen in the simulation, but it is a more complex attribute to analyse directly, although it has been useful to look at for a number of reasons. For example, if the non-averaged amplitude or group delay parameters are looked at over the full 360° (the non-averaged amplitude responses are shown in Figure 5.9) it can be seen that they both change radically with virtual source position (as does the response of a source in reality). However, the virtual sources change differently when compared to real sources. This change will also occur if the head is rotated (in the same way as a source moving for a regular rig, or in a slightly more complex way for an irregular five speaker set-up) and this could be part of the 'phasiness' parameter that Gerzon often mentioned in his papers regarding the problems of Ambisonics (Gerzon, 1992b).
This problem, however, is not strictly apparent as a timbral change (at least, not straight away) when a source or the listener's head moves, but instead probably just aids in confusing the brain as to the sound source's real location, increasing source location ambiguity and apparent source movement when the listener's head is turned. This parameter is more easily observed using an animated graph, but it is shown as a number of stills in Figure 5.9. These graphs show the differences between the three systems, which is why the 'real source' is just a 0 dB line, as it has no amplitude difference with itself.

Due to the complexity of the results obtained using the HRTF simulation for the pinna filtering, it is difficult to utilise these results in any estimation of localisation error, although further work will be carried out to make use of this information. However, using the average time and amplitude differences to estimate the perceived direction of the virtual sound source is a relatively trivial task using simple correlation between the actual and virtual sources. In order to plot these results, a Matlab routine was constructed that gave a localisation estimation using the HRTFs derived from the various decoders and compared these to the figures obtained from the real HRTFs. This was carried out for both amplitude and time differences in the various frequency bands tested. Because no pinna filtering effects were taken into account, each value of amplitude and time/phase difference will have two corresponding possible localisation angles (see the cone of confusion in chapter 2.2.1). Figure 5.10, Figure 5.11 and Figure 5.12 show the listening test results together with the estimated localisations, using the average amplitude and the average time differences at low and mid frequencies.

The listening tests themselves gave reasonably expected results as far as the best performing system was concerned (the 2nd Order Ambisonics system). However, the other three systems (1st order eight and five speaker, and amplitude panned 5.0) all seemed to perform equally well, which was not expected. It must be noted, though, that all of these listening tests were carried out using 'unoptimised' decoders, with only the five speaker irregular decoder having been empirically adjusted regarding the amplitude levels of the three speaker sets (centre, front pair and rear pair). Nevertheless, the empirically derived gain settings reasonably matched the optimised sets described later (a quiet centre speaker with additional gain applied to the rear pair), but with all speakers using a cardioid pattern feed. The speakers used for the eight and five speaker systems were different, but as all listeners had the speakers pointed directly at them, and were tested using band-limited noise, the frequency response and dispersion patterns of the speakers should not have been critical in this experiment. Also, the HRTF simulation and comparison should be valid as long as the speakers used within each system are matched (as opposed to the speakers across all systems being the same). The frequency content of the sounds did not seem to make any significant difference to the perceived localisation of the sound sources, although a more extensive test would have to be undertaken to confirm this, as the purpose of this test was to look for any large differences between the three localisation frequency ranges. Another interesting result was the virtual source at 0° on the amplitude panned system (see Figure 5.13).
As there is a centre front speaker, a virtual source at 0° just radiates from the centre speaker, i.e. it is a real source at 0°. However, around 30% of the subjects recorded that the source came from behind them. Front/back reversals were actually less common in all of the other systems (at 0°), apart from 2nd order Ambisonics (the system that performed best).

The source position estimation gave reasonably good results when compared with the results taken from the listening tests, with any trends above or below the diagonal (which represents a perfect score) being estimated successfully. If the graphs truly represented what is expected from the different types of psychoacoustic sound localisation, then the low frequency time graph and the mid frequency amplitude graph should be the best indicators of where the source is coming from. However, it is well known (Gulick et al., 1989) that if one localisation cue points to one direction, and the other cue points to another, then the sound may actually be perceived to originate from some direction between these two localisation angles. The HRTF analysis does not take this into account at the moment and so some error is expected. Also, the compact set of HRTFs used comprises the minimum phase versions of the actual HRTFs recorded, which may contribute to the time difference estimation results (although the cues seem reasonable when looked at for the actual sources). As mentioned, there was no major difference between the three different signals in terms of localisation error. Because of this, the plots showing the estimated localisation using the whole frequency range are shown in Figure 5.14 – Figure 5.16, which also show the interaural amplitude difference to be the better localisation approximation.

5.2.6 Summary

The HRTF analysis of the three surround sound systems described in this section seems to work well, giving a reasonably good indication of the localisation that a listener will attach to a sound object. This method is definitely worth pursuing as a technique that can be used to evaluate and compare all forms of surround sound system equally. Although the errors seen in the estimation, when compared to the listening test results, can be quite large, the general trends were shown accurately, even with such a simple correlation model.
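To make the estimation step concrete, the sketch below shows one simple way of turning the averaged cue curves into the kind of localisation estimate plotted in Figures 5.10 to 5.12: the cue value produced by the decoder at each encoded angle is compared against the reference curve for real sources, and the real source angle with the closest cue value is taken as the estimate. This is an illustrative reconstruction rather than the Matlab routine referred to above; the nearest-value matching, the variable names and the plotting are assumptions.

% Minimal sketch: estimate the perceived direction of a decoded source by
% matching its averaged cue value against the curve for real sources.
% refCue(i) = averaged cue (e.g. mean low frequency time difference) of a
%             real source at angles(i); decCue(i) = the same cue produced by
%             the decoder for a source encoded at angles(i).  Both are
%             assumed to have been precomputed as sketched in section 5.2.5.
% Note: the front/back ambiguity (cone of confusion) is ignored here.

angles   = 0:5:355;                            % 5 degree resolution
estimate = zeros(size(angles));

for i = 1:numel(angles)
    [~, idx]    = min(abs(refCue - decCue(i)));   % closest real source cue value
    estimate(i) = angles(idx);
end

plot(angles, estimate, 'x');
xlabel('Encoded source angle (degrees)');
ylabel('Estimated source angle (degrees)');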
Figure 5.10  Listening test results and estimated source localisation for 1st Order Ambisonics (perceived angle against actual source angle for low, band and high pass filtered signals, together with source localisation estimates using the interaural amplitude and interaural time differences at low and mid frequencies).

Figure 5.11  Listening test results and estimated source localisation for 2nd Order Ambisonics.

Figure 5.12  Listening test results and estimated source localisation for five speaker 1st Order Ambisonics.

Figure 5.13  Listening test results for the amplitude panned five speaker system.

Figure 5.14  Average Time and Frequency Localisation Estimate for 1st Order Ambisonics.

Figure 5.15  Average Time and Frequency Localisation Estimate for 2nd Order Ambisonics.
Figure 5.16  Average Time and Frequency Localisation Estimate for five speaker 1st Order Ambisonics.

Figure 5.17  RT60 measurement of the University of Derby's multi-channel sound research laboratory, shown in 1/3 octave bands.
Frequency (kHz): 0.125  0.160  0.200  0.250  0.315  0.400  0.500  0.630  0.800  1.000  1.250  1.600  2.000  2.500  3.150  4.000  5.000  6.300  8.000  10.000
RT60 (s):        0.65   0.65   0.65   0.50   0.45   0.30   0.30   0.30   0.35   0.35   0.35   0.35   0.40   0.55   0.45   0.40   0.35   0.30   0.35   0.30

5.3 Optimisation of the Ambisonics system

5.3.1 Introduction

In this part of the chapter the decoding techniques that have been utilised in the system described in Chapter 4 (Ambisonics, binaural and transaural) will be discussed and optimised so as to maximise both their spatial performance and their sound quality. Some of these optimisations are more logically formulated than others, with the optimisation of the Ambisonics system being the most involved, both mathematically and perceptually, so this system will be considered first.

As discussed in Chapter 4, the Ambisonics system will be the basis for the proposed hierarchical multi-channel system, but while the encoding process is a fixed standard (using the spherical harmonics described in Chapter 3) the decoding process is not necessarily as straightforward. As the Ambisonics system is very flexible, any 1st order microphone response can be chosen, along with the virtual microphone's direction. Gerzon's original theory stated that the virtual microphone response for the decoder (he concentrated on regular set-ups initially) should be chosen according to a number of mathematical approximations to the signals that would reach the ears of a listener (Gerzon, 1974) and, for regular speaker arrays, this was a relatively straightforward optimisation to perform (see section 3.3.1.2). However, since the introduction of the DVD, the standard speaker layout as specified by the ITU is the five speaker layout shown in Figure 5.18. This is likely to be expanded upon in the near future, and other, larger venues are likely to have more speakers to cover a larger listening area.

Figure 5.18  Recommended loudspeaker layout, as specified by the ITU (centre speaker at 0°, 60° between the front pair, 80° between each front and rear speaker, and 140° between the rear pair).

Due to the likelihood of ever changing reproduction layouts, a more portable approach should be used in the creation of multi-channel material, and such a system has been around since the 1960s (Borwick, 1981). Ambisonic systems are based on a spherical decomposition of the sound field to a set order (typically 1st or 2nd order (Malham, 2002; Leese, n.d.)). The main benefit of the Ambisonic system is that it is a hierarchical system; that is, once the sound field is encoded in this way (into four channels for 1st order, and nine channels for 2nd order) it is the decoder that decides how this sound field is reconstructed using the Ambisonic decoding equations (Gerzon, 1977b).
This system has been researched, mainly by Gerzon, and in 1992 papers were published suggesting a method of optimising Ambisonic decoders for irregular speaker arrays (Gerzon & Barton, 1992), as the original decoding equations were difficult to solve for irregular speaker arrays in the conventional way (the use of shelving filters (Gerzon, 1974)).

5.3.2 Irregular Ambisonic Decoding

In order to quantify decoder designs, Gerzon decided on two main criteria for designing and evaluating multi-speaker surround sound systems in terms of their localisation performance. These represent the energy and velocity vector components of the sound field (Gerzon, 1992c). The vector lengths represent a measure of the 'quality' of localisation, with the vector angle representing the direction that the sound is perceived to originate from, and a vector length of one indicating a good localisation effect. These are evaluated as shown in Equation (5.3).

P = Σ_{i=1..n} g_i				E = Σ_{i=1..n} g_i²
V_x = Σ_{i=1..n} g_i cos(θ_i) / P		E_x = Σ_{i=1..n} g_i² cos(θ_i) / E
V_y = Σ_{i=1..n} g_i sin(θ_i) / P		E_y = Σ_{i=1..n} g_i² sin(θ_i) / E		(5.3)

where g_i represents the gain of the ith speaker (assumed real for simplicity), n is the number of speakers and θ_i is the angular position of the ith speaker.

For regular speaker arrays, this was simply a case of using one virtual microphone response for low frequencies and a slightly different virtual microphone response for the mid and high frequencies, by the use of shelving filters (Farino & Uglotti, 1998), as shown in Figure 5.19 and Figure 5.20. This is extremely similar to the theory and techniques used in Blumlein's spatial equalisation described in Chapter 2.

Figure 5.19  Virtual microphone polar plots (low and high frequency responses) that bring the vector lengths in Equation (5.3) as close to unity as possible (as shown in Figure 5.21), for a 1st order, eight speaker rig.

Figure 5.20  Velocity and energy localisation vectors for D_low = 1 : D_high = 1. Magnitude plotted over 360° and angle plotted at five discrete values. Inner circle represents the energy vector, outer circle represents the velocity vector. Using virtual cardioids.

As long as the virtual microphone patterns were the same for each speaker, the localisation angle was always the same as the encoded source angle; only the localisation quality (the length of the vector) was affected by changing the polar patterns.

Figure 5.21  Velocity and energy localisation vectors for D_low = 1.33 : D_high = 1.15. Magnitude plotted over 360° and angle plotted at five discrete values. Inner circle represents the energy vector, outer circle represents the velocity vector. Using the virtual patterns from Figure 5.19.

However, when non-regular speaker arrays are used, not only do the vector magnitudes need to be compensated for, but the replay angle and overall volume of the decoded sound also need to be taken into account. This results from the non-uniformity of the speaker layout. For example, if all of the speakers had the same polar pattern, then a sound encoded to the front of a listener would be louder over an ITU five speaker system than a sound emanating from the rear, due to the higher density of speakers at the front of the speaker array.
Also, the perceived direction of the reproduced sound would be distorted, as shown in Figure 5.22.

Figure 5.22  Energy and velocity vector response of an ITU 5-speaker system, using virtual cardioids (reproduced angles shown: 0, 12.25, 22.5, 45, 90 and 135 degrees).

These artefacts are not a problem when producing audio for a fixed set-up (i.e. amplitude panned 5.1), as material is mixed so that it sounds correct on the chosen speaker layout. However, as the point of using a hierarchical surround sound format is that an audio piece should sound as similar as possible on as many speaker layouts as possible, these artefacts must be corrected after the encoding has occurred, that is, during the decoding stage. Due to the added complexity of the speaker array's response to an Ambisonic system, Gerzon and Barton (1992) proposed that two separate decoders be used, one for low frequencies (< ~700 Hz) and another for high frequencies (> ~700 Hz). This can be achieved using a simple cross-over network feeding low and high passed versions of the Ambisonic B-format signals to the two decoders. It is also important that the cross-over filters are perfectly phase matched so that the reinforcement and cancellation principles used by Ambisonics still function correctly.

5.3.3 Decoder system

1st order Ambisonics comprises four different signals, as shown in Figure 5.23: an omni-directional pressure signal (W), a front-back figure of eight (X), a left-right figure of eight (Y), and an up-down figure of eight (Z).

Figure 5.23  Polar patterns of the four B-format signals used in 1st order Ambisonics.

As the 5-speaker system shown in Figure 5.18 is a horizontal only system, only three of the four available B-format signals are needed to feed the decoder (W, X and Y). Also, as the speaker array in Figure 5.18 is left/right symmetric, we can assume that the decoder coefficients work in pairs (i.e. sums and differences). The Ambisonic encoding equations are given in Equation (5.4).

W = 1/√2
X = cos(θ)
Y = sin(θ)		(5.4)

where θ is the encoded angle, taken anti-clockwise from straight ahead.

As another tool in the decoding of the sound field, a 'frontal dominance' parameter will be seen to be useful, as shown in Equation (5.5). This is not the best form of the frontal dominance equation (it has a non-linear response to the dominance parameter), but it is used to keep compatibility with Gerzon's previous paper on this subject (Gerzon & Barton, 1992).

W′ = 0.5(λ + λ⁻¹)W + (1/√8)(λ − λ⁻¹)X
X′ = 0.5(λ + λ⁻¹)X + (1/√2)(λ − λ⁻¹)W
Y′ = Y		(5.5)

where λ is the forward dominance parameter (> 1 for front, and < 1 for rear dominance).

These encoding equations are then substituted into the decoding equations to give a numerical value for each speaker's output for a particular signal, as given in Equation (5.6). In this equation it can be seen that what were previously sine and cosine (i.e. directionally dependent) weightings are now arbitrary values (nominally to be chosen between 0 and 1), denoted by kW, kX and kY.

CF = (kW_C × W′) + (kX_C × X′)
LF = (kW_F × W′) + (kX_F × X′) + (kY_F × Y′)
RF = (kW_F × W′) + (kX_F × X′) − (kY_F × Y′)
LB = (kW_B × W′) + (kX_B × X′) + (kY_B × Y′)
RB = (kW_B × W′) + (kX_B × X′) − (kY_B × Y′)		(5.6)

where k denotes a decoding coefficient (e.g. kW_C represents the weighting given to the W channel for the centre front speaker).
F, B and C denote the front, back and centre speakers respectively (C, L and R denoting centre, left and right), and W′, X′ and Y′ represent the incoming B-format signals after potential transformation by the forward dominance equation.

The values for λ and the 'k' coefficients are to be chosen to optimise the decoder's output, with λ having possible values between 0 and 2, and the 'k' values having a nominal range between 0 and 1. Equation (5.7) shows the measures which are used to assess the performance of a given solution. The conditions that must be met are:
• The lengths of the localisation vectors (R_V and R_E) should be as close to 1 as possible for all values of θ.
• θ = θ_V = θ_E for all values of θ.
• P_V = P_E and must be constant for all values of θ.

V_x = Σ_{i=1..n} g_i cos(SPos_i) / P_V		E_x = Σ_{i=1..n} g_i² cos(SPos_i) / P_E
V_y = Σ_{i=1..n} g_i sin(SPos_i) / P_V		E_y = Σ_{i=1..n} g_i² sin(SPos_i) / P_E
R_V = √(V_x² + V_y²)				R_E = √(E_x² + E_y²)
θ_V = tan⁻¹(V_y / V_x)				θ_E = tan⁻¹(E_y / E_x)
P_V = Σ_{i=1..n} g_i				P_E = Σ_{i=1..n} g_i²		(5.7)

where g_i = gain of the ith speaker, SPos_i = angular position of the ith speaker, V denotes the velocity vector and E denotes the energy vector.

The reason that these equations are difficult to solve is that the best result must be found over the whole listening area, spanning 360°. Even Gerzon admitted that these equations were laborious to solve for five speakers, and the more speakers present (i.e. the more values that must be optimised), the more laborious and time consuming finding a solution becomes. Also, there is more than one valid solution for each decoder (low frequency and high frequency), meaning that a group of solutions needs to be found, and then auditioned, to determine the best set of coefficients.

A system that can automatically calculate decoder coefficients is therefore needed, and possibly one that can distinguish between sets of coefficients that meet the criteria set out by the energy and velocity vector theories. This system does not need to be particularly fast, as once a group of solutions is found the program should not need to be used again unless the speaker layout changes.

5.3.4 The Heuristic Search Methods

As each parameter in the Ambisonic decoding equations has a value within a well defined range (0 to 1 or 0 to 2), a search method offers an effective solution to the array optimisation problem. However, if we wish to determine the settings to two decimal places there are around 2 × 10¹⁸ possible solutions (given that there are nine search parameters) and an exhaustive search is not feasible (Wiggins et al., 2003). When deciding on the type of heuristic method, an empirical approach was used. The most important part of any heuristic search method is the development of the fitness equations. These are the functions that give the heuristic search method a measure of the success of its choices. Care must be taken when choosing these functions to make sure that it is not possible for different error conditions to cancel each other out, the most logical solution to this problem being to ensure that any error in the decode results in a positive number. The fitness equations developed for this project are described later in this chapter. The first avenue of research taken was that of a Genetic Algorithm approach, as this is one of the better known heuristic methods.
This was first implemented as a Matlab script but did not seem to converge to a good result, so the next system to be tried was one using an algorithm based on the Tabu search, as this has been shown to converge more accurately when used in a small search space (Berry & Lowndes, 2001). It was while developing this algorithm that it was discovered that the initial velocity and energy vector calculations contained errors and, once these were corrected, the Tabu search algorithm performed as expected. As this Tabu algorithm performed well, the genetic algorithm was not tried again at this point due to its known convergence problems, as described above (Genetic Algorithms are better suited to a very large search space, which this problem does not have).

This adapted form of Tabu search works by having the decoder coefficients initialised at random values (or the values of a previous decoder, if these values are to be optimised further). The Tabu search program then tries changing each of the 'tweakable' values by plus or minus the step size. The best result is kept, and the parameter changed is then restricted to move only in the successful direction for a set number of iterations (which, of course, will only happen if this parameter is, again, the best one to move). It must be noted that the random start position is of great importance, as it is this that helps in the search for a wide range of solutions.

The most important part of the Tabu search algorithm is the set of equations used to measure the fitness of the decoder coefficients, as it is this one numerical value that will determine the course that the Tabu search takes. As mentioned above, three parameters must be used in an equation that represents the overall fitness of the decoder coefficients presented. These are:
• Localisation measure (vector lengths, R_V and R_E).
• Localisation angle (vector angles, θ_V and θ_E).
• Volume (sound pressure gain, P_V, and energy gain, P_E) for each encoded direction.

As each of the parameters must be as good a fit as possible over the whole 360° sound stage, the three parameters must be evaluated for a number of different encoded source positions. Gerzon evaluated these parameters at 14 points around the unit circle (7 around a semi-circle, assuming left/right symmetry), but as computers can calculate these results so quickly, an encoded source resolution of 4° intervals was used (90 points around the unit circle). Due to the large number of results for each of the fitness values, an average was taken for each fitness parameter using a root mean square approach. If we take the example of the fitness of the vector lengths (the localisation quality parameter), then if a mean average were taken, a less-than-one vector length in one part of the circle could be compensated for by a greater-than-one vector length elsewhere. However, if we take a good fit to be zero and use a root mean square approach, then a non-perfect fit around the circle will always give a positive error value, meaning that it is a true measure of the fitness. The equations used for each of the fitness parameters are shown in Equation (5.8).

VFit = Σ_{i=0..n} (1 − P_0/P_i)² / n
MFit = Σ_{i=0..n} (1 − R_i)² / n
AFit = Σ_{i=0..n} (θ_i^Enc − θ_i)² / n		(5.8)

where P_0 is the pressure at an encoded direction of 0°, R_i represents the length of the vector at direction i, n is the number of points taken around the unit circle, θ_i^Enc is the encoded source angle and θ_i is the localisation angle.
VFit, MFit and AFit are the numerical fitness parameters used to measure the performance of a particular decoder (volume, magnitude and angle respectively). Given the three measures of fitness in Equation (5.8), the overall fitness for the high and low frequency versions of the decoder is actually calculated slightly differently. The low frequency decoder can achieve a near perfect fit, but the best fit that the high frequency decoder can expect to achieve is shown in Figure 5.32. The best results were obtained from the Tabu search algorithm if the overall fitness was weighted more towards the angle fitness, AFit from Equation (5.8), as shown in Equation (5.9).

LFFitness = AFit + MFit + VFit
HFFitness = AFit + (MFit + VFit)/2		(5.9)

A block diagram of the Tabu search algorithm used in this research is shown in Figure 5.24.

The main benefit of the Tabu search method is that all three of the conditions to be met can be optimised simultaneously, which had not been accomplished in Gerzon's Vienna paper (Gerzon & Barton, 1992). For example, if we take the speaker layout used in the Vienna paper, which is not the ITU standard but is reasonably similar (it is a more regular layout than the one the ITU specified after Gerzon's paper was published), then the coefficients derived by Gerzon and Barton would give the energy and velocity vector response shown in Figure 5.25. Several points are apparent from this figure. There is a high/low localisation angle mismatch, due to the forward dominance being applied to the high frequency decoder's input after the localisation parameters were used to calculate the values of the coefficients (as first reported in Wiggins et al., 2003). If the frontal dominance is applied to both the high and low frequency decoders, a perceived volume mismatch occurs, with the low frequency decoder replaying sounds that are louder in the frontal hemisphere than in the rear. Also, even if these mismatches were not present (that is, the frontal dominance is not applied), every set of results presented in the Vienna paper showed a distortion of the decoder's reproduced angles. Figure 5.25 also shows a set of coefficients calculated using the Tabu search algorithm described in Figure 5.24, and demonstrates that if all three criteria are optimised simultaneously a decoder can be designed that has no angle or volume mismatches, and which should reproduce a recording more faithfully than has been achieved by previous Ambisonic decoders for irregular arrays.

Figure 5.24  A simple Tabu search application (initialise the decoder coefficients; if allowed, add and subtract the step size from each coefficient; store the best local result and update the tabu'd coefficients and directions; if the new result is the best so far, store it as the best overall result; loop for N iterations).

Figure 5.25  Graphical plot of the Gerzon/Barton coefficients published in the Vienna paper and the Wiggins coefficients derived using a Tabu search algorithm (speakers, velocity vector, energy vector and sound pressure level shown for each decode). Encoded/decoded direction angles shown are 0°, 12.25°, 22.5°, 45°, 90°, 135° and 180°.
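To make the optimisation loop more tangible, a simplified Matlab sketch is given below. It evaluates the velocity vector terms of Equation (5.7) and the three low frequency fitness measures of Equation (5.8) (combined with equal weights) for a candidate set of coefficients, and wraps them in the adapted Tabu step described in section 5.3.4: each coefficient is tried plus and minus the step size, the best move is kept, and that coefficient is then restricted to the successful direction for a number of iterations. This is an illustration only, not the application used in this research; the speaker angles, step size, tabu tenure, iteration count and all variable names are assumptions.

% Simplified sketch: velocity vector / low frequency fitness (Equations 5.7
% and 5.8, equal weights) driving an adapted Tabu search step.

spkPos   = deg2rad([0 30 -30 110 -110]);   % CF, LF, RF, LB, RB (ITU-like, assumed)
k        = rand(1,8);                      % kW_C kX_C kW_F kX_F kY_F kW_B kX_B kY_B
step     = 0.01;
lockDir  = zeros(1,8);                     % allowed direction while a coefficient is tabu'd
lockLeft = zeros(1,8);                     % iterations left on each restriction

for it = 1:2000
    bestF = inf;  bestP = 1;  bestD = 1;
    for p = 1:8
        for d = [1 -1]
            if lockLeft(p) > 0 && d ~= lockDir(p), continue; end
            kTry = k;  kTry(p) = kTry(p) + d*step;
            f = lfFitness(kTry, spkPos);
            if f < bestF, bestF = f;  bestP = p;  bestD = d; end
        end
    end
    k(bestP)        = k(bestP) + bestD*step;   % keep the best move
    lockLeft        = max(lockLeft - 1, 0);
    lockDir(bestP)  = bestD;                   % only allow this direction...
    lockLeft(bestP) = 10;                      % ...for the next 10 iterations
end
disp(lfFitness(k, spkPos));                    % final (low frequency) fitness

function f = lfFitness(k, spkPos)
    th = deg2rad(0:4:356);                     % 4 degree encoded source resolution
    vfit = 0;  mfit = 0;  afit = 0;
    for i = 1:numel(th)
        W = 1/sqrt(2);  X = cos(th(i));  Y = sin(th(i));
        g = [k(1)*W + k(2)*X, ...                             % speaker gains, Equation (5.6)
             k(3)*W + k(4)*X + k(5)*Y,  k(3)*W + k(4)*X - k(5)*Y, ...
             k(6)*W + k(7)*X + k(8)*Y,  k(6)*W + k(7)*X - k(8)*Y];
        P = sum(g);                                           % pressure, Equation (5.7)
        if P <= 0, f = inf; return; end                       % reject unusable coefficient sets
        Vx = sum(g .* cos(spkPos)) / P;
        Vy = sum(g .* sin(spkPos)) / P;
        if i == 1, P0 = P; end
        aerr = mod(th(i) - atan2(Vy, Vx) + pi, 2*pi) - pi;    % wrapped angle error
        vfit = vfit + (1 - P0/P)^2;                           % volume fitness, Equation (5.8)
        mfit = mfit + (1 - sqrt(Vx^2 + Vy^2))^2;              % vector length fitness
        afit = afit + aerr^2;                                 % angle fitness
    end
    f = (vfit + mfit + afit) / numel(th);
end

A full implementation would also evaluate the energy vector terms for the high frequency decoder, apply the weightings of Equation (5.9), and keep a record of the multiple solutions found from different random starting points.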
Figure 5.26  The transition of the eight coefficients in a typical low frequency Tabu search run (2000 iterations), showing the Tabu search paths for the W, X and Y coefficients (centre, front and back) and the overall fitness value against iteration number. The square markers indicate the three most accurate sets of decoder coefficients (low fitness).

Figure 5.27  The virtual microphone patterns obtained from the three optimum solutions indicated by the squares in Figure 5.26.

While writing up this research thesis, Craven (2003) released a paper detailing how 4th order circular harmonics (i.e. Ambisonic spherical harmonics without the height information) could be used to create an improved panning law for irregular speaker arrays. The example decoder Craven includes in his paper has the velocity/energy vector representation and virtual microphone patterns shown in Figure 5.28 and Figure 5.29 respectively.

Figure 5.28  Energy and velocity vector analysis of a 4th order Ambisonic decoder for use with the ITU irregular speaker array, as proposed by Craven (2003).

Figure 5.29  Virtual microphone patterns used for the irregular Ambisonic decoder shown in Figure 5.28.

The method Craven used to derive this new decoder is not detailed in his paper, and he has opted for a frequency independent decoder, no doubt in order to make the panning law easily realisable on current software/hardware platforms. It can be seen that the performance of the high frequency energy vector analysis is very good with respect to the vector length; however, the matching of the high and low frequency vector angles is not ideal, and the vector length of the low frequency velocity vector should also be designed to be as close to 1 as possible (Gerzon & Barton, 1992). These problems are mostly due to the fact that a frequency independent decoder has been presented, so any such decoder will always be a compromise between optimising for the energy vector and optimising for the velocity vector across the three fitness parameters of length, perceived direction and perceived amplitude. However, using the Tabu method just described, it is a simple matter to change the weightings of the fitness equations, as shown in Equations (5.8) and (5.9), in order to design a decoder with more coherent lateralisation cues.

In order to experiment with higher order decoder optimisation, a new Tabu search application was developed, using the same fitness criteria as before, but with user editable weighting functions. A screenshot of this can be seen in Figure 5.30.

Figure 5.30  Screenshot of the 4th Order Ambisonic Decoder Optimisation using a Tabu Search Algorithm application.

The sets of up/down arrows in the 'Fitness Calculation' box are where the user can set the weightings of each of the individual fitness values, in order to influence the performance of the Tabu search algorithm.
It can be seen, in Figure 5.30, that the perceived volume fitness is governed by the energy ('En Vol', high frequency) rather than the pressure ('Vel Vol', low frequency). Due to the frequency independent nature of these decoders, one or the other must be chosen, and as the energy vector covers a much wider frequency band for a centrally seated listener (> 700 Hz), and an even larger frequency band for off-centre listeners, it is always advisable to use the average energy as an indicator of the perceived amplitude of a decoded source (Gerzon, 1977a).

Figure 5.31  Graph showing the polar patterns and velocity/energy vector analysis of a 4th order decoder optimised for the 5 speaker ITU array using a Tabu search algorithm.

Figure 5.31 shows a 4th order decoder optimised by the Tabu search application shown in Figure 5.30. It can clearly be seen that although the length (and therefore shape) of the energy vector plot is very similar to that of Craven's decoder shown in Figure 5.28, indicating a similar performance, this Tabu search optimised decoder shows improvements in other aspects:
• The low frequency velocity vector has a length much closer to 1 for a source panned in any direction.
• The low and high frequency perceived directions are in better agreement.

The optimisation of a 4th order decoder, as proposed by Craven (2003), shows the robust and extensible nature of the Tabu search algorithm described in this report, as more than double the number of alterable parameters (23 as opposed to 9) were used in this program.

5.3.5 Validation of the Energy and Velocity Vector

It can be seen in Figure 5.26 and Figure 5.27 that, according to the velocity vector, it is possible to design a low frequency decoder that satisfies all of the fitness parameters discussed in the previous section. This is possible even when the ITU standard speaker layout is used (although the high frequency decode suffers, theoretically, in this configuration), as shown in Figure 5.32. If we take the velocity vector as a measure of the low frequency localisation, which is dominated by time/phase differences between the ears, and the energy vector as a measure of the mid frequency localisation, which is dominated by level differences between the ears, then this theory can be tested using head related transfer functions (Wiggins et al., 2001). The HRTF data used is from Gardner & Martin (1994). Assuming the head remains pointing straight ahead, the speakers will remain in a fixed position in relation to the head, and time and level difference plots can be obtained.

Figure 5.32  A decoder optimised for the ITU speaker standard.

Using the average group delay between 0 and 700 Hz to obtain the time differences between the ears, and the average magnitude between 700 Hz and 3 kHz, reference plots can be calculated which the decoder's output must follow in order to fool the ear/brain system successfully.
The head related transfer functions for the Ambisonic array can be calculated in one of two ways:
• A pair of HRTFs can be applied to each speaker's output, and the left and right ear responses then summed, resulting in a single response pair (for each encoded direction).
• The decoder can be encoded into a pair of HRTFs for each input signal (W, X and Y in this case) using the method described in section 5.2.4.

Both of the above methods ultimately arrive at the same results and, if only offline analysis is needed, either of these methods can be chosen (the second is computationally more efficient if auralisation of the decoder is desired (Wiggins et al., 2001) and becomes more efficient than the first method the greater the number of speakers used). Two resulting pairs of HRTF responses have been produced for encoded sources all around a listener, one pair for the low frequency decoder and one pair for the high frequency decoder. A graph showing the level and time differences of real and Ambisonically decoded signals is shown in Figure 5.33 (note that an Ambisonic decode to a five speaker rig is often referred to as G-format). The HRTF analysis graphs have been constructed using the anechoic HRTFs measured by MIT (Gardner & Martin, 1994). A real source is taken as a single pair of these HRTFs, and the Ambisonic (G-format) output has been constructed from a combination of these anechoic HRTFs weighted to various degrees depending on the simulated source direction (i.e. a simulation of an Ambisonic decode). When using the HRTF analysis, the low frequency range was 0 Hz – 700 Hz, and the mid frequency range was 700 Hz – 3 kHz. The 700 Hz value was used so that the results could be directly compared to the velocity and energy vector analysis used by Gerzon & Barton (1992), with the 3 kHz value used as a nominal value. The x-axis scale in these graphs represents either a real or a synthesised Ambisonic source position in degrees. The y-axis scaling represents either the average time difference (in samples, sampled at 44.1 kHz) or the average amplitude difference, measured linearly, with an amplitude of one representing 0 dB gain.
As the set of HRTFs used for the auralisation and analysis of the Ambisonic decoders are taken using a fixed head, head rotation is achieved by moving the speaker sources around the listener (which is, essentially, the same thing). This more complex relationship between the real and virtual source’s localisation cues can then be observed. A well designed decoder will have localisation cues that follow the changing real - 153 - Chapter 5 cues as closely as possible, where as a decoder that does not perform as well will exhibit various artefacts, such as the virtual source moving with the listeners as they rotate their head in any one direction (in the horizontal plane in this example). Figure 5.34 shows a graphical representation of two sets of decoder coefficients that solve the energy and velocity vector equations (as good a fitness value as possible). It can be clearly seen that the low frequency decoder (that we shall concentrate on here) has different virtual microphone responses for each of the decoders even though the decoders’ performance analysis using the velocity vector gives an identical response for each coefficient set. To make a more detailed comparison between these two sets of coefficients we can use the HRTF simulation described above. Coefficient Set 2 Coefficient Set 1 HF Virtual Mic Polar Pattern HF Virtual Mic Polar Pattern LF Virtual Mic Polar Pattern LF Virtual Mic Polar Pattern Velocity and Energy Vectors Figure 5.34 Velocity and Energy Vectors Graphical representation of two low/high frequency Ambisonic decoders. Figure 5.35 shows that coefficient set 2 has a better match of the low frequency time difference parameter, when analysed using the HRTF data, than coefficient set 1. However, this does show up a shortcoming of the energy and velocity vector technique. As mentioned already, a number of solutions can be found that satisfy the energy vector equations, and a number of solutions can be found that satisfy the velocity vector equation. Once a - 154 - Chapter 5 good set of coefficients have been produced it has previously been a case of listening to the resulting decoders and subjectively deciding which one is ‘best’. Coefficient Set 1 Coefficient Set 2 LF Time Difference : 0 degrees 40 G Format Real Source 20 0 -20 -40 0 50 100 150 200 250 300 350 400 HF Amp Difference 2 G Format Real Source 1 0 -1 -2 0 50 Figure 5.35 100 150 200 250 Source Position (degrees) 300 350 400 Amplitude (frequency domain) Time Difference (samples) Amplitude (frequency domain) Time Difference (samples) LF Time Difference : 0 degrees 40 G Format Real Source 20 0 -20 -40 0 50 100 150 200 250 300 350 400 HF Amp Difference 2 G Format Real Source 1 0 -1 -2 0 50 100 150 200 250 300 350 400 Source Position (degrees) HRTF simulation of two sets of decoder. However, if we continue the HRTF simulation, the effect that head rotation has on the reproduced sound field can be observed (see Figure 5.36). In anechoic circumstances, simulating a change of head orientation and a rotation of all the speaker positions are actually the same thing. So in order to accurately simulate head movement, all the speakers are rotated. This should have the effect of the time and amplitude difference graphs cyclically shifting when compared to Figure 5.35. Any difference in the graphs apart from the cyclic shift is in error with what should be happening (and what can always be seen in the graphs with regards to an actual source). 
Observing Figure 5.36, it can be seen that head movement introduces errors into the mean time and level differences presented to a listener in anechoic circumstances. The low frequency time difference results are similar in error, but a difference can clearly be seen: coefficient set 1's low frequency plots stay faithful to a real source's time difference, whereas the second set of coefficients does not behave as well. If we look at the real and virtual source shown at 0° on the graphs (representing where the listener is facing, which will now be an off-centre source due to the rotation of the speakers), the virtual response should follow that of a real source. That is, a source at 0° should now have an off-centre response as the speakers have rotated (which, again, is the same as head rotation in anechoic circumstances).

Figure 5.36  HRTF simulation of head movement using two sets of decoder coefficients.

This is not the case for the second set of coefficients, and it can be seen that as the head is rotated, the virtual source's time difference stays at approximately 0 samples. This means that when the head is rotated, the virtual sound source will track with the listener, potentially making the resulting sound field confusing and unstable.

5.3.6 HRTF Decoding Technique – Low Frequency

The evidence gathered from the HRTF analysis of the decoders' performance under head movement suggests that, as far as the low frequency velocity vector is concerned, more information is needed to design a decoder that is both stable under head rotation and has accurate image localisation. However, as the velocity vector is used as an approximation to the interaural time difference, it is now possible to alter the Tabu search algorithm described in section 5.3.4 to ignore the velocity vector and deal directly with the interaural time difference present for encoded sources around the unit circle. This, on its own, may lead to potential performance increases, as the interaural time difference for a listener looking straight ahead can be mapped more accurately using HRTF data than with the velocity vector theory. Also, head rotations can be simulated as shown above, and these results taken into account when evaluating the fitness of a particular decoder. So, as is immediately apparent, the actual Tabu search algorithm will remain the same (the decoder still has the same number of coefficients, etc.) but the algorithm that supplies the Tabu search with its fitness value must be altered to take advantage of this new approach.

Fitness = Σ_{m=0..12} [ Σ_{k=0..360} ( ∂φ_ref^k/∂ω − ∂φ_dec^k/∂ω )² ] / 13		(5.10)

where k = source angle, φ = average phase response (0 – 700 Hz), ω = frequency and m = head rotation number.

The fitness is now calculated using Equation (5.10) and then combined with the pressure level (volume) fitness given in Equation (5.8) using the root mean square value. Again, the closer this fitness value is to 0, the better the performance of the decoder coefficients. In order to take head movement into account, this equation is evaluated using speaker rotations from 0° to 60° in 5° increments, and then the average fitness is taken.
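A minimal sketch of how the fitness of Equation (5.10) might be evaluated is given below. It assumes two hypothetical helper functions: decoderITD(k, coefs, rot), returning the 0 – 700 Hz mean interaural time difference produced by a candidate decoder (with coefficients coefs) for a source encoded at k degrees with the speaker array rotated by rot degrees, and refITD(k, rot), returning the same measure for a real source (both could be built from the decode of Equation (5.6) and the group delay averaging sketched in section 5.2.5). The helpers, the 1° source resolution and the variable names are assumptions made for illustration.

% Minimal sketch of the low frequency HRTF fitness (Equation 5.10): squared
% interaural time difference error between real and decoded sources, summed
% over encoded source angle and averaged over simulated head rotations
% (achieved by rotating the speaker positions).

rotations = 0:5:60;                            % 13 simulated head orientations
fitness   = 0;

for m = 1:numel(rotations)
    err = 0;
    for k = 0:360                              % encoded source angle in degrees
        d   = decoderITD(k, coefs, rotations(m));   % hypothetical helper (decoder cue)
        r   = refITD(k, rotations(m));              % hypothetical helper (real source cue)
        err = err + (r - d)^2;
    end
    fitness = fitness + err;
end
fitness = fitness / numel(rotations);          % average over the head orientations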
Figure 5.37  Comparison between the best velocity vector set of coefficients (top) and an HRTF derived set of coefficients (bottom): low frequency time differences for head rotations of 0°, 30° and 60° (G-format decode compared with a real source).

Figure 5.38  Polar pattern and velocity vector analysis of a decoder derived from HRTF data.

In terms of the low frequency decoders that this technique produces, there is a very high correlation between this HRTF method and the previous velocity vector analysis. That is, a decoder calculated using HRTF data produces a good velocity vector plot, as shown in Figure 5.38. However, it can be seen that a compromise is needed between the accuracy of the decoder's localisation (according to the velocity vector) and its image stability under head rotations. To see whether this is actually the case, Figure 5.37 shows the HRTF analysis of the best velocity vector decoder (as used in Figure 5.36) and a set of decoder coefficients derived using HRTF data. It can be seen that the resulting plots are almost identical for each reproduced angle and degree of head rotation (0°, 30° and 60° in this case). The HRTF derived set actually seems to have a better fit than the velocity vector analysis suggests, and a slightly better fit than the original velocity vector decoder (which was found to be the best of several found using the velocity vector technique). So, as the decoder is now calculated taking head rotation into account, every decoder produced using this technique (as there are, again, multiple solutions) will have an analytical performance similar to that shown in Figure 5.37.

5.3.7 HRTF Decoding Technique – High Frequency

As already stated (and as can be seen in Figure 5.36), the decoder's high frequency response is much more difficult to match to that of a real source, and most decoders derived using the energy vector theory have a response to head rotations very similar to those shown in Figure 5.36. However, as is shown in the listening test later in this chapter, although decoders can be designed using HRTF data directly, taking head rotations into account, this will not necessarily result in decoders that perform better under head rotations than a decoder designed using the energy vector analysis. A decoder designed using the velocity and energy vectors can, clearly, still have a good response to head rotations; it is just that this is not due to the Tabu search algorithm striving for this behaviour. However, when utilising velocity/energy vector optimisations, the head rotation parameter can still be used to differentiate between decoders' performance, as many resulting decoders are possible.

The algorithm used to calculate the fitness parameter for the higher frequency decoder actually needs to be of a slightly different nature to that of the low frequency system.
This is because, after analysing the high frequency lateralisation cues of many optimum decoders (optimum in that they were optimised using the energy/velocity vector methods, or using purely front facing HRTF optimisation), it was found that, due to the non-uniformity of the speaker layout, head turning is more catastrophic for the high frequency amplitude cue than for the low frequency phase cue. If the average fitness were used, then the Tabu search would treat optimising the response under head rotation with the same priority as optimising it for a listener looking straight ahead, possibly resulting in a decoder that performs best when looking 30° to the left, for example. It makes more sense to give priority to the decoder's output when the listener is facing straight ahead, and so a weighting term must be used. The equation used for the localisation fitness parameter is given in Equation (5.11). This resulted in HRTF decoders that performed best when the listener is facing straight ahead; if the weighting parameter was not used, the Tabu search algorithm would converge on decoders with a poor analytical performance (i.e. the fitness function did not truly represent the fitness of the decoder, as a small increase in fitness when facing off-centre made more of a difference than the same increase for a centrally facing listener).

Fitness = Σ_{k=0..360} ( f_ref(k) − f_dec(k) )²		(5.11)

where f = the average magnitude response between 700 Hz and 3000 Hz of a real source (ref) and of a decoded source (dec), each located at k° from centre front, and k = source angle (in degrees).

5.3.8 Listening Test

5.3.8.1 Introduction

In order to try and quantify any improvement that can be attributed to the optimisation techniques described above, listening tests are needed. Although the main body of this report concentrates on the numerical analysis and optimisation of Ambisonic decoders using the lateralisation parameters and the velocity and energy vectors, a number of small listening tests were developed in the hope that others will carry the work further, now that a technique for optimising irregular Ambisonic decoders has been made available.

When designing the listening tests, there are two main types of material that can be presented to the listener:
1. Dry, synthetically panned material.
2. A pre-recorded real event in a reverberant space.
Each of these extremes will result in a test that is more suited to testing for one attribute than another. As an example, in the recent Project Verdi test (Multi Media Projekt Verdi, 2002), two recordings in a reverberant space were used, with the following attributes tested in the questionnaire:
a. Subjective room size – very big to small
b. Localisation accuracy – very good, precise to bad
c. Ensemble depth – deep to flat
d. Ensemble width – wide to narrow
e. Realism of the spatial reproduction – very good, natural to bad, unnatural
f. Personal preference – very good to bad
These are typical of the types of spatial attributes tested when listening to pre-recorded material (although others, suggested by Berg & Rumsey (2001), could be envelopment, presence and naturalness). This type of material is hard to test, in some ways, as it depends on what the tested system is expected to achieve. For example, accurate scene capture and 'best sounding' are not necessarily synonymous; ultimately, the personal preference parameter may be of greater importance.
Conversely, the most common form of test carried out on a dry, synthetically panned source is simple recognition of the angular placement of that source (again, see Moller et al., 1999, Kahana et al., 1997, Nielsen, 1991 and Orduna et al., 1995 as typical examples). However, evaluating other attributes can often lead to a fuller picture of what a particular system is achieving. Such attributes could include:
g. Source width/focus
h. Source distance
i. Source stability (with respect to head movement)

When it comes to testing a surround sound system, the ideals of the system are easier to decide upon. The best case scenario would be a system which:
• Has small image width/good image focus.
• Reproduces distance accurately.
• Reproduces sources in a fixed position, regardless of listener orientation.
Also, not mentioned so far is that the performance of any multi-speaker system in an off-centre position can be assessed using any or all of the above points.

As far as the optimisations of the Ambisonic decoder are concerned, the direct consequences should be (with regard to an Ambisonically panned, dry source):
• Increased accuracy in the matching of encoded source position to perceived source position.
• Increased image stability with respect to head turning.
Other effects of the optimisation may be (again, with regard to an Ambisonically panned, dry source):
• A change in perceived image width/focus.
• Timbral alteration due to differences between the low and high frequency decoders.

All of the above would also be true when listening to pre-recorded, reverberant material, with the potential increase in accuracy and coherency of the lower order lateralisation cues resulting in improvements to the higher order spatial properties of the reproduced audio environments:
• Envelopment should be increased, that is, the sense of being in a real place, and not listening to an array of speakers.
• Spaciousness should more closely resemble that of the actual event.
• Depth perception should be more accurate.
To this end, in order to subjectively test these decoders, questions based around these attributes should be designed.

5.3.8.2 Decoders Chosen for Testing

A small sample listening test was carried out to give an insight into which specific decoders worked best, and also to observe any common features of Ambisonic decoders designed for use with an ITU 5 speaker array, in order to influence further listening tests to be carried out after this research. Five decoders were chosen for this test, comprising:
• One decoder using the default settings of the commercially available SoundField SP451 Surround Processor (SoundField Ltd., n.d. a).
• Two decoders optimised using the energy and velocity vectors.
• Two decoders optimised using HRTF data directly.
An analysis of these decoders will now follow, using both the energy and velocity vector and the HRTF decomposition methods described above.

Figure 5.39  Decoder 1 – SP451 Default Settings

Figure 5.39 shows the default settings of the commercially available SP451 Surround Processor unit. This decoder is frequency independent (i.e. both high and low frequency decoders are the same), with all of the virtual microphone polar patterns being cardioid. This leads to various problems when the decoder is viewed using the energy and velocity vectors, with the resultant lengths of the vectors being suboptimal and all of the source positions being shifted forwards (i.e.
a source that should be at 450 will be reproduced closer to around 200 when decoded). However, when the resulting HRTF analysis is observed, the high frequency amplitude differences are a surprisingly good match to that of an actual source, with the low frequency time difference showing the greatest error. - 164 - Chapter 5 Figure 5.40 Decoder 2 – HRTF Optimised Decoder Figure 5.41 Decoder 3 – HRTF Optimised Decoder Figure 5.40 and Figure 5.41 show two examples of decoders optimised using HRTF data directly. It can be seen that these two decoders have produced similar results when looked at using the HRTF data directly and when using the velocity and energy vector analysis, although the virtual polar patterns for both high and low frequency decoders are quite different. Also, the two types of analysis show good agreement as to the angular distortion introduced by Decoder 3, with frontal sources not producing enough level difference between the ears, and so pushing sources towards the front of the speaker - 165 - Chapter 5 array. Decoder 2 has a much better encoded/decoded source position agreement which is, again, shown in both the HRTF and velocity/energy vector analysis at high frequencies, with very similar performance, again using both forms of analysis, at low frequencies. Figure 5.42 and Figure 5.43 show the two decoders that were designed using the velocity and energy vector theories. One thing to note, firstly, is that these decoders were optimised using rear speaker positions of +/- 1150 instead of the usual +/- 1100. Unfortunately this was not noticed until after the listening test was carried out, but this is why the low frequency velocity vector match is not as good as those shown in Section 5.3.4. Again, both of these decoders have quite different low frequency virtual microphone polar responses, but have near identical velocity vector responses. However, if the HRTF data is looked at, it can be seen that Decoder 4’s low frequency phase differences can be seen to have significant errors around the rear of decoder’s response, showing a ‘flipping’ of the image cues at source positions of 1600 and 2000. The high frequency decodes were designed using slightly different criterion, with the angular accuracy of Decoder 4’s energy vector reproduced angle being given a slightly smaller weighting, resulting in a higher error in the reproduction angle for the rear of the decoder, but with the localisation quality (vector length) benefiting from this approach. - 166 - Chapter 5 Figure 5.42 Decoder 4 – Velocity and Energy Vector Optimised Decoder Figure 5.43 Decoder 5 - Velocity and Energy Vector Optimised Decoder - 167 - Chapter 5 Figure 5.44 Comparison of low frequency phase and high frequency amplitude differences between the ears of a centrally seated listener using the 5 Ambisonic decoders detailed above. Although the HRTF analysis of the various decoders has been shown, no mention has yet been made of the performance of each decoder, numerically, with respect to head turning. Figure 5.44 shows each decoder’s performance, when compared to a real source, with respect to a listener turning their head from 00 (facing straight ahead) to 500. It can clearly be seen that all optimised - 168 - Chapter 5 decoders perform in a very similar manner at low frequencies, with even the unoptimised decoder performing in a coherently incorrect fashion (i.e. it does not seem to exhibit the image tracking of a frontal source, for example, as described in section 5.3.6). 
However, as it is to be expected, the high frequency decoders do not perform as well. Figure 5.45 shows the lateralisation cue errors as absolute error values, with Figure 5.46 showing the average error value for each decoder with respect to head turning. Figure 5.45 Graphs showing absolute error of a decoder’s output (phase and level differences between the ears of a centrally seated listener) compared to a real source, with respect to head movement. - 169 - Chapter 5 Figure 5.46 Graph Showing the Average Time and Amplitude Difference Error with Respect to A Centrally Seated Listener’s Head Orientation. Figure 5.46 shows, in a very simplified manner, how each decoder will perform. Using this graph as an indicator for overall performance, it can be seen that, as already mentioned, all of the decoders perform almost equally as well with respect to low frequency phase cues, with Decoder 1 having, by far the worst error, but, as already mentioned, an error that stays reasonably consistent with head turning. However, it is the high frequency plots that give more insight into the performance of any decoder, as it is the high frequency decoder that is most difficult to optimise, using either energy vector of HRTF techniques. Performing best, here, is Decoder 2, which was designed with head turning as a parameter (although, only up to 30 degrees). However, the decoder with the next best high frequency error weighting is Decoder 5 which is a decoder designed using the energy and velocity vector principles. It must also be noted that, although the decoders all seem to perform similarly (under numerical analysis), looking at the low frequency errors it can be seen that, again, decoder 5 performs very well (best, in fact), but decoder 2 at low frequencies is one of the worst performing decoders (ignoring Decoder 1). Although there are four optimised decoders tested, each low frequency and high frequency decoder was designed separately. No criteria has yet been - 170 - Chapter 5 set for deciding which low frequency decoders will complement particular high frequency decoders and so the decoders have been paired randomly (although always grouped with a decoder that was optimised in the same way, that is, using either HRTF or velocity/energy vector methods). 5.3.8.3 Listening Test Methodology For the actual listening test, two separate testing methods were chosen: • • A listening test similar to that described in section 5.2, measuring the accuracy of panned, mono sources in the decoded sound field. A test where users give a preference as to which decoder performs best when auditioning reverberant, recorded material. These two styles of testing are not designed to be all-encompassing, but have produced interesting points for use in further testing methodologies. Two sources were chosen for the listening tests to be carried out. The source that was to be synthetically panned was dry, female speech which is often used in such tests (for example, see Martin et al., 2001, Kahana et al., 1997, Moller et al., 1999 and Neilsen, 1992) due to its wide frequency range, and reasonably un-fatiguing sound (especially when compared to band-limited noise and other such sources). For the test of a real recording where decoder preference was to be given by a 60 second excerpt from a recording made by the company, Serendipity (2000), of Rick Wakeman playing the piano in Lincoln Cathedral. 
It is a very reverberant recording made by a company that has had significant experience with the SoundField Microphone, particularly in the effective placing of the microphone (something that can often be overlooked when choosing recorded material). For this small test, three listeners were used. All three were experienced listeners that had taken part in multi-channel sound system listening tests before. The first test had sources presented to them, six source positions per decoder. The source positions were identical for each decoder, but played in a pseudo-random order. The listeners were asked to indicate in which direction they thought the source was coming from and to give an indication of source width. This was to be recorded on the sheet shown in Figure 5.47 - 171 - Chapter 5 which showed the layout of speakers in the University’s Multi-Channel Research Lab. In addition, to aid in the recording of source position, each speaker in the lab had a label fixed on it with its angular position relative to straight ahead. They were asked to draw the size of the source, as this method has proved to be more intuitive in these situations (Mason et al., 2000). Figure 5.47 Sheet given to listening test candidates to indicate direction and size of sound source. The user interface for controlling the listening test was constructed in Matlab, which called Simulink models that encoded and decoded the mono sources in - 172 - Chapter 5 real-time, taking in a direction parameter that had been pre-entered. A screen shot of this user interface is shown in Figure 5.48. Figure 5.48 Screenshot of Matlab Listening Test GUI. 5.3.8.4 Listening Test Results The listening test results showed reasonably subtle differences between the different decoders when tested using the synthetically panned source, and much more obvious differences when listening to a more complex, recorded, sound field. Figure 5.49 shows the results for the three listeners. The square data points represent the recorded source position with the error bars, above and below these positions showing the recorded source size for each decoder. It is difficult to analyse these graphs directly, but it can be seen that all of the decoders seem to perform reasonably well in this test with no image flipping becoming apparent, although two sources were recorded as coming from more than one location, subject 1 – decoder 4 and subject 3 – decoder 1. Interestingly these were both at source position 2250, which is the area where the decoders will all perform at their worst (i.e. at the rear of the sound field). - 173 - Chapter 5 Figure 5.49 Graphs showing the results of the panned source part of the listening test for each subject. ‘Actual’ shows the correct position, D1 – D5 represent decoders 1 – 5. In order to compare these results more equally, the average absolute angle error and image size can be seen for each subject in Figure 5.50. As is to be expected, the image source’s graphical depiction of size is different for each subject (Mason et al., 2000), with subject one generally recording smaller image sizes than subjects 2 & 3. It would be reasonable to insert actual source positions in order to record some form of ‘calibration’ size source for each listener, but this was not attempted in this small test. Another obvious result is that decoder one seems to perform worst, subjectively, according to each subject (i.e. high mean error value). This was an expected result. The other results, however, are slightly more varied from listener to listener. 
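For clarity, the averaging used to produce Figure 5.50 and Figure 5.51 can be sketched in a few lines of MATLAB (the environment used for the test software in this work). The sketch below is illustrative only: the variable names and data layout are assumptions, not those of the actual test scripts, and the angular error is wrapped so that, for example, a response of 350° to a 10° source counts as a 20° error.

```matlab
% Sketch of the per-subject / per-decoder averaging (illustrative names only).
% perceived(s,d,p): angle reported by subject s for decoder d at position p (deg)
% actual(p):        encoded source angle for position p (deg)
% srcWidth(s,d,p):  drawn source size (deg)
nSubj = 3; nDec = 5; nPos = 6;
meanErr  = zeros(nSubj, nDec);      % mean absolute localisation error
meanSize = zeros(nSubj, nDec);      % mean reported source width

for s = 1:nSubj
    for d = 1:nDec
        e = zeros(1, nPos);
        for p = 1:nPos
            % wrap the angular difference into the range -180..180 degrees
            diffAng = mod(perceived(s,d,p) - actual(p) + 180, 360) - 180;
            e(p) = abs(diffAng);
        end
        meanErr(s,d)  = mean(e);
        meanSize(s,d) = mean(srcWidth(s,d,:));
    end
end

% Per-decoder averages across all subjects (the kind of values in Figure 5.51)
overallErr = mean(meanErr, 1);
```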
It was proposed in section 5.3.8.2 that decoders 5 and 2 would be expected to perform best, taking into account head turning and the average localisation error this would produce. However, only subject 1 seemed to agree with this statement in its entirety. Decoder 5 did perform consistently well throughout this phase of the test, but decoder 2 performed less favourably when the results of subjects 2 and 3 are observed. - 174 - Chapter 5 Figure 5.50 Graph showing mean absolute perceived localisation error with mean source size, against decoder number. There are a number of potential reasons for this: • • Subject 1 was the most experienced listener in this test, and may give the most correct, or predictable results. Decoder 5 is located at the end of the test, and the subjects may be changing the way they are grading the results (or learning how to interpret them better) as the test continues. This may be corroborated by the general downwards slope that subjects 2 and 3 • show in their average error results. The low and high frequency decoders interact in some more complex, non-linear way than has been simulated in the previous analysis of the decoders (i.e. the low and high frequency decoders should not be designed and analysed in isolation). Figure 5.51 shows the average absolute error and image size for each decoder. It must be noted that, as the image size for each subject has not been normalised, the image size ratios of subject 1 (from decoder to decoder) - 175 - Chapter 5 will have less of an effect than that of subjects 2 and 3. However, the average absolute localisation will not be affected. Figure 5.51 Graph showing the mean, absolute, localisation error per decoder taking all three subjects into account. Figure 5.51 shows that, overall it is decoder 5 that seems to perform best in this test, with the downwards slope, starting with decoder 1, being clearly evident in this figure. Also evident is the already mentioned, relatively equal performance of all of the optimised decoders, with an average error between 100 and 160 compared to decoder 1’s average error of 210. Other non-recorded observations were also evident from this test, and are listed below: • • Head movement helped greatly in the localisation of sources in this experiment, and were used extensively by each listener. It was noted that although front and side sources were generally very stable (an impressive result by itself, when compared to amplitude panned material or the observations of Craven’s higher order decoder (Craven, 2003)), rear images only performed correctly when facing forwards. That is, when the subject turned to face the source, the two rear speakers were perceivable as sources. In these cases • all subjects recorded the position facing forwards. Front and side images were generally perceived at the same distance as the speakers, whereas rear images were perceived - 176 - Chapter 5 much closer to the head, almost on a line joining the two rear speakers of the ITU speaker array. The rear image problems are not wholly unexpected as it can be seen that rear images due to head turning and analysis using the velocity/energy vector methods all point to rear images performing less well. However, the fact that rear images can be formed at all, with a speaker hole of 1400, is still an impressive result. The 2nd part of the listening test was the auditioning of a 60 second except of a piano recording made in Lincoln Cathedral. 
Each listener heard each decoder's representation of this piece once and was then invited to call out which versions they wished to hear again. This was continued until a preference was given as to which decoder they thought performed best. The results of this test were as follows:

Preference      Subject 1   Subject 2   Subject 3
1st (best)          3           3           3
2nd                 5           2           5
3rd                 2           5           4
4th                 4           4           2
5th (worst)         1           1           1

Table 5.1 Table showing decoder preference when listening to a reverberant, pre-recorded piece of music.

The results showed a clear trend: decoder 1 was by far the worst of the five decoders, while decoder 3 was clearly preferred by all three listeners. This decoder, although not performing as well under head-turning analysis, is the only optimised decoder to show significant shifting of sources towards the front (see Figure 5.41), as shown in both the energy vector and HRTF analysis at high frequencies. This is not the same as simply using the forward dominance control, as decoder 3 maintains the overall perceived volume equally from all directions. This could, therefore, be a more subjective, artistic artefact of this decoder, although comments from the subjects did indicate some of the reasons for choosing it:

• Subjects 1 & 2 commented that decoders 5 & 2 (which they rated 2nd and 3rd, and 3rd and 2nd respectively) were very similar in performance, both with a slightly 'oppressive' sweet spot. This, interestingly, disappeared when auditioned off-centre. Decoder 3 did not suffer from this.
• Subject 1 mentioned that decoder 4 had a very wide, more diffuse image.
• All agreed that decoder 1 was very front heavy, with an obvious centre speaker, and two subjects mentioned that it was almost 'in-head' at the sweet spot, when compared to the other decoders.
• Subject 1 commented that the piano, when reproduced using decoder 3, had a very 'tangible' quality to it.

5.3.8.5 Listening Test Conclusions

The listening test, although only presented to a very small number of subjects, was a useful exercise, bringing to light a number of attributes that should be researched further. The most obvious result was that the unoptimised decoder, based on the standard settings of the commercially available B-Format decoder, clearly performed less well in both of the tests. This shows that both optimisation methods do improve the performance of Ambisonic decoders for a five speaker irregular array. The strong performance of decoder 5 in the first stage of the listening test (panned source) was also expected, although the differences between the decoders were, overall, more subtle than anticipated, and a much larger subject base would be needed to gain statistically significant results. However, the fact that the extremes of performance were shown in this small test is a very encouraging result. If this part of the test were to be carried out again, a number of changes would be made to try to remove any bias from the results:

• The order of presentation of the test decoders would be randomised. This may eliminate the general downward slope of the average localisation results observed for subjects 2 and 3.
• The test would be carried out over more than one day, testing each subject at least twice, to measure what kind of variation each was likely to produce.
• More source locations would be used so as to map the performance of each decoder more accurately.
•
Actual sources would be played at random, so that a ‘calibration’ source width is available to judge better the width parameter of • subject’s results. A distinction could be made between source stability and image location by running two separate tests (and allowing separate analyses on the results): 1. Where the subject is asked to face forwards at all times (knowing they will move their head a little, still). 2. Where the subject is asked to face each source before recording its position. Interestingly, the decoder that was unanimously voted as the ‘best’ decoder when listening to pre-recorded material was an unexpected result (however, the decoder perceived as ‘worst’ was not) with the middle group of decoders needing a larger base of subjects in order to gather a statistically significant result. Although this was a very simple test, with only one parameter, it did, indirectly, reveal some valuable insight into the performance of the decoders: • Most listeners are often surprised by the amount of variation that can be achieved just by altering the decoder, with spaciousness and envelopment being altered massively (especially when compared to • decoder 1). The sweet-spot problems with two of the four optimised decoders were particularly interesting, especially as these were, analytically, the best performing decoders. This suggests that over-optimising for a single position may, in fact, be detrimental to the performance of a • decoder. The best sounding decoder may not be the one that is, necessarily, the most accurate. Testing the performance of a decoder using pre-recorded material is far more difficult to grade when compared to the first test. A number of different recordings should be used and tests where the recording situation can be described by the listener and compared against later (i.e. actual source - 179 - Chapter 5 positions, size of room etc.) could be used to try to neutralise the artistic aspect of the decoder’s performance, if necessary. 5.4 The Optimisation of Binaural and Transaural Surround Sound Systems. 5.4.1 Introduction Both the Binaural and Transaural reproduction techniques are based upon HRTF technology and, for this reason, can be optimised using a similar approach. One of the main problems with synthesised (and recorded) Binaural material is that the reproduction is normally perceived as filtered. That is, the listener will not perceive the pinna filtering (and normally the microphone and headphone filtering too) present in the recording as transparent. Possible reasons for this could be that the pinna filtering on the recording does not match the listener’s, or because no head tracking is used: minute head movements can not be utilised to help lateralise the sound source and so the frequency response heard is assumed to be that of the source itself by the ear/brain system. A similar effect is experienced with the use of crosstalk cancellation filters. If a 2 x 2 set of impulse responses are inverted so as to create a pair of crosstalk cancellation filters, then the frequency response of these filters will be perceived, both on and off-axis, even though the theory states that this response is actually compensating for a pinna filtering response. The most logical method of correcting these artefacts is to use inverse filtering techniques. 5.4.2 Inverse Filtering Inverse filtering (which has already been touched upon in Chapter 3) is a subject that is very simple in principle, but takes a little more care and attention in practice. 
Inverse filtering is the creation of a filter whose response will completely equalise the response of the original signal. The general case is that of a filter created to force the response of a signal to a target response, and it is analogous to re-arranging an equation where the answer is already known and the value of a variable (in this case, a filter) needs to be found. The time domain representation of this problem is given in Equation (5.12).

a(n) ⊗ h(n) = u(n),  therefore  h(n) = u(n) / a(n)        (5.12)

where: a(n) = original response, u(n) = target response, h(n) = inverse filter (to be found).

In Equation (5.12) ⊗ represents polynomial multiplication (convolution) and the division represents polynomial division (deconvolution). A much more efficient approach to this problem is to process all of the data in the frequency domain using the Fast Fourier Transform algorithm. This transforms the polynomial arithmetic into much quicker point-for-point arithmetic (that is, the first value of 'u' is divided by the first value of 'a', and so on). These frequency domain equations are shown in Equation (5.13).

a(ω) × h(ω) = u(ω),  therefore  h(ω) = u(ω) / a(ω)        (5.13)

where: ω = angular frequency.

If we were to take a head related transfer function and find the inverse filter in this way, the filter shown in Figure 5.52 would be produced. There are a number of artefacts that can be observed, but first it should be noted that the magnitude response of the inverse filter already appears to be just that, the inverse response (a mirror image about the 0 dB mark), as given by the equations above (an inverse filter can be thought of as inverting the magnitude and negating the phase, as described in Gardner & Martin (1994)).

Figure 5.52 Inverse filtering using the equation shown in Equation (5.13)

Unwanted audio artefacts can be clearly seen in the time domain representation of the original and inverse signals convolved together (theoretically they should produce a perfect unit pulse if the inversion has been carried out successfully). Also, the inverse filter does not look complete, in that it does not have a definite start and end point as can be observed in most filter impulses (this, on its own, however, is not necessarily an issue). The problem seen in the time domain response of the two signals convolved can be quantified if the frequency domain magnitude response is calculated at a higher resolution, as shown in Figure 5.53 (the frequency domain plot in Figure 5.52 was calculated with a length equal to that of the filter). Analysis at this higher resolution shows the excessive ripple that has been introduced by this filter. This can be resolved, as in any other type of filter design, using windowing techniques (Paterson-Stephens & Bateman, 2001). However, the impulse response shown in Figure 5.52 is not yet in the correct format to have a window applied.

Figure 5.53 Frequency response of the original and inverse filters using an 8192 point F.F.T.

An F.I.R. filter³ is basically a cyclic signal that will wrap around onto itself. This means that when the inverse filter is calculated, the position of the filter (in the impulse space) is not necessarily correct. For example, the envelope of the filter created in Figure 5.52 is shown in Figure 5.54, along with the ideal position of this filter.

Figure 5.54 Typical envelope of an inverse filter and the envelope of the filter shown in Figure 5.52.
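As a point of reference before looking at Figure 5.54 in detail, the naive inversion of Equation (5.13) can be sketched in a few lines of MATLAB. This is illustrative only: the variable names are assumptions, and no regularisation, centring or windowing is applied yet, so it simply reproduces the kind of filter seen in Figure 5.52.

```matlab
% Naive inverse filter via point-for-point frequency domain division (Eq. 5.13).
% 'hrtf' is assumed to hold the impulse response to be inverted; the target
% response u(n) is taken to be a unit impulse.
hrtf = hrtf(:);                 % force a column vector
N = length(hrtf);

target = [1; zeros(N-1, 1)];    % u(n): unit impulse

A = fft(hrtf);                  % a(w)
U = fft(target);                % u(w)
H = U ./ A;                     % h(w) = u(w) / a(w), no regularisation

hInv = real(ifft(H));           % the inverse filter h(n)

% Convolving the original with its inverse should, in theory, give a unit
% pulse; in practice it shows the artefacts visible in Figure 5.52.
check = conv(hrtf, hInv);
```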
It can be seen in Figure 5.54 that it is desirable for the main impulse to be in the centre of the filter so as to maximise the number of samples given to pre and post delay processing for the sound. It is this main impulse that dictates 3 Finite Impulse Response – a filter with a fixed length that is convolved (polynomial multiplication) with a signal to apply the filter’s time and frequency response onto the signal. - 183 - Chapter 5 the overall time delay introduced by the filter. As the F.I.R. filter can be treated as a continuous wrappable signal, the impulse response can be repositioned by adding a delay to the response that is to be inverted, as shown in Figure 5.54. To move the main impulse to the centre of the filter, a delay of N/2 samples must be added, where N is the length of the target filter, in samples. This technique also has the benefit of improving the frequency response of the filter, as shown in Figure 5.55 (note that due to the extra samples (zero padded) added to the shifted filter, both filters have been calculated using 256 samples). Figure 5.55 Two F.I.R. filters containing identical samples, but the left filter’s envelope has been transformed. It can now be seen that the frequency response of the filter has been improved and much of the rippling has been eliminated. This results in a reduction of the artefacts seen in the time domain version of the original and inverse filters convolved (as shown in Figure 5.52, bottom left plot). This is shown in Figure 5.56. - 184 - Chapter 5 Figure 5.56 The convolution of the original filter and its inverse (both transformed and non-transformed versions from Figure 5.55). Now that the filter is in the correct format, a window function can be applied to smooth the response still further, and help reduce these time and frequency domain artefacts. The windowed response is shown in Figure 5.57. Using a limited filter size, this is the best realisable response without using the regularisation parameter described in Chapter 3. The only method of improving this further is to create a longer response using zero-padding of the filters used to calculate the inverse. However, the resulting size of the HRTF filters must be taken into account as convolution of the inverse filter and the original HRTF filter will cause its response to increase in size. If the HRTF filter is of length ‘a’ and the inverse filter is of length ‘b’ then the resulting filter will be of a length ‘a+b-1’, and the longer the filter, the more processing power will be needed for it’s implementation. The differences between using a windowed 256-point filter and a windowed 1024-point filter are shown in Figure 5.58. - 185 - Chapter 5 Figure 5.57 A frequency and time domain response of the filter after a hamming window has been applied. Figure 5.58 The response of a 1024-point windowed inverse filter. 5.4.3 Inverse Filtering of H.R.T.F. Data When inverse filtering the HRTF data, the only decision that has to be made is which HRTF will be used to equalise the whole HRTF set. Two logical choices are available: • The near ear response to a sound source at an angle of 900 as this will most likely be the filter with the least amount of pinna filtering affecting the response. - 186 - Chapter 5 • The ear’s response to sound directly in front of the listener so that when the sound is positioned at 00, the H.R.T.F. responses at the ears are identical and flat. The 1024-point inverse filters for both of these methods are shown in Figure 5.59. 
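Before examining Figure 5.59, the repositioning and windowing steps of section 5.4.2 can be added to the previous sketch. Again this is only a sketch under the same assumptions (hInv and hrtf as before); a circular shift is used here, which is equivalent to adding the N/2 sample delay described above before the inversion.

```matlab
% Centre the main impulse of the cyclic F.I.R. inverse filter, then window it.
N = length(hInv);
hCentred = circshift(hInv, floor(N/2));   % move the main impulse to the middle

hWin = hCentred .* hamming(N);            % Hamming window (Signal Processing
                                          % Toolbox), cf. Figure 5.57

% The original convolved with the centred, windowed inverse should now be much
% closer to a delayed unit pulse (cf. Figures 5.56 and 5.57).
check = conv(hrtf, hWin);
```

Run at a 1024-point length, the same procedure produces inverse filters of the kind compared in Figure 5.59.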
Looking at this figure it can be seen that, in reality, the 00 HRTF is far more ill-conditioned to the inversion process when compared to the 900 response. Some wrapping of the resulting filter can be seen for the 00 response indicating that a longer filter length is desirable. This is to be expected because of the reason stated above (the 900 angle has less head/pinna filtering associated with it) and so it is best to use the 900, near ear, HRTF as the reference response. Figure 5.59 The 1024-point inverse filters using a 900 and a 00, near ear, HRTF response as the signal to be inverted. As an example, a set of H.R.T.F. data has been processed in this way using an inverse filter size of 769-points (so that the convolution of the original with this inverse filter will be equal to 1024-points). Figure 5.60 shows a number of - 187 - Chapter 5 the H.R.T.F. impulses in the time and frequency domain so a comparison of them can be made both before and after inverse filtering. Before Inverse Filtering Figure 5.60 After Inverse Filtering Comparison of a HRTF data set (near ear only) before (right hand side) and after (left hand side) inverse filtering has been applied, using the 900, near ear, response as the reference. Figure 5.60 shows that although both sets of HRTFs still have a pinna filtering effect, the inverse filtered set have a larger bandwidth, in that extreme low and high frequency components of the impulse responses contain more energy, and contain peaks and troughs in the frequency response that are no larger the originals (for example, the 135 degree frequency response plots both have a notch no lower than around -27 dB). These inverse filtered HRTFs are perceived to be of a better fidelity than that of the originals (which have this response due, in some part, to the non-optimum inverse filtering of the source’s response that was used to record the HRTF data in the first place (Gardner & Martin, 1994)). It can also be seen that due to the nature of these new inverse filtered HRTFs, they could also be windowed and shrunk if - 188 - Chapter 5 smaller responses were needed due to processing constraints thanks to the roughly equal amount of pre and post delay filtering (i.e. the highest amplitude parts of the filter are at the middle sample position). 5.4.4 Inverse Filtering of H.R.T.F. Data to Improve Crosstalk Cancellation Filters. As mentioned at the start section 5.4, one of the problems of the crosstalk cancellation system is that very noticeable colouration of the reproduced sound can occur, both due to the crosstalk cancellation itself, and due to the response of the individual parts of the system (usually speaker to near ear, and speaker to far ear responses). This is why there is a difference between crosstalk cancellation in the free field and crosstalk cancellation using HRTF data. However, as discussed in Chapter 3, system inversion using frequencydependent regularisation can be used to compensate for this, at the expense of the accuracy of the crosstalk cancellation at these frequencies. For this reason, it is desirable to minimise any potential ill-conditioning due to the response of the individual components of the system prior to the 2 x 2 matrix inversion process, thus resulting in the least amount of regularisation needed in order to create a useable set of filters. In this way, the inverse technique described in section 5.4.2 will be utilised in much the same way. 
For example, the system shown in Figure 5.61 will be used as a basis for the creation of a pair of crosstalk cancellation filters. 100 Figure 5.61 System to be matrix inverted. This is a typical arrangement for a crosstalk cancellation system, and is based on a pair of speakers placed at +/- 50 in front of the listener. Using the HRTF - 189 - Chapter 5 set from M.I.T. (Gardner & Martin, 1994) this will give the responses for the near and far ears (assuming symmetry) as shown in Figure 5.61. Figure 5.62 HRTF responses for the ipsilateral and contralateral ear responses to the system shown in Figure 5.61. If a set of crosstalk cancellation filters are constructed from these two impulse responses, using the techniques described in Chapter 3, then the responses shown in Figure 5.63 are obtained (using no regularisation). Figure 5.63 Crosstalk cancellation filters derived using the near and far ear responses from Figure 5.62. It can be seen, from Figure 5.63, that the expected peaks are present. That is, a peak at very low frequencies due, mainly, to the close angular proximity of the speakers and the peaks at around 8 kHz and high frequencies, which appear to be due to the inversion of the responses of the near and far ear HRTFs (as seen in Figure 5.62). When this crosstalk cancelled system is auditioned, not only is a very coloured sound perceived off-axis, but a non-flat frequency response is also perceived on-axis. This is also coupled with a large loss in useable dynamic range as the amplifier and speakers have to reproduce such a large difference in frequency amplitudes. These are mainly - 190 - Chapter 5 because of the reasons stated at the start of section 5.4.1, but also because of the different pinna/head/ear responses observed for different listeners. A more general, yet correct inverse filtering method is needed to correct these problems. If regularisation is to be avoided as a last resort, then the responses shown in Figure 5.62 must be ‘flattened’ using inverse filtering techniques. As it is the difference between the near and far ear responses that is important, the filtering of these two responses will have only fidelity implications so long as the same filter is applied to both the near and far ear response. Also, the least ill-conditioned of the two responses is likely to be the near ear response, as it will have been filtered less by the head and pinna, so it is this response that will be taken as the reference (although, due to the small angular displacement of the speaker, there is little difference between the two filters). The inverse filter of the near ear HRTF is shown in Figure 5.64. Figure 5.64 Inverse filter response using the near ear H.R.T.F. from Figure 5.62. Applying this inverse filter to the ipsilateral and contralateral ear responses shown in Figure 5.62, gives the new ipsilateral and contralateral ear responses shown in Figure 5.65. If these filters are now used in the calculation of the crosstalk cancellation filters (using the 2 x 2 inverse filtering technique with no regularisation), then the filters shown in Figure 5.66 are obtained. - 191 - Chapter 5 Figure 5.65 Near and far ear responses after the application of the inverse filter shown in Figure 5.64 (frequency domain scaling identical to that of Figure 5.62). Figure 5.66 Crosstalk cancellation filters derived using the near and far ear responses from Figure 5.65 (frequency domain scaling identical to that of Figure 5.63). 
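To make the procedure concrete, the single inversion approach described above can be sketched as follows. This is a sketch only: hn and hf stand for the near and far ear responses of Figure 5.62, hNearInv for a centred, windowed inverse of the near ear response (built as in section 5.4.2), and the 2 x 2 inversion is written for the symmetric two-speaker case with no regularisation.

```matlab
% Single inverted crosstalk cancellation filters (illustrative sketch).
Nfft = 2048;                       % chosen longer than the equalised responses

% 1. Flatten both responses with the inverse of the near ear response
%    (cf. Figures 5.64 and 5.65).
hnEq = conv(hn(:), hNearInv(:));
hfEq = conv(hf(:), hNearInv(:));

% 2. Per-bin 2 x 2 matrix inversion for the symmetric case:
%    inv([Hn Hf; Hf Hn]) = [Hn -Hf; -Hf Hn] / (Hn^2 - Hf^2)
Hn = fft(hnEq, Nfft);
Hf = fft(hfEq, Nfft);
D  = Hn.^2 - Hf.^2;                % a regularisation term would be added to D if needed

h1 = real(ifft( Hn ./ D));         % crosstalk cancellation filters,
h2 = real(ifft(-Hf ./ D));         % cf. Figure 5.66
```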
The optimisation of these filters using inverse filtering techniques can be verified by observing the responses shown in Figure 5.66: • The overall response of both of the filters has been flattened with the largest peak above very low frequencies now at around 6dB at around 12.5 kHz, and virtually no peak at very high frequencies, which means • that regularisation is no longer needed at these frequencies. The peak at low frequencies is now solely due to the 2 x 2 matrix inversion and not the response of the ipsilateral and contralateral ear responses, which has reduced this peak from over 30dB to 20dB. This means that, although regularisation is still needed here, a smaller amount can be applied, making the crosstalk cancellation more accurate in this frequency range. - 192 - Chapter 5 • The flattening of the filter responses causes the on-axis response to be • perceived as much flatter (un-filtered) than before. • making off-axis listening seem far less filtered. The flattening of the filter responses also has the added effect of The crosstalk cancellation filters are actually smaller in length than the originals shown in Figure 5.63, even though the contralateral and ipsilateral ear responses used to calculate them were much larger than the originals shown in Figure 5.62. This is due to the fact the new near and far responses are much less ill-conditioned for inversion (the filters do not have to ‘work as hard’ to achieve crosstalk cancellation). These new crosstalk cancellation filters, although much better than filters created using the raw HRTF data, still need to use some regularisation, and still sound a little bass heavy. However, at this point, it is still possible to take the inverse filtering technique a step further. As always, it is the difference between the two ears that is important, especially as the pinna used in the HRTF data is not likely to be the same as that of the listener. So, using inverse filtering, it is possible to design crosstalk cancellation filters that require no regularisation to correct for the conditioning of the system. If the filter representing ‘h1’ is used as a reference, then another inverse filter can be created by inverting the response of ‘h1’. If this inverse filter is convolved with both h1 and h2 then the h1 filter will, in theory, become the unit impulse, and h2 will then be a filter representing the difference between h1 and h2. These filters are shown in Figure 5.67, and Figure 5.68. Figure 5.67 Filter representing inverse of h1, in both the time and frequency domain. - 193 - Chapter 5 Figure 5.68 Crosstalk cancellation filters after convolution with the inverse filter shown in figure 5.51 It can be seen from Figure 5.68 above that h1 has a flat frequency response and h2 now has very little energy over the 0dB point meaning that the system needs no regularisation. These new, double inverted, filters are also perceived as performing much better than the previous crosstalk cancellation filters, with a less muffled sound and clearer imaging. One other highly useful feature of these new filters is that h1 can be approximated by a unit impulse (as this is what h1 should be, theoretically, anyway) which cuts the amount of FIR filtering in the system by a half, replacing the h1 filters with a simple delay line, as shown in the block diagram in Figure 5.69. 
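A sketch of this double inversion step, and of the simplified playback structure it allows, is given below (Figure 5.69 itself follows). As before the names are illustrative: h1 and h2 are the filters from the previous sketch, Lear and Rear stand for the binaural (ear) signals to be reproduced, and in practice the inverse of h1 would be centred and windowed as described in section 5.4.2 rather than used raw.

```matlab
% Double inversion: force h1 towards a unit impulse and fold the remaining
% difference into h2 (cf. Figures 5.67 and 5.68).
H1 = fft(h1);
H2 = fft(h2);                              % h1 and h2 assumed the same length

H1inv = 1 ./ H1;                           % inverse of h1 (no regularisation here)
h2d   = real(ifft(H2 .* H1inv));           % 'difference' filter used for both channels

% Playback structure of Figure 5.69: h1 is replaced by a pure delay of m
% samples (m assumed here to match the centring of the filters).
m    = round(length(h2d) / 2);
nOut = length(Lear) + length(h2d) - 1;     % Lear and Rear assumed equal length

dL = [zeros(m,1); Lear(:); zeros(nOut - m - length(Lear), 1)];
dR = [zeros(m,1); Rear(:); zeros(nOut - m - length(Rear), 1)];

leftSpk  = dL + conv(h2d, Rear(:));        % delayed same-side ear signal plus
rightSpk = dR + conv(h2d, Lear(:));        % filtered opposite ear signal
```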
Figure 5.69 The optimised crosstalk cancellation system: each ear signal is fed through a delay line (Z^-m, where m is the delay line length) to its own speaker and through an H2 filter to the opposite speaker, the two paths being summed at each speaker output.

However, these double inverted filters do mean that when the speakers are positioned close to each other, the response can be perceived as lacking in bass when compared to the single inverted case (which is perceived as having a raised bass response anyway). For example, if we inject an impulse into the block diagram shown in Figure 5.69 (but replacing the delay line with the filters again) and compare the results that will arrive at the ear of a listener (although it should be noted that the analysis is using the non-optimum frequency response of the MIT HRTF data), the results shown in Figure 5.70 can be seen (note that the speakers in the University of Derby's Multi-channel Research Laboratory are actually placed at ±3°, and so filters for this speaker arrangement are shown in Figure 5.70).

Figure 5.70 Left Ear (blue) and Right Ear (red) responses to a single impulse injected into the left channel of double and single inverted crosstalk cancellation systems.

Both responses show a good degree of crosstalk cancellation in the right ear response, with the single inverted system seeming to perform slightly better. The low frequency roll-off can also be noted in the left ear response of the double inverted system. However, these quantitative results cannot necessarily be taken at face value. For example, the single inverted system (lower plot) is perceived as being bass heavy, although this is not shown in these graphs due to the non-optimum HRTF data used in this analysis. Also, the double inverted system is perceived as performing better at the higher frequencies, although this, again, is not suggested in this plot. It is also interesting to look at the same graphs for the ±30° case, as shown in Figure 5.71.

Figure 5.71 Left Ear (blue) and Right Ear (red) responses to a single impulse injected into the left channel of a crosstalk cancellation system.

This plot shows two significant results:

• The bass loss is no longer an issue. However, this is to be expected, as widening the speaker span alleviates the bass boost in the original filters which, in turn, means they do not need to be inverse filtered.
• The cancellation of the right ear signal is shown to be around 20 dB worse than that shown for the ±3° case.

This second point is interesting, as the crosstalk cancellation filters have been created in exactly the same way as in the ±3° case. This means that the same differences between the filters will be retained. The only absolute in the filtering process is the response due to the pinna alone, and it is this discrepancy that must be causing the problem. These two graphs suggest that the further apart the speakers, the more important the pinna matching between the listener and the filters becomes. This would explain why widening the speaker span degrades the localisation quality of this system.

5.5 Conclusions

Optimisation techniques have been described, analysed and discussed in this chapter, with the main part of this section concentrating on the optimisation of the Ambisonic decoders.
- 196 - Chapter 5 5.5.1 Ambisonic Optimisations Using Heuristic Search Methods The main problem to be tackled in this section was the derivation of Ambisonic decoders for irregular arrays, as, although Gerzon & Barton (1992) had suggested some parameters to be used in the design of these decoders, the solving of these equations was previously a lengthy and difficult process. In the analysis of the original work by Gerzon and Barton (1992 & 1998) it was found that: • • Multiple values could be chosen that would satisfy these equations, analytically performing equally well. The original coefficients suggested by Gerzon & Barton (1992) were actually non-ideal, with an oversight in the way in which the equations were initially solved leading to a mismatch between the low and high frequency decoders’ perceived source position. Various new methods have been devised and implemented in software to solve these problems: • A heuristic search method, based on a Tabu search algorithm, has been developed, along with the fitness functions that need to be satisfied in order to automatically generate decoders for irregular speaker arrays. This method has the three following benefits: o It automatically solves the non-linear simultaneous equations in an optimal way. o Changing the start position for the search will generate a different set of coefficients. o This method solves all the parameters of the equations simultaneously which corrects for the low and high frequency decoder mismatch found in Gerzon & Barton’s method (Gerzon & Barton 1992 and Gerzon & Barton • 1998). An analysis technique based on the use of generic HRTF data has been devised to help differentiate between Ambisonic decoders designed using the above method, using head turning as an additional parameter as phase and level differences will generally be similar for each decoder. - 197 - Chapter 5 The tabu search method has also been shown to work well on the new higher order decoder types, such as the one proposed by Craven (2003), which has far more coefficients to optimise, demonstrating that the Tabu search methodology is easily extendible to more unknowns (either a higher order, or more speakers). The HRTF analysis technique described above was also used to validate the original work by Gerzon & Barton (1992) which then led to the creation of a heuristic search program, with corresponding fitness functions, used to design Ambisonic decoders for irregular arrays using the HRTF analysis technique first proposed in Wiggins et al. (2001) taking into account head turning directly, so reducing the number of decoders produced. The properties of this new technique are as follows: • For a two-band decoder the correlation between decoders designed using the velocity/energy vector methods and HRTF methods are • good. Using the HRTF technique a decoder could be designed using more frequency bands, which is impossible using the previous • velocity/energy vector method. However, the HRTF decoder method is far more computationally expensive and it does take the tabu search algorithm longer to converge on an optimum result, but as this is an off-line process anyway, this is not a major issue. A small listening test was carried out using both synthetically panned material and pre-recorded material in order to help steer future listening tests aimed at optimised Ambisonic decoders. 
Although only three subjects were used, the decoder that performed worst in both tests was unanimously seen as an unoptimised decoder based on the default settings of a commercially available B-format decoder for the ITU irregular speaker array. However, although many more subjects would be needed to gain statistically significant results, all the optimised decoders performed well, with the expected decoder performing best in the synthetically panned listening test. As expected, there were no great differences between decoders designed using either - 198 - Chapter 5 optimisation method, as the two systems correlate well with respect to coefficients and, in fact, slightly less optimal decoders seemed to perform well when recorded, reverberant material was auditioned by the test subjects. Also, one reported observation was that the most optimal decoders seemed to deliver a more pleasant listening experience slightly off-centre (when compared to the same decoder in the sweet spot), which is an extremely interesting result that needs to be investigated further. In summary, the use of the Tabu search algorithm has resulted in a vast simplification of the process of designing Ambisonic decoders, allowing for the Vienna equations (Gerzon & Barton, 1992 & 1998) to be solved correctly for irregular speaker arrangements (although the software concentrates on a typical five speaker horizontal arrangement). This has then been taken a step further through the use of the HRTF data directly. 5.5.2 Further Work for Ambisonic Decoder Optimisation. Now that the decoder design algorithm can directly use HRTF data the obvious next step is to increase the number of frequency bands. When taking this method to its extreme, this will mean that instead of using cross-over filters, a W, X and Y filter will be created for each of the speaker pairs (or 1 set for the centre speaker). In this way it should be possible to maximise the correctness of both the level and time differences simultaneously for many frequency bands improving the performance of the decoder still further for a centrally seated listener. The software could also be extended to take into account off-centre listening positions which could, potentially, lead to a control over the sweet spot size, trading the performance at the centre, for the performance around this spot. This may well be beneficial, not only to create a ‘volume solution’, but to also circumvent the problems noticed in the listening test with respect to the more optimum decoders, analytically speaking, giving a slightly uncomfortable, obtrusive listening experience directly in the sweet spot. - 199 - Chapter 5 5.5.3 Binaural and Transaural Optimisations Using Inverse Filtering. The use of inverse filtering techniques on HRTF data has proved an invaluable tool in the optimisation of both Binaural and Transaural reproduction. An improvement in the frequency response of the crosstalk cancellation filters has been demonstrated which is apparent both on and off axis from the cancellation position. This reduces the need to use the frequency dependant regularisation function; although at the extreme upper frequencies (where little energy in the HRTF data is present) it is still advisable to use regularisation to stop the excessive boost of these frequencies. It has also been shown how moving the speakers closer together has the effect of improving the analytical crosstalk cancellation figure between the ears of a listener in the sweet spot. 
This has to be a feature of the pinna filtering mismatches as the differences between the creation and analysis HRTF filters were kept constant, with only the monaural pinna filtering having changed (all the work was based around the same set of HRTF filters and pinna differences between the ears are kept constant). 5.5.4 Further Work for Binaural and Transaural Optimisations. A method to control the amount of inverse filtering that is carried out on the crosstalk cancellation filters must be used as the single inverted filters sound bass heavy, and the double inverted filters are bass light. This can be done by carrying out the following steps: • Create the inverse filter in the frequency domain and split into • magnitude and phase. • the frequency domain and split into magnitude and phase. • ratio, and use the phase from the unit impulse. Create a unit impulse, delayed by half the length of the inverse filter, in Crossfade the magnitude responses of the two filters using the desired Mix the magnitude and phase of this filter back into its complex form and inverse FFT into the time domain. - 200 - Chapter 5 • This will result in a filter that has a linear phase response (that is, pure delay) and a magnitude response can be chosen from flat to the • magnitude response of the inverse filter. Use the above filter as the 2nd inversion filter in the creation process of the crosstalk cancellation filters. Once the above steps have been carried out, listening tests can be carried out to determine which filters are perceived as having the flattest response. 5.5.5 Conversion of Ambisonics to Binaural to Transaural Reproduction Although the conversion from the base format of Ambisonics has been described in Chapter 4, there are still some ongoing issues that have meant that listening tests on this part of the project have not taken place. During this project all of the systems have been looked at separately with main optimisation work carried out on the Ambisonics decodes and the crosstalk cancellation systems. The conversion of Ambisonics to binaural is now well documented (see Noisetering et al., 2003 for the most recent overview) and this, coupled with the inverse filtering techniques described in section 5.4 works well. Similarly, playing a standard binaural recording over the two speaker crosstalk cancelled system described in the same section also works well, with the inverse filtering techniques resulting in a much flatter, un-filtered sound when compared to a crosstalk cancelled system using raw HRTF data. However, when combining these two steps and attempting to reproduce an Ambisonic decode over either a two or four speaker crosstalk cancelled array, suboptimal results are experienced with heavily filtered results perceived. Further work is needed in this area to bring this conversion process up to an acceptable level. However, for further work the following avenues will be investigated: • The use of Bumlein’s shuffling technique in order to convert a coincident recording into a spaced one at low frequencies will be attempted as this will remove the need for Ambisonic to binaural - 201 - Chapter 5 conversion step, and will reduce some of the filtering applied to the • system. 
The crosstalk cancellation and Ambisonic to binaural conversion steps are taken in isolation; however, the filtering and calculation of crosstalk cancellation filters can be combined by using the Ambisonic to binaural decode function shown in equation (4.3), as the target function for the crosstalk cancellation inversion equation shown in equation (3.13). This will mean that inverse filtering is not needed as the filters response to pinna should, to some extent, cancel each other out, resulting in a less filtered system. - 202 - Chapter 6 Chapter 6 - Implementation of a Hierarchical Surround Sound System. While carrying out this research it became apparent that although the Matlab/Simulink platform was very useful in the auditioning and simulation of surround sound systems, more efficient results (with regards to processor loading) could be achieved, particularly when FIR filtering, if custom programs were written for the Windows platform using the Win32 API. In this chapter the various signal processing algorithms and implementation details will be discussed, so as to build up a library of functions to be used in multi-channel audio applications. The platform specific code will then be investigated so that an audio base class can be constructed, and it is this class that will form the basis for audio applications. Once the necessary background information and techniques have been discussed, an example application based upon the surround sound system described in Chapter 4 will be covered. 6.1 Introduction At the beginning of this research it was assumed that the best platform for the implementation of a system that relied on digital signal processing techniques was one based around a digital signal processor. However, this seemingly logical assumption has now been challenged (Lopez & Gonzalez, 2001). Around ten years ago D.S.P. devices were far faster than home computers processors (Intel, IBM, etc.), but whereas D.S.P. core speeds have been increasing at a steady rate (approximately doubling every two years), the rate of increase of core speed of a P.C. processor is now doubling every year. This has resulted in the processing power available on fast PCs now being greater than that available on more expensive D.S.P. chips (Lopez & Gonzalez, 2001). As much of the testing and algorithm development was - 203 - Chapter 6 already taking place on a PC platform (using Matlab® and Simulink®) it soon became apparent that this platform would be suitable for the final implementation of the system and, in some ways, be far more suited than a dedicated D.S.P. platform. Using the PC as a signal processing platform is not a new idea (Lopez & Gonzalez, 2001; Farina et al., 2001), but has not been viable for surround sound until fairly recently. This is mainly due to the fact that reasonably priced, multi-channel cards (16 or more channels) are now readily available and are not only the perfect test platform for this surround sound project, but also, once the technology is in place, they provide a perfect platform to actually develop surround sound software. It is, of course, also due to the fact that Intel’s Pentium and AMD’s Athlon processors are now very powerful and can easily process over 32-channels of audio in real-time. Therefore, convolving long filters with just a few channels of audio (as in crosstalk cancellation) is not a problem for today’s PCs (assuming efficient algorithms are used, see later in this chapter). So, when it comes to developing such a system, what options are available? 
• • • Home PC computer (Host Signal Processing). Digital Signal Processor Platform. Hybrid of the two. Each of the systems described above have their pros and cons and each of these methods have been utilised, at some point, during this project. A description of each will be given. 6.1.1 Digital Signal Processing Platform A Digital Signal Processor is basically a fast micro-processor that has been designed and optimised with signal processing applications in mind from the outset (Paterson-Stephens & Bateman, 2001). This means that it generally has a more complex memory structure when compared to a ‘normal’ microprocessor and a more specialised command set. An example of a memory structure used by D.S.P.s is a system is known as dual-Harvard architecture. A standard micro-processor is normally designed around the von Neumann - 204 - Chapter 6 architecture (Paterson-Stephens & Bateman, 2001), and although a thorough investigation into these techniques is not part of the scope of this project, a brief explanation will be given to help differentiate between D.S.P.s and PC micro-processors. Von Neumann architecture is reasonably straightforward, having one memory space, one internal data bus and one internal address bus. All of these components are used in the reading and writing of data to and from memory locations etc.. A diagrammatic view of von Neumann architecture is shown in Figure 6.1. Basically the Internal Address Bus selects what data is to be read/written, and then this is sent to the C.P.U. or A.L.U. for processing along the internal data bus. Internal Address Bus ALU Register File Instruction decode and CPU control I/O Devices Shared Program and Data Memory ALU Internal Data Bus Figure 6.1 A Von Neumann Architecture. A Harvard architecture (see Figure 6.2) based micro-processor (common in D.S.P. devices) has a very similar layout to the von Neumann architecture, except that three memory spaces, three address buses and three data buses are used as follows: one address bus, memory space, and data bus for program memory, one for X data memory and one for Y data memory. This means that the D.S.P. device can access memory more efficiently, being able to read/write up to three memory locations per clock cycle, as opposed to one using Von Neumann architecture. Also, a more complex Address Generation Unit (A.G.U.) is normally included that can handle such things as modulo address (circular buffering) and bit-reversed addressing (used in Fast Fourier Transforms). This is another task that is taken away from the main processor incurring no extra processor overhead. - 205 - Chapter 6 As explained above, it is mainly the architecture of the system that differentiates between a D.S.P. and a PC micro-processor. However, another difference between a D.S.P. and a PC is that a D.S.P. has no ‘operating system’ as such (although specialised real-time operating systems can be employed). That is, each D.S.P. platform is configured for optimal performance using whatever peripherals are used with it. It is not a general, ‘jack of all trades’ with flexibility being the key feature, like a PC. The advantages of not having an operating system will become more apparent when discussing the PC platform. The D.S.P. platform is designed for realtime processing, that is, processing containing no perceivable delay. 
Figure 6.2 Diagram of a Harvard architecture.

6.1.2 Host Signal Processing Platform (home computer)

A PC (or Apple Macintosh) can be used as a system for carrying out digital signal processing. This is now a viable solution because processors for these platforms have become very fast, and the distinctions between the micro-processor and the D.S.P. are becoming more blurred as the PC gains more low-level optimisations for signal processing applications (such as streamed music and video via the World Wide Web). One of the PC's biggest assets, and potentially its largest limiting factor, is its operating system. In this project the Windows 2000 operating system was used. This operating system was chosen as it is more stable than Windows 98, is compatible with more software than Windows NT and uses fewer resources than Windows XP. In any case, all these Microsoft platforms use the same API, namely Win32.

Firstly, the reason that the operating system is the PC's greatest asset is that its A.P.I. simplifies many operations on the PC and makes programming graphical user interfaces relatively straightforward (as opposed to generating code to run, say, a separate LCD display). Also, the operating system handles all the calls to peripherals using a standard function set. This means that the programmer does not need to know exactly what hardware is in the machine, but can simply query Windows as to whether the hardware meets the requirements needed (e.g. it has the correct number of channels available). The operating system also has disadvantages for similar reasons. Windows is a graphical user environment, that is, it is geared towards graphical applications. Audio, of course, is very well supported, but must be accessed using the Windows A.P.I.; direct access to the underlying hardware is not possible under Windows. When using this API, it is soon noticed that considerable latency can be introduced both when taking audio in and when passing it out and, although this latency can be specified (within limits), the lower the latency, the more unstable the system. This will be explained in more detail later in this chapter.

6.1.3 Hybrid System

The most user-friendly technique for developing such a system is to use a hybrid system comprising the two systems mentioned above. This system would not only be a relatively easy system to develop, but would also be very cost effective as a product, as half of the hardware platform (i.e. the PC) would already be in place. It would include the positive aspects of both of the above systems, with a graphical user interface being programmed and realised on the host PC, but with the actual processing of the audio stream being handled by the D.S.P. card, meaning that latency is no longer a problem, and tried and tested G.U.I. techniques can be utilised on the P.C. side. Such a system can be devoid of any noticeable latency as the P.C. side is used just to update a few parameters on the D.S.P. card. For example, if a three-dimensional panning algorithm were to be implemented, then the D.S.P. card would handle all of the audio passing through the system, mixing the audio signals together and passing the sounds to the correct speakers at the correct levels.
The P.C. would pass just the co-ordinates of where the virtual sources are to be panned to. This also has the benefit of taking some of the processing load off the D.S.P. card, as the P.C. can be used to calculate coefficients that may rely on computationally expensive floating point calculations, such as square roots and trigonometric functions, with the results passed to the D.S.P. card for use.

6.2 Hierarchical Surround Sound System – Implementation

Although, as mentioned above, the hybrid system is the ideal solution for the development of the hierarchical surround sound system, it was not a practical solution for this particular project, mainly due to the cost of D.S.P. development boards with true multi-channel capability (although an affordable multi-channel board has since become available from Analogue Devices®). Thus, as much of the testing and investigative work was carried out using a P.C. with a multi-channel sound card (using Matlab, Simulink and a Soundscape Mixtreme 16-channel sound card), it was decided that this would be the platform used for the realisation of the project's software. For the explanation of the software application developed as part of this project, this section will be split into two main sub-sections:

• The techniques and algorithms needed for the successful implementation of the system described in Chapters 3, 4 and 5.
• An explanation of the Windows platform, its associated A.P.I.s, and the considerations and techniques required for this platform-specific programming task.

6.2.1 System To Be Implemented

Figure 6.3 shows a simplified block diagram of the proposed hierarchical surround sound system.

Figure 6.3 The hierarchical surround sound system to be implemented (recorded/panned signals pass through an encoding block and sound-field manipulations on an n-channel carrier, and are then decoded to an n-speaker output, a 2-speaker transaural output or a 2-channel binaural output).

It can be seen from this block diagram that the proposed system has a number of distinct sections:

• Recording of input signals, which will be in 1st order B-format in this example.
• Sounds will be able to be manipulated internally (rotated, for example) while in B-format.
• These four-channel B-format signals will then be decoded in one of three ways:
   o Multi-speaker panned output.
   o 2 or 4 speaker transaural output.
   o 2-channel binaural output.

In order to describe how these functions will be implemented in a C++ environment it is necessary to understand how the Windows operating system will pass the data:

• The sound data will be presented in buffers of a fixed size (a size that is set by the application itself).
• The sound data will initially be passed to a buffer as 8-bit unsigned values (char), although the application will always be dealing with 16-bit signed integers (short) at the input and output stages.
• All intermediate processing will then take place at 32-bit floating point precision.
• The application will use 8 channels in and 8 channels out from a single sound card.

6.2.2 Fast Convolution

One of the most processor-intensive functions needed in the hierarchical surround sound software is convolution, which is required for the binaural and transaural reproduction systems. Also, for accuracy, it is desirable for the cross-over filtering needed in the Ambisonic decoders to be carried out using F.I.R.
filters, as these possess linear phase responses in the pass band (that is, pure delay), and so will cause the least distortion to the audio when the two separate signals are mixed back together (as long as the filter length, and therefore delay, is the same for each of the filters). F.I.R. filters are simple to implement in the time domain (the process is the same as polynomial multiplication) but are computationally expensive to perform this way. Filtering of this kind is handled much more efficiently in the frequency domain, thanks to the Fast Fourier Transform algorithm. However, convolving two signals together in the frequency domain is slightly more complex when compared to its time domain equivalent. To understand why other considerations must be taken into account for frequency domain convolution, let us first consider the time domain version of the convolution algorithm. If we have two signals, c and h, where c is the signal to be convolved and h is the impulse response that the signal will be convolved with, the convolution of these two signals is given by Equation (6.1).

y = c ⊗ h

y(n) = Σ_{i=1}^{128} c(n − i) h(i)      (6.1)

where:
y = result
n = sample number
i = index into the impulse response

In the above case, the impulse that is to be convolved with the signal is 128 samples long, and it can be seen that the convolution process works on the past 128 samples of the signal. In programming terms this suggests that the algorithm can be implemented using a circular buffer that stores the current sample and the 128 samples preceding it. If the impulse is stored in another circular buffer, then the implementation of this algorithm will follow the block diagram shown in Figure 6.4.

Figure 6.4 Time domain convolution function (a tapped delay line whose outputs are weighted by h(0) … h(i) and summed to give y(n)).

From Figure 6.4 it can be seen that this algorithm will take 'i' multiplies and additions per sample which, considering that 128 samples represents an impulse response length of only 0.003 seconds at a sampling rate of 44.1 kHz, would not be suitable for longer impulses. So, how can this algorithm be transferred to the frequency domain? It has already been noted that time domain convolution (polynomial multiplication) is the same as frequency domain point-for-point multiplication, and this fact can be used to improve the speed of this algorithm. Taking this into account for a fixed length signal is relatively straightforward. If the original signal is 256 samples long and the impulse is 128 samples, then as long as the F.F.T. size used is at least as long as the final length of the convolved signal (256 + 128 − 1 = 383 samples), both signals can be transferred into the frequency domain, multiplied point for point (note that this is a multiplication of complex numbers), and an inverse F.F.T. applied. However, if the incoming signal needs to be monitored as it is being fed into the system (as in a real-time system) then, obviously, we cannot wait to find out the length of the signal in question; the incoming signal must be split up into slices (which is what happens in a computer anyway). Furthermore, once the signal has been split up, this simple frequency domain convolution will not work correctly; that is, you cannot just multiply a slice by the frequency domain impulse and inverse F.F.T. it again, as the result of each convolution is longer than the original slice.
Therefore, some form of overlap-add scheme must be used (Paterson-Stephens & Bateman, 2001). A block diagram showing this process is shown in Figure 6.5.

Figure 6.5 Fast convolution algorithm (each slice c0 … c3 is zero-padded, multiplied by the zero-padded frequency domain impulse h, inverse transformed, and the overlapping portions of the I.F.F.T.ed results are summed to form the final convolved signal).

The example shown in Figure 6.5 uses a slice length of 128 samples, an impulse length of 100 samples, and a zero-padded F.F.T. length of 256 samples (as 128 + 100 − 1 = 227 samples, and 256 is the next power of 2 higher than this). This means that the minimum latency achievable by this method is set by the slice size. This is a specific example of the overlap-add system, but it shows perhaps the simplest overlap relationship between the multiplied segments. A more general relationship between the length of the slice and the overlap for summation is given in Equation (6.2).

Summation Overlap = (F.F.T. Length) − (Length of Slice)      (6.2)

where: (Length of Slice) + (Length of Impulse) − 1 ≤ F.F.T. Length.

So, for example, if the slice length is equal to 225 and the impulse length is 32, then the F.F.T. size could still be 256 (225 + 32 − 1 = 256), and the summation overlap would be 31 (256 − 225 = 31). This is a useful parameter to know so that the length of the input slice can be maximised when compared to the F.F.T. size, increasing the efficiency of the program (making more of the multiplies count, so to speak). For example, if an F.F.T. size of 256 samples was to be used and the impulse had a length of 32 samples, then a slice size of 225 should be used so as to minimise the summation overlap, and so minimise the number of slices that the sound must be divided into (and, hence, the number of times the algorithm must be carried out).

Due to the number of specific function calls and number types that are needed for this algorithm, it will be described in C later, when discussing the more platform-specific parts of the application. However, as an example, the Matlab code for such an algorithm is given in Table 6.1.

slicesize=225; impsize=32; fftsize=256;
if slicesize+impsize-1>fftsize
    error('FFT size must be GREATER or EQUAL to slicesize+impsize-1')
end
%Load signal and impulse
ht=wavread('h0e045a.wav');
ct=wavread('Test.wav');
%Convert stereo files to a mono array
c=ct(:,2)';
h=ht(1:impsize,2)';
%create frequency domain impulse
fh=fft(h,fftsize);
%clear temp storage for summation block
told=zeros(1,fftsize);
%zero pad signal, if not an exact multiple of the slice size
if length(c)/slicesize~=ceil(length(c)/slicesize)
    c(length(c)+1:slicesize*ceil(length(c)/slicesize))=0;
end
for i=1:slicesize:length(c)
    %create frequency domain slice
    fc=fft(c(i:i+slicesize-1),fftsize);
    %multiply with impulse
    fr=fh.*fc;
    %IFFT result
    r = real(ifft(fr,fftsize));
    %Summation of result (res) with portion of last result (told)
    res(i:i+slicesize-1) = r(1:slicesize) + told(1:slicesize);
    %update using last result ready for summation next time.
    told=zeros(1,fftsize);
    told(1:fftsize-slicesize) = r(slicesize+1:fftsize);
end

Table 6.1 Matlab code used for the fast convolution of two wave files.

6.2.3 Decoding Algorithms

The crux of the algorithmic work carried out during this research is concerned with the decoding of the B-format (1st or 2nd order) signal, and it is these algorithms that will be discussed here.
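The decoder functions in the remainder of this section make repeated use of an overlap-add FIR routine (OverAddFir) whose C implementation belongs with the platform-specific code. As a reference point, a minimal, self-contained C++ sketch of the same overlap-add process as Table 6.1 is given below. It is a sketch only: the dft helper is a deliberately naive O(N²) transform so that the example stands alone (the real application uses a fast F.F.T. routine from the Intel Signal Processing Library), and, like Table 6.1, it assumes that the tail of each block (fftSize − sliceSize samples) is no longer than the slice itself.

#include <algorithm>
#include <complex>
#include <cstddef>
#include <vector>

using cpx = std::complex<float>;

// Naive DFT/IDFT used purely to keep the sketch self-contained; a real
// implementation would call an F.F.T. library here instead.
static std::vector<cpx> dft(const std::vector<cpx>& x, bool inverse)
{
    const std::size_t N = x.size();
    const float sign = inverse ? 1.0f : -1.0f;
    std::vector<cpx> X(N);
    for (std::size_t k = 0; k < N; ++k)
        for (std::size_t n = 0; n < N; ++n)
            X[k] += x[n] * std::polar(1.0f, sign * 6.2831853f * k * n / N);
    if (inverse)
        for (cpx& v : X) v /= static_cast<float>(N);
    return X;
}

// Overlap-add convolution of 'signal' with 'impulse' (mirrors Table 6.1).
// Requires sliceSize + impulse.size() - 1 <= fftSize (Equation 6.2).
std::vector<float> overlapAddConvolve(const std::vector<float>& signal,
                                      const std::vector<float>& impulse,
                                      std::size_t sliceSize, std::size_t fftSize)
{
    std::vector<cpx> h(fftSize);                         // zero-padded impulse
    for (std::size_t i = 0; i < impulse.size(); ++i) h[i] = impulse[i];
    const std::vector<cpx> H = dft(h, false);            // frequency domain impulse

    std::vector<float> out(signal.size() + impulse.size() - 1, 0.0f);
    std::vector<float> tail(fftSize, 0.0f);              // 'told' in Table 6.1

    for (std::size_t start = 0; start < signal.size(); start += sliceSize)
    {
        std::vector<cpx> slice(fftSize);                 // zero-padded slice
        for (std::size_t i = 0; i < sliceSize && start + i < signal.size(); ++i)
            slice[i] = signal[start + i];

        std::vector<cpx> S = dft(slice, false);
        for (std::size_t k = 0; k < fftSize; ++k) S[k] *= H[k];  // point-for-point multiply
        const std::vector<cpx> r = dft(S, true);                 // back to the time domain

        // Sum the first sliceSize output samples with the previous block's tail.
        for (std::size_t i = 0; i < sliceSize && start + i < out.size(); ++i)
            out[start + i] = r[i].real() + tail[i];

        // Remember this block's tail, ready for summation next time round.
        std::fill(tail.begin(), tail.end(), 0.0f);
        for (std::size_t i = 0; i + sliceSize < fftSize; ++i)
            tail[i] = r[i + sliceSize].real();
    }
    return out;
}

In the real-time application the same bookkeeping is performed one buffer at a time, with the tail array kept as persistent state between calls, which is exactly the role of the 'old' arrays passed to OverAddFir in the listings that follow.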
As all of the decoders (apart from the simplest multi-speaker decoders) rely on filtering techniques, they will be utilising the frequency domain filtering techniques discussed in section 6.2.2. The first step in all of the decoding schemes is to decode the Ambisonics audio to multiple speakers, as it was originally intended. As discussed in Chapter 5, for the most psychoacoustically correct decoding methods, crossover filtering must be used. So far, it has been established that the samples will arrive for processing, and be passed back into a 2-dimensional array, as this is the most flexible system of holding multi-channel audio data in memory. These Ambisonic audio streams will normally consist of 3, 5, 4 or 9 channels of audio data (1st order horizontal only, 2nd order horizontal only, full 1st order, or full 2nd order, respectively). The actual derivation of the coefficients needed for this process was covered in Chapter 5 and so will not be repeated here. All of the speaker feeds in an Ambisonic system are derived using combinations of the various channels available. To this end, it can be useful to specify an Ambisonic structure specifically so as to simplify writing audio applications later on. The structure used to represent an Ambisonic (1st or 2nd order) carrier will comprise: • • • Nine pointers to floats. An integer length parameter. A Boolean flag indicating a 1st or 2nd order stream. - 214 - Chapter 6 The decision as whether to make the Ambi variable a structure or a class was taken early on in this research, where a structure was decided upon. This was mainly because any functions using this Ambi variable would have to be made global functions, and so not associated with any Ambi structure in particular, and this was thought to be a less confusing system when dealing with more than one Ambisonic stream. However, in hindsight, it would have made little difference either way. The code for an Ambi structure is given in Table 6.2. #define FIRSTORDER 0 #define SECONDORDER 1 struct Ambi { float *W,*X,*Y,*Z,*R,*S,*T,*U,*V; int Length; bool Order; }; void AllocateAmbi(Ambi *aSig, const int iLen, bool bAllocChannels, bool bOrder) { aSig->Length = iLen; aSig->Order = bOrder; if(bAllocChannels) { aSig->W = new float[iLen]; aSig->X = new float[iLen]; aSig->Y = new float[iLen]; aSig->Z = new float[iLen]; if(bOrder==SECONDORDER) { aSig->R = new float[iLen]; aSig->S = new float[iLen]; aSig->T = new float[iLen]; aSig->U = new float[iLen]; aSig->V = new float[iLen]; } } } Table 6.2 Ambi Structure. Included in Table 6.2 is a function for allocating memory dynamically and setting the other flags for the Ambi structure. A choice of whether to allocate memory is necessary as two situations are possible: • The sources are entering the system as mono signals that are to be panned. The extra channels needed for an Ambisonic signal must be allocated. - 215 - Chapter 6 • A B-format signal (1st or 2nd order) is entering the system. These channels can be used directly by assigning pointers directly to these channels. As described in Chapter 5, there are two methods of decoding to an Ambisonic array. There is decoding to a regular array, and decoding to an irregular array. 
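Before looking at these two cases, a brief usage sketch of the Ambi structure may help to illustrate the two allocation situations listed above (the buffer length of 1024 and the Samples array, assumed here to hold the de-interlaced input channels, are illustrative assumptions only):

// Case 1: a mono source is to be panned, so the B-format channels must be
// allocated by the structure itself.
Ambi aPanned;
AllocateAmbi(&aPanned, 1024, true, FIRSTORDER);
// ...panning code then fills aPanned.W, aPanned.X, aPanned.Y and aPanned.Z...

// Case 2: a B-format signal is already held in the application's own buffers
// (here, a hypothetical float **Samples array), so the channel pointers are
// simply assigned and no new memory is allocated.
Ambi aLive;
AllocateAmbi(&aLive, 1024, false, FIRSTORDER);
aLive.W = Samples[0];
aLive.X = Samples[1];
aLive.Y = Samples[2];
aLive.Z = Samples[3];

In the second case, care must obviously be taken not to free memory that the structure does not own.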
Of course, the decoding for a regular array is really just a special case of the irregular decoding (all of the speakers have their virtual polar responses pointing in the speaker directions, with just the polar pattern altering for different frequency bands), and it has also been observed that for particularly large arrays even simpler decoding should be used (Malham, 1998), limiting the amount of out-of-phase signal emanating from the speakers opposite the desired virtual source position. Let us first take the regular array case, as this is the simplest. A simple block diagram of this system is shown in Figure 6.6.

Figure 6.6 The regular array decoding problem (the B-format signal is low and high pass filtered, each band is decoded using its own polar pattern and the speaker position angles converted to Cartesian co-ordinates, and the two bands are summed to give the multi-speaker output).

Figure 6.6 shows that several parameters and settings are needed for the decoder to act upon:

• The angular positions of the speakers, converted to Cartesian co-ordinates using the Ambisonic decoding equations given in Equation (3.4).
• Both a low frequency and a high frequency directivity factor, as shown in Equation (3.4). It is these two parameters that set the frequency dependent decoding. For frequency independent decoding, both parameters are set to the same value (0 – 2 = omni – figure-of-eight).

Several functions are needed to carry out this decoding while minimising processing at run time. Mainly, this is achieved by the speaker position function. As the speakers are unlikely to move while the system is in use, the Cartesian co-ordinates of the polar patterns routed to the speakers can be fixed. This means that all of the sine and cosine function calls can be made before the real-time part of the application is run (sine and cosine functions are computationally expensive). A function used to calculate these decoding coefficients is shown in Table 6.3.

float ** DecoderCalc(float *fAzim, float *fElev, const int NoOfSpeakers, bool Order)
{
    float **Result;
    //If a 2nd order decoder is needed, 9 rows
    if(Order)
        Result = Alloc2D(9,NoOfSpeakers);
    //If a 1st order decoder is needed, 4 rows
    else
        Result = Alloc2D(4,NoOfSpeakers);

    for(int i=0;i<NoOfSpeakers;i++)
    {
        Result[0][i] = sqrt(2);                          //take off W offset of 0.707
        Result[1][i] = cos(fAzim[i])*cos(fElev[i]);      //X
        Result[2][i] = sin(fAzim[i])*cos(fElev[i]);      //Y
        Result[3][i] = sin(fElev[i]);                    //Z
        if(Order)
        {
            Result[4][i] = 1.5f*sin(fElev[i])*sin(fElev[i]);              //R
            Result[5][i] = cos(fAzim[i])*sin(2*fElev[i]);                 //S
            Result[6][i] = sin(fAzim[i])*sin(2*fElev[i]);                 //T
            Result[7][i] = cos(2*fAzim[i])*cos(fElev[i])*cos(fElev[i]);   //U
            Result[8][i] = sin(2*fAzim[i])*cos(fElev[i])*cos(fElev[i]);   //V
        }
    }
    //Return pointer to a two-dimensional array
    return (Result);
}

Table 6.3 Function used to calculate a speaker's Cartesian co-ordinates, which are used in the Ambisonic decoding equations (Alloc2D is a helper that allocates a two-dimensional float array).

If the coefficients calculated in Table 6.3 are used directly then each speaker will have a cardioid response, meaning that no out-of-phase material is produced from any of the speakers (assuming a perfect, non-reverberant, B-format input captured from a perfect point source). However, it has been shown (see Chapter 5) that it can be beneficial to alter this polar response in order to make the decoder more psychoacoustically correct at different frequencies. For this, the equation shown in Equation (6.3), discussed in Chapters 3 & 5, can be used for the final decoding.
S = 0.5 × [(2 − d) g_w W + d (g_x X + g_y Y + g_z Z)]      (6.3)

where:
g_x, g_y, g_z and g_w are the speaker coefficients calculated using Table 6.3.
d is the pattern selector coefficient (from 0 – 2, omni – figure-of-eight).

As can be seen from Equation (6.3), it is a simple matter to include this equation in the final decoding function, as it only involves a few extra multiplies per speaker and does not use any computationally expensive sine or cosine functions. However, the decoding function is complicated slightly as a cross-over needs to be implemented using the fast convolution function given in section 6.2.2 (although, strictly speaking, only phase-aligned 'shelving' filters are actually needed, the cross-over technique using FIR filters can be used for both regular and irregular decoders, whereas the shelving filters can only be used for regular decoders). A function for carrying out an Ambisonic cross-over is shown in Table 6.4.

#define BLen 2049

float WOldLP[BLen],XOldLP[BLen],YOldLP[BLen];   //etc.
float WOldHP[BLen],XOldHP[BLen],YOldHP[BLen];   //etc.

void AmbiXOver(Ambi *Source, Ambi *Dest, SCplx *LP, SCplx *HP, const int order)
{
    //This example takes Source as the source, stores the LP
    //signal in Source and the HP signal in Dest, and takes LP and HP
    //as the frequency domain filter coefficients.
    //The original filters must be one sample less in length than
    //the buffer size.
    const int Len = Source->Length;

    //Copy samples into the destination (HP) structure
    memcpy(Dest->W,Source->W,Source->Length*4);
    memcpy(Dest->X,Source->X,Source->Length*4);
    memcpy(Dest->Y,Source->Y,Source->Length*4);
    memcpy(Dest->Z,Source->Z,Source->Length*4);
    if(Source->Order)
    {
        memcpy(Dest->R,Source->R,Source->Length*4);
        memcpy(Dest->S,Source->S,Source->Length*4);
        memcpy(Dest->T,Source->T,Source->Length*4);
        memcpy(Dest->U,Source->U,Source->Length*4);
        memcpy(Dest->V,Source->V,Source->Length*4);

        //Do second order low pass
        OverAddFir(Source->R,LP,Len,Len-1,order,ROldLP);
        OverAddFir(Source->S,LP,Len,Len-1,order,SOldLP);
        OverAddFir(Source->T,LP,Len,Len-1,order,TOldLP);
        OverAddFir(Source->U,LP,Len,Len-1,order,UOldLP);
        OverAddFir(Source->V,LP,Len,Len-1,order,VOldLP);
        //Do second order high pass
        OverAddFir(Dest->R,HP,Len,Len-1,order,ROldHP);
        OverAddFir(Dest->S,HP,Len,Len-1,order,SOldHP);
        OverAddFir(Dest->T,HP,Len,Len-1,order,TOldHP);
        OverAddFir(Dest->U,HP,Len,Len-1,order,UOldHP);
        OverAddFir(Dest->V,HP,Len,Len-1,order,VOldHP);
    }
    //Do first order low pass
    OverAddFir(Source->W,LP,Len,Len-1,order,WOldLP);
    OverAddFir(Source->X,LP,Len,Len-1,order,XOldLP);
    OverAddFir(Source->Y,LP,Len,Len-1,order,YOldLP);
    OverAddFir(Source->Z,LP,Len,Len-1,order,ZOldLP);
    //Do first order high pass
    OverAddFir(Dest->W,HP,Len,Len-1,order,WOldHP);
    OverAddFir(Dest->X,HP,Len,Len-1,order,XOldHP);
    OverAddFir(Dest->Y,HP,Len,Len-1,order,YOldHP);
    OverAddFir(Dest->Z,HP,Len,Len-1,order,ZOldHP);
}

Table 6.4 Ambisonic cross-over function.

This is the comprehensive version of this function, but it can be changed depending on the application. For example, the 2nd order checking and the Z signal processing can be removed for a 1st order, horizontal only, application, as this will save some processing time. Now that the cross-over function has been given, a regular decoding function can be developed, and is shown in Table 6.5.
void B2SpeakersReg(Ambi *Signal, float **Samples, float **Sp, int NoOfSpeakers,
                   int NoOfChannels, float LPPattern, float HPPattern)
{
    static float WGainLP,XGainLP,YGainLP,ZGainLP;
    static float WGainHP,XGainHP,YGainHP,ZGainHP;

    //Do XOver using the global Ambi variable Signal2
    AmbiXOver(Signal, Signal2, LPCoefs, HPCoefs, Signal->Order);

    //Loop check for both the number of speakers and the number of
    //channels available on the system, for testing on systems with
    //only a stereo sound card available
    for(int j=0;j<NoOfSpeakers && j<NoOfChannels;j++)
    {
        //Take the pattern calculations out of the sample loop -
        //calculate only once for each speaker per buffer.
        WGainLP = 0.5f * (2-LPPattern) * Sp[0][j];
        WGainHP = 0.5f * (2-HPPattern) * Sp[0][j];
        XGainLP = 0.5f * LPPattern * Sp[1][j];
        XGainHP = 0.5f * HPPattern * Sp[1][j];
        YGainLP = 0.5f * LPPattern * Sp[2][j];
        YGainHP = 0.5f * HPPattern * Sp[2][j];
        ZGainLP = 0.5f * LPPattern * Sp[3][j];
        ZGainHP = 0.5f * HPPattern * Sp[3][j];
        for(int i=0;i<Signal->Length;i++)
        {
            //Low frequency pattern adjustment and decode
            Samples[j][i]  = WGainLP * Signal->W[i]  + XGainLP * Signal->X[i]
                           + YGainLP * Signal->Y[i]  + ZGainLP * Signal->Z[i];
            //Add the high frequency pattern adjustment and decode
            Samples[j][i] += WGainHP * Signal2->W[i] + XGainHP * Signal2->X[i]
                           + YGainHP * Signal2->Y[i] + ZGainHP * Signal2->Z[i];
        }
    }
}

Table 6.5 Function used to decode an Ambisonic signal to a regular array.

For simplicity, Table 6.5 shows only a first order example, but this function could easily be extended to include second order functionality. The two-dimensional 'Samples' array is now ready to be re-interlaced and passed back to the sound card for output. When it comes to the decoding of an irregular array, two approaches can be taken:

• Let each speaker (or speaker pair) have a user-definable pattern, decoding angle and level.
• Have each speaker use decoding coefficients directly. That is, they are supplied after the pattern, decoding angle and level have been taken into account.

Both of these methods are acceptable, with the first being most suited to optimising a decoder by ear and the second being most suited to using coefficients calculated using the heuristic HRTF decoding program described in Chapter 5. The latter will be slightly more efficient (although the program used to pre-calculate the coefficients could be changed to output the pattern, angle and level instead of the decoding coefficients directly). As all of the coefficients used for decoding to irregular arrays were calculated off-line in this project (using the Tabu search algorithm described in Chapter 5), the second approach was used. The code used for this irregular decoder function is shown in Table 6.6.
void B2SpeakerIrreg(Ambi *Signal, float **Samples, float **SpL, float **SpH,
                    int NoOfSpeakers, int NoOfChannels)
{
    static float WGainLP,XGainLP,YGainLP,ZGainLP;
    static float WGainHP,XGainHP,YGainHP,ZGainHP;

    //Do XOver using the global Ambi variable Signal2
    AmbiXOver(Signal, Signal2, LPCoefs, HPCoefs, Signal->Order);

    for (int j=0;j<NoOfSpeakers && j<NoOfChannels;j++)
    {
        //Use the SpL & SpH decoding coefficients directly
        WGainLP = SpL[0][j];   WGainHP = SpH[0][j];
        XGainLP = SpL[1][j];   XGainHP = SpH[1][j];
        YGainLP = SpL[2][j];   YGainHP = SpH[2][j];
        ZGainLP = SpL[3][j];   ZGainHP = SpH[3][j];
        for (int i=0;i<Signal->Length;i++)
        {
            //Low frequency decode
            Samples[j][i]  = WGainLP * Signal->W[i]  + XGainLP * Signal->X[i]
                           + YGainLP * Signal->Y[i]  + ZGainLP * Signal->Z[i];
            //Add the high frequency decode
            Samples[j][i] += WGainHP * Signal2->W[i] + XGainHP * Signal2->X[i]
                           + YGainHP * Signal2->Y[i] + ZGainHP * Signal2->Z[i];
        }
    }
}

Table 6.6 Function used to decode an Ambisonic signal to an irregular array.

This function is very similar to the one shown in Table 6.5, except that two separate sets of speaker coefficients must be provided, since they are potentially very different (not just different in polar pattern, as in a regular speaker array). The multi-speaker decode given above is possibly the most complex form of decoding, as the other types (transaural multi-speaker and headphone) are based upon binaural technology and, to this end, will only need to be set up once for optimal reproduction.

As discussed in Chapter 4, in order to reproduce an Ambisonic system binaurally the separate speaker coefficients can be represented as a set of HRTFs, with one HRTF for each of the Ambisonic signals (that is, W, X, Y etc.), or two if the rig-room-head combination is not taken to be left/right symmetrical. So, for example, a second order, horizontal only decode would be replayed binaurally using the equation shown in Equation (6.4).

Left  = (W ⊗ W_hrtf) + (X ⊗ X_hrtf) + (Y ⊗ Y_hrtf) + (U ⊗ U_hrtf) + (V ⊗ V_hrtf)
Right = (W ⊗ W_hrtf) + (X ⊗ X_hrtf) − (Y ⊗ Y_hrtf) + (U ⊗ U_hrtf) − (V ⊗ V_hrtf)      (6.4)

where:
W, X, Y, U & V are the Ambisonic signals.
The hrtf subscript denotes the HRTF filter response for a particular channel.
⊗ denotes convolution.

What is possibly not apparent on first inspection is that, when compared to an optimised speaker decode, a binaural simulation of an Ambisonic decoder actually requires fewer convolutions if left/right symmetry is assumed (half as many, in fact) and the same number of convolutions if left/right symmetry is not assumed. This is because both the cross-overs and the differing levels/polar patterns can be taken into account at the design time of the Ambisonic signal filters. A function used to decode a horizontal 1st order Ambisonic signal is shown in Table 6.7.

#define BLen 2049
#define Order 12    //FFT length 2^12 = 4096

float WOld[BLen],XOld[BLen],YOld[BLen];

//Function assumes the impulse length is 1 sample less than
//the buffer length (i.e. 2048)
void B2Headphones(Ambi *Signal, float **Samples, SCplx *WFilt, SCplx *XFilt,
                  SCplx *YFilt, int NoOfChannels)
{
    const int Len = Signal->Length;

    OverAddFir(Signal->W,WFilt,Len,Len-1,Order,WOld);
    OverAddFir(Signal->X,XFilt,Len,Len-1,Order,XOld);
    OverAddFir(Signal->Y,YFilt,Len,Len-1,Order,YOld);

    for(int i=0;i<Len;i++)
    {
        //Left signal
        Samples[0][i] = Signal->W[i] + Signal->X[i] + Signal->Y[i];
        //Right signal
        Samples[1][i] = Signal->W[i] + Signal->X[i] - Signal->Y[i];
    }
    //If more than two channels were inputted and are to be
    //outputted (i.e. a B-format signal was taken in from a live
    //input) then the other channels must be cleared.
    for(int i=2;i<NoOfChannels;i++)
    {
        for(int j=0;j<Len;j++)
            Samples[i][j] = 0;
    }
}

Table 6.7 Function used to decode a horizontal only, 1st order, Ambisonic signal to headphones.

From the B2Headphones function given above, it is easy to see how this function can be extended to a two-speaker transaural representation. The block diagram for a two-speaker transaural reproduction is given in Figure 6.7.

Figure 6.7 A two-speaker transaural reproduction system (the left and right ear signals are each filtered by H1, cross-fed through H2 filters, and the results summed to form the left and right speaker feeds).

The method for calculating and optimising the filters needed for this arrangement was discussed in Chapter 5. For the four-speaker version of the crosstalk cancellation, not only does the above algorithm (shown in Figure 6.7) need to be run twice, but four signals must also be provided (front left and right, and rear left and right ear signals). These can be calculated using a system very similar to the one shown in Equation (6.4), except that the front left and right HRTF filters (for the conversion to binaural) will be calculated using only the gains from the front speakers, and the rear left and right HRTFs will be calculated using the gains from the rear speakers. Example sets of HRTFs for this purpose are shown in Figure 6.8 (simple, cardioid decoding, with no cross-over filtering present). These graphs show that, although the decoder is not taken as a whole, as long as the front and rear portions of the speaker rig are left/right symmetric, the same binaural simplification can be used, where only one HRTF is needed for each of the Ambisonic channels. A block diagram of this four-channel crosstalk cancellation system is shown in Figure 6.9. The coding for this section is an extension of the B2Headphones function given in Table 6.7, with an extra call to the transaural function given in Table 6.8.

Figure 6.8 Bank of HRTFs used for a four-channel binauralisation of an Ambisonic signal.

Figure 6.9 Block diagram of a four-speaker crosstalk cancellation system (the W, X and Y signals feed two HRTF simulation blocks of 3 FIRs each; these feed front and rear crosstalk cancellation blocks of 4 FIRs each, which in turn feed the front left/right and rear left/right speakers).

#define BLen 2049

//Flag that selects between 2 and 4 speaker
//transaural reproduction.
bool Trans4;

float FLOld[BLen],FROld[BLen],FLCOld[BLen],FRCOld[BLen];
float RLOld[BLen],RROld[BLen],RLCOld[BLen],RRCOld[BLen];

void BToTrans(float **Samples, SCplx *h1, SCplx *h2, const int BufferLength,
              const int NoOfChannels)
{
    //Samples should be housing up to four channels:
    //front left, front right, back left and back right binaural signals.
    static float FLCopy[BLen];
    static float FRCopy[BLen];

    memcpy(FLCopy,Samples[0],BufferLength*4);
    memcpy(FRCopy,Samples[1],BufferLength*4);

    int ChUsed=2;

    //Do 2 speaker transaural
    OverAddFir(Samples[0],h1,BufferLength,BufferLength-1,Order,FLOld);
    OverAddFir(Samples[1],h1,BufferLength,BufferLength-1,Order,FROld);
    OverAddFir(FLCopy,h2,BufferLength,BufferLength-1,Order,FLCOld);
    OverAddFir(FRCopy,h2,BufferLength,BufferLength-1,Order,FRCOld);

    float FL,FR;
    for (int i=0;i<BufferLength;i++)
    {
        FL = Samples[0][i];
        FR = Samples[1][i];
        Samples[0][i] = FL + FRCopy[i];
        Samples[1][i] = FR + FLCopy[i];
    }

    //Do 4 speaker transaural if the flag is set
    if(Trans4 && NoOfChannels>=4)
    {
        static float RLCopy[BLen];
        static float RRCopy[BLen];

        memcpy(RLCopy,Samples[2],BufferLength*4);
        memcpy(RRCopy,Samples[3],BufferLength*4);

        OverAddFir(Samples[2],h1,BufferLength,BufferLength-1,Order,RLOld);
        OverAddFir(Samples[3],h1,BufferLength,BufferLength-1,Order,RROld);
        OverAddFir(RLCopy,h2,BufferLength,BufferLength-1,Order,RLCOld);
        OverAddFir(RRCopy,h2,BufferLength,BufferLength-1,Order,RRCOld);

        float RL,RR;
        for (int i=0;i<BufferLength;i++)
        {
            RL = Samples[2][i];
            RR = Samples[3][i];
            Samples[2][i] = RL + RRCopy[i];
            Samples[3][i] = RR + RLCopy[i];
        }
        ChUsed=4;
    }

    //Clear the other output channels, ready for outputting
    for(int i=ChUsed;i<NoOfChannels;i++)
    {
        for(int j=0;j<BufferLength;j++)
            Samples[i][j] = 0;
    }
}

Table 6.8 Code used for 2 and 4 speaker transaural reproduction.

6.3 Implementation - Platform Specifics

All of the algorithmic work discussed so far in this project has been platform independent, that is, all of the functions could be implemented on any platform that supports floating point operations and standard C. However, there has to come a point where a specific platform must be chosen, and then more specialised functions are usually needed, depending on the hardware and operating system used. In this project the Microsoft Windows™ operating system was used, which possesses a number of APIs for interfacing with the sound system:

• Waveform Audio (the Windows multimedia system).
• DirectSound (part of the DirectX API).
• ASIO (Steinberg's sound API).

The system used in this project was the standard Waveform Audio system. There were a number of reasons for this:

• Waveform Audio had easy support for multi-channel sound.
• All Windows-compatible sound cards had good support for this API.

Although information about the Waveform Audio API is reasonably widespread (for example, see Kientzle (1997) and Petzold (1998), Chapter 22), none of it gives a comprehensive guide to setting up a software engine for signal processing (that is, capturing some audio live or from wave files, processing it, and outputting the processed audio). For this reason, this section of the report will give an in-depth summary of how the software used in this project was structured and implemented, so that it can be used as a starting reference for further research to be carried out.

So, what is the Waveform Audio API? The Waveform Audio API is a layer of functions that sits between the programmer and the sound card. This means that the function calls necessary to set up and successfully run an audio application will be the same no matter what make or model of sound card the computer possesses. In this system the input and the output ports of the soundcard work seemingly independently, and so each must be taken as a separate entity and programmed for accordingly. For example, just because the output device has been set up as a 44.1 kHz, 16-bit sample stream, this does not mean that the input device will automatically take these settings when it is started.
Any device activated (be it input or output) using the Waveform Audio API must have a number of parameters set and structures available for use. Firstly, let us examine the parameters that must be set before an output device can be started:

• Data type (for example, fixed or floating point).
• Number of channels (for example, 1 – mono, 2 – stereo, 4, 8).
• Sample rate in Hz (for example, 44100 or 48000).
• Bits per sample (for example, 8, 16).
• Block align – the alignment of the samples in memory (i.e. the size of the data for one sample for all of the channels, in bytes).
• Average bytes per second.
• Buffer size in bytes.

Using all of the above data, the Waveform Audio API is almost ready to set up the input/output devices; however, let us first look at the block diagram of the waveform audio system, as shown in Figure 6.10.

Figure 6.10 Waveform audio block diagram – wave out (WAVEHDR buffers of processed samples are sent to the soundcard, which issues a 'ready for samples' message as each buffer is finished with).

As can be seen from this diagram, the soundcard actually informs the program when it has finished with the last buffer and is ready for the next one. This is because Windows is a message based operating system. That is, the application either passes messages to, or waits to receive messages from, the Windows operating system. These work in much the same way as software interrupts on a D.S.P. device, and mean that the application does not have to run in a loop, but must process and send the appropriate messages in order to keep the program running. A WAVEHDR is a structure that represents a buffer of audio samples, along with a few other parameters. A WAVEHDR is arranged as shown in Table 6.9.

/* wave data block header */
typedef struct wavehdr_tag {
    LPSTR  lpData;                    /* pointer to locked data buffer */
    DWORD  dwBufferLength;            /* length of data buffer         */
    DWORD  dwBytesRecorded;           /* used for input only           */
    DWORD  dwUser;                    /* for client's use              */
    DWORD  dwFlags;                   /* assorted flags (see defines)  */
    DWORD  dwLoops;                   /* loop control counter          */
    struct wavehdr_tag FAR *lpNext;   /* reserved for driver           */
    DWORD  reserved;                  /* reserved for driver           */
} WAVEHDR, *PWAVEHDR, NEAR *NPWAVEHDR, FAR *LPWAVEHDR;

Table 6.9 WAVEHDR structure.

Of all of the various parameters available from a WAVEHDR structure, only a few are of importance for this application. These are:

• lpData – pointer to an array of bytes used for the storage of samples.
• dwBufferLength – holds the length of the buffer (in bytes).
• dwFlags – holds flags signifying that the buffer is finished with, prepared, etc.

At least two of these wave headers need to be sent to either the input or output device in order for seamless audio to be heard or captured. If only one is used then an audible gap will be heard as the buffer is refilled and sent back to the device (in the case of an output device). However, as many buffers as desired can be sent to the device, which Windows will automatically store in a queue. The other major structure that is used by the Waveform Audio API is WAVEFORMATEX. This structure is used to hold nearly all of the data that must be presented to Windows in order to successfully open a device. The format of the WAVEFORMATEX structure is given in Table 6.10.

/*
 * extended waveform format structure used for all non-PCM formats.
 */
typedef struct tWAVEFORMATEX {
    WORD   wFormatTag;         /* format type */
    WORD   nChannels;          /* number of channels (i.e. mono, stereo...) */
    DWORD  nSamplesPerSec;     /* sample rate */
    DWORD  nAvgBytesPerSec;    /* for buffer estimation */
    WORD   nBlockAlign;        /* block size of data */
    WORD   wBitsPerSample;     /* number of bits per sample of mono data */
    WORD   cbSize;             /* the count in bytes of the size of
                                  extra information (after cbSize) */
} WAVEFORMATEX, *PWAVEFORMATEX, NEAR *NPWAVEFORMATEX, FAR *LPWAVEFORMATEX;

Table 6.10 WAVEFORMATEX structure.

As can be seen from the comments in Table 6.9 and Table 6.10, all of the necessary information is now potentially available for any device that is to be opened, be it an input or an output device. Various functions are used in the initialisation and running of a wave device, and the structures given in Table 6.9 and Table 6.10 are relied upon to provide the necessary information and memory allocation. Example code used to initialise a wave out device is shown in Table 6.11.

WAVEHDR      WOutHdr[2];
WAVEFORMATEX wf;
HWAVEOUT     hWaveOut;

void InitialiseWaveOut(unsigned int Device, unsigned short usNoOfChannels,
                       unsigned short usSRate, unsigned short usBLength)
{
    //Pass the WAVEFORMATEX structure the necessary data
    wf.wFormatTag      = WAVE_FORMAT_PCM;
    wf.nChannels       = usNoOfChannels;
    wf.nSamplesPerSec  = usSRate;
    wf.wBitsPerSample  = 16;
    wf.nBlockAlign     = wf.nChannels * wf.wBitsPerSample / 8;
    wf.nAvgBytesPerSec = wf.nSamplesPerSec * wf.nBlockAlign;
    wf.cbSize          = 0;

    if(Device==0)       //let Windows choose the device
        Device=WAVE_MAPPER;
    else                //else, use the specified device
        Device--;

    //Open the wave device, specifying the callback function
    //used to catch Windows messages from the device
    waveOutOpen(&hWaveOut,Device,&wf,(DWORD)WaveOutCallback,
                (DWORD)this,CALLBACK_FUNCTION);
    waveOutPause(hWaveOut);

    //Allocate memory for 2 buffers, and pass them to the wave device
    for(int i=0;i<2;i++)
    {
        WOutHdr[i].dwBufferLength = usBLength * wf.wBitsPerSample * wf.nChannels/8;
        WOutHdr[i].lpData  = new char[WOutHdr[i].dwBufferLength];
        WOutHdr[i].dwFlags = 0;
        WOutHdr[i].dwLoops = 0;
        waveOutPrepareHeader(hWaveOut,&WOutHdr[i],sizeof(WOutHdr[i]));
        waveOutWrite(hWaveOut,&WOutHdr[i],sizeof(WOutHdr[i]));
    }
    //Start the wave out device
    waveOutRestart(hWaveOut);
}
//-------------------------------------------------------------------
void CALLBACK WaveOutCallback(HWAVEOUT hwo, UINT uMsg, DWORD dwInstance,
                              DWORD dwParam1, DWORD dwParam2)
{
    switch(uMsg)
    {
        case WOM_DONE:
        {
            //If WOM_DONE, call the function used to fill the buffer.
            //The WAVEHDR buffer is passed in to the callback function
            //as dwParam1
            WaveOutFunc((WAVEHDR *)dwParam1);
            break;
        }
        default:
            break;
    }
}

Table 6.11 Initialisation code used to set up and start an output wave device.

As shown in Table 6.11, a call-back function must be specified in order to process the Windows messages that are passed by the waveform audio system. For the output device the most important message is WOM_DONE. This message is passed to the call-back function every time the wave out device has finished with a WAVEHDR buffer, at which point a function can be called that refills the buffer with processed samples using the processing techniques shown in section 6.2 (in this case, the WaveOutFunc function is called, and the WAVEHDR structure is passed with it). The wave in device is configured in much the same way by the Windows operating system, although it is interesting to note that the input and output devices are taken to be two separate devices.
To this end, no automatic connection between the two devices exists, and it is the programmer that must store the input samples and then pass them to the output device (this is, of course, assuming that both input and output devices have been initialised at the same sample rate, bit depth and channel count).

In Windows, many audio devices can be opened simultaneously, which is necessary as most multi-channel sound cards default to being configured as a number of stereo devices. However, for true multi-channel sound reproduction it is necessary to have a card that can be configured as one multi-channel device. This is due to the fact that Windows cannot open and start multiple devices at exactly the same time and, although some sound card manufacturers state that their drivers will synchronise multiple devices, this has not been found to be the case when using their standard wave drivers. This can potentially cause problems when using such a card to feed an array of speakers used for multi-channel surround sound, as the time alignment of the output channels is assumed to be perfect. Although this artefact is not readily noticeable, it is obviously more desirable to start with a system that is as theoretically perfect as possible, and so a single multi-channel device should be used, if possible. Having one multi-channel device also simplifies the processing, as multiple call-back functions are not used. This effect was discovered using the Matlab add-on, Simulink. The block arrangement used to document this feature is shown in Figure 6.11.

Figure 6.11 Simulink model used to measure inter-device delays.

This system was used to test the latency of various devices a number of times, and not only was the inter-device latency apparent, but it also changed between test runs. An example plot is shown in Figure 6.12, showing just four devices to make the graph more readable. Because this inter-device latency is variable, it is almost impossible to correct for, and so a single device should be used.

Figure 6.12 Graphical plot of the output from 4 audio devices using the Waveform Audio API (signal magnitude against time in samples at a 44.1 kHz sampling frequency, showing the differing start delays of devices 1 to 4).

In order to successfully close an audio device, a number of API calls must be made. This is shown (for the output device) in Table 6.12.

void CloseDevice(UINT Device)
{
    //Reset the wave device
    waveOutReset(hWaveOut);

    //Unprepare and delete the dynamic memory allocated for the WAVEHDRs
    for(UINT i=0;i<NoOfBuffers;i++)
    {
        waveOutUnprepareHeader(hWaveOut,&WaveHeadersOut[i],
                               sizeof(WaveHeadersOut[i]));
        if(WaveHeadersOut[i].lpData)
            delete [] WaveHeadersOut[i].lpData;
    }
    //Close the wave device
    waveOutClose(hWaveOut);
}

Table 6.12 Closing a wave device.

Both the opening and the closing of an input wave device are identical to those of an output wave device, with the only difference being the message passed to the call-back function. As all of this coding is Windows dependent (that is, it will never be needed for any other system), the wave device functions were encapsulated within a class. This meant that a basic 'pass-through' application could be coded that did no processing.
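A minimal sketch of the shape of this base class, and of how an application inherits from it, is given below. The class names and member signatures here are illustrative assumptions only; the full listing used in this project is given in the Appendix.

#include <windows.h>
#include <mmsystem.h>   // Waveform Audio API (link against winmm.lib)

// Base 'pass-through' class: owns the wave in/out devices, queues the input
// buffers and, by default, copies input samples straight to the output.
class WaveAudioBase
{
public:
    virtual ~WaveAudioBase() {}

    // Device handling, as in Tables 6.11 and 6.12 (bodies omitted here).
    virtual bool Initialise(unsigned int uiDevice, unsigned short usChannels,
                            unsigned long ulSampleRate, unsigned short usBufLen)
    { /* open and start the in/out devices */ return true; }
    virtual void Close() { /* unprepare buffers and close the devices */ }

    // Hooks invoked from the WIM_DATA / WOM_DONE callbacks.  A derived class
    // overrides these to carry out its own processing.
    virtual void ProcessIn (WAVEHDR *pIn,  unsigned short usChannels,
                            unsigned short usBufferLengthPerChannel) {}
    virtual void ProcessOut(WAVEHDR *pOut, unsigned short usChannels,
                            unsigned short usBufferLengthPerChannel) {}

protected:
    HWAVEIN  m_hWaveIn;
    HWAVEOUT m_hWaveOut;
    WAVEHDR  m_InHdr[2], m_OutHdr[2];   // double buffering, as in Table 6.11
};

// A new audio application then only has to redeclare the processing hook(s).
class AmbisonicDecoderApp : public WaveAudioBase
{
public:
    void ProcessOut(WAVEHDR *pOut, unsigned short usChannels,
                    unsigned short usBufferLengthPerChannel)
    {
        // de-interlace the queued input, decode (e.g. using B2SpeakersReg)
        // and re-interlace the result into pOut->lpData
    }
};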
A new class could then be created, inheriting from this first class, but with the processing functions redeclared, so that minimal extra coding is needed for every new sound processing application that is to be written. In order for this first class to be as flexible as possible, a signal processing function for both incoming and outgoing samples has been written. This means that the signal can be monitored (or processed) just after the input and just before the output of the audio to the soundcard. A block diagram of the structure of this class is shown in Figure 6.13.

Figure 6.13 Block diagram of the generic 'pass-through' audio template class (calls made from the application initialise the class, allocate memory for the WAVEHDRs, create the sample queue, and open and start the in/out devices; the WIM_DATA message invokes WaveInFunc, which calls ProcessIn and AddToQueue before returning the used buffer to the device; the WOM_DONE message invokes WaveOutFunc, which calls ProcessOut before returning the used buffer to the device; ProcessIn and ProcessOut are the functions to be overridden).

It can be seen from Figure 6.13 that, apart from the initialisation and opening of the audio devices, the whole of the audio subsystem is driven by messages. The WIM_DATA message, signalling that an audio buffer is ready for use (i.e. full), causes WaveInFunc to call a function that adds this audio data to a data queue. Then, when the WOM_DONE message has been received, signalling that an output buffer is ready to be filled again, the ProcessOut function is called; this is where the audio processing is carried out on the data at the end of the audio queue before it is passed to the empty output buffer. An example of the overridden processing function is shown in Table 6.13. Example code for the whole of this base class can be found in the Appendix.

void ProcessAudio(WAVEHDR *pWaveHeader, unsigned short usNoOfChannels,
                  unsigned short usBufferLengthPerChannel)
{
    //Output callback
    //Grab pointers to the in and out buffers
    short *inPtr  = (short *)ReadBuffer->lpData;
    short *outPtr = (short *)pWaveHeader->lpData;
    float yn;

    for(unsigned int i=0;i<usBufferLengthPerChannel*usNoOfChannels;
        i+=usNoOfChannels)
    {
        //Left channel
        yn = (float)inPtr[i];
        //Processing here
        outPtr[i] = (short)yn;

        //Right channel
        yn = (float)inPtr[i+1];
        //Processing here
        outPtr[i+1] = (short)yn;
    }
}

Table 6.13 Example implementation of the ProcessAudio function for a stereo application.

6.4 Example Application

Using the signal processing and wave API code given above, it is now a relatively simple task to build an example signal processing application. In this research project the programming environment of Borland C++ Builder was used (Borland Software Corporation, 2003). This environment has the advantage of drag and drop development of graphical user interfaces using standard Windows components, Borland's own components, or custom components based on one of Borland's component templates. This greatly simplifies the GUI creation process, meaning that working, flexible applications can be coded quickly, which in turn makes a powerful, high level language such as C++ a valuable signal processing prototyping tool. As stated above, applications written for the Windows operating system can be programmed using the C++ programming language. The object oriented approach lends itself well to audio programming, particularly when filtering is involved (which it generally is).
This is because for each signal that needs to - 234 - Chapter 6 be filtered, separate memory locations are needed for that particular signal’s feedback, feedforward, or delay line features. When coding filters in C it is the developer that must group all of this memory together, which can be cumbersome at times with different types of filters needing different memory requirements. For example, the fast convolution algorithm described in section 6.2.2 needs an additional amount of memory for each channel filtered. The size of this memory must be the same size as the FFT window size, that is, it must be larger than the size of the incoming signal. Once other types of filter are also introduced the subsequent memory requirement would soon become complicated and difficult to follow. This, on its own, is not a large problem, but means that all the memory requirements for a filter function must be clearly documented using comments, and strictly adhered to by the developer. However, in C++ a filter ‘object’ can be created. Inside this object, all the extra memory requirements can be hidden from the programmer with as many filter objects created as needed. This means that each filter object can be imagined as one filter device in a studio, operating on one audio stream. Initially, all the same memory requirements must be taken care of, but once implemented inside a C++ class this can then be used as a template where the developer only has access to, perhaps, a ‘processaudio’ function. A simple template for such a class is shown in Table 6.14. class AllPass { private: float fs,fc,alpha,*Buffer; float ff,fb,in,out; const int BufLen; void DoAllPass(float *signal, int iLen, float aval); public: AllPass(int iLen); ~AllPass(); void SetCutOff(float fcut, float fsam); void ProcessAudio(float *signal, float dBLP, float dBHP, bool dBdummy); void ProcessAudio(float *signal, float LinLP, float LinHP); }; Table 6.14 C++ Class definition file for an allpass based shelving equalisation unit. An object of type AllPass can now be initialised in the normal way in the application. However, due to the fact that the private variable BufLen, representing the length of an audio buffer, has been declared constant, this object must be initialised with an integer length (see the constructors and - 235 - Chapter 6 destructors, AllPass(int iLen) & ~AllPass()). This means that, unless the application has a fixed buffer length, the object must be declared dynamically at run time. Looking at this object definition file further it can be seen that the developer only has access to five functions; a constructor and a destructor that are called automatically when a new AllPass object is created or destroyed, a SetCutOff function, and two ProcessAudio functions. The latter have been created in order to give this class improved flexibility, with one function making use of linear gain values, and the other making use of dB gain values. As the same function names need to have some difference in their passed values, a dummy, unused variable has been included in one of the functions to indicate that dB gains are used. Also, it can be noted that all of the variables associated with this class are declared private, meaning that the calling object has no access to these variables, protecting them from potential wrong doing. All of these variables are updated, as needed, by the underlying code in the class, either at initialisation, or by a public member function. 
This ensures that the filter is secure and as intuitive to use as possible, with the developer only having access to the functions needed, and no more. This method was also used for the fast convolution filter, greatly simplifying the knowledge needed by the developer to use this function. The definition file is shown in Table 6.15. class FastFilter { private: int order,fftsize,siglen,implen; float *OldArray,*Signal,*tconv,*h; SCplx *fh,*fSig,*fconv; public: FastFilter(int FFTOrder,AnsiString *FName,int FLength); ~FastFilter(); void ReLoadFilter(AnsiString *FName,int FLength); void OverAddFir(float *signal); }; Table 6.15 C++ class definition file for the fast convolution algorithm Again, a system very similar to that shown in the AllPass filter class definition file can be seen. However, if the constructor of this class is shown, it can be - 236 - Chapter 6 seen how much work is taken away from the developer when using this class, as shown in Table 6.16. FastFilter::FastFilter(int FFTOrder,AnsiString *FName,int FLength) { order = FFTOrder; fftsize = pow(2,order); siglen = (fftsize/2) + 1; implen = fftsize/2; OldArray = new float[fftsize]; Signal = new float[fftsize]; tconv = new float[fftsize]; h = new float[fftsize]; fh = new SCplx[fftsize]; fSig = new SCplx[fftsize]; fconv = new SCplx[fftsize]; ReLoadFilter(FName,FLength); nspsRealFftNip(NULL,NULL,order,NSP_Init); nspsRealFftNip(h,fh,order,NSP_Forw); } Table 6.16 Constructor for the FastFilter class As is immediately evident, the memory requirements of this class are complicated, with a number of memory spaces of two variable types (representing data in both the time and frequency domain) needing to be dynamically created and destroyed when necessary. Also, the size of the coefficients used in FIR filters can be large, meaning that entering them into the code is unfeasible. So, this class actually takes in a filename that contains the list of numbers used in the filter, in single precision format. This means that the filters can be quickly designed and saved to a file format in Matlab, and then tested quickly using a C++ Windows application without the need for any changes in the code of the application, meaning that recompilation is not necessary. The Matlab code used to create these files and the C++ code used to read them are shown in Table 6.17 and Table 6.18 respectively. function count = savearray(array, fname); %save array to .dat file for reading in a c program % for example count = savearray(array,'c:\coefs.dat'); fid = fopen(fname, 'w'); count = fwrite(fid,array,'float'); fclose(fid); Table 6.17 Matlab function used to write FIR coefficients to a file. #include <fstream.h> void FastFilter::ReLoadFilter(AnsiString *FName,int FLength) { - 237 - Chapter 6 FILE *f; int c; memset(OldArray,0,sizeof(float)*fftsize); memset(Signal,0,sizeof(float)*fftsize); memset(tconv,0,sizeof(float)*fftsize); memset(h,0,sizeof(float)*fftsize); memset(fh,0,sizeof(SCplx)*fftsize); memset(fSig,0,sizeof(SCplx)*fftsize); memset(fconv,0,sizeof(SCplx)*fftsize); f = fopen(FName->c_str(),"rb"); if(f) { c = fread(h,sizeof(float),FLength,f); if(c!=FLength) MessageBox(NULL,"Filter Length Error", "Filter Length Error", NULL); fclose(f); } else MessageBox(NULL,"Cannot open file", "Cannot open file", NULL); } Table 6.18 C++ code used to read in the FIR coefficients from a file. Now the main signal processing classes have been constructed, the application can be designed. 
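As an indication of how little the calling code now has to manage, a short usage sketch of the FastFilter class is given below. The coefficient file names are illustrative assumptions (written by the Matlab savearray function of Table 6.17), and ABuf is the Ambi buffer used in the example application that follows; the F.F.T. order of 12 and filter length of 2048 simply match the constructor of Table 6.16.

// Create a filter for the W channel, reading its coefficients from disk.
AnsiString sCoefs44 = "w_hrtf_44100.dat";                     // example file name
FastFilter *pWFilter = new FastFilter(12, &sCoefs44, 2048);   // 2^12 = 4096 point FFT

// In the audio processing function, filter one buffer of W samples in place:
pWFilter->OverAddFir(ABuf->W);

// If the sample rate changes, the matching coefficient set is simply reloaded
// from disk, with no recompilation of the application needed:
AnsiString sCoefs48 = "w_hrtf_48000.dat";                     // example file name
pWFilter->ReLoadFilter(&sCoefs48, 2048);

delete pWFilter;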
This example application was designed to test a number of the optimisation techniques discussed in Chapter 5. However, the irregular Ambisonic array testing was carried out in Simulink, and is not implemented in this application in order to keep things a little simpler. It will be capable of taking in a first order B-format signal (comprised of four wave files, as this is how most of our B-format material is archived), or one mono wave file for panning into a B-format signal. If a mono source is used, then this can be panned using a rotary dial, and if a B-format signal is used, then the sound field can be rotated using a rotary dial. The user is able to choose from four different decoding methods:

• Optimised eight speaker regular Ambisonics (using the allpass filters described above).
• Ambisonics to binaural transform (based on an eight speaker array).
• Ambisonics to two speaker transaural with speaker placements at:
  o +/- 3°
  o +/- 5°
  o +/- 10°
  o +/- 20°
  o +/- 30°
• Ambisonics to four speaker transaural with front speaker placements as above, and rear speaker placements at:
  o +/- 5°
  o +/- 10°
  o +/- 20°
  o +/- 30°
  o +/- 70°

In addition to these modes of reproduction, a source from the line input can also be used so that the transaural filters (two speaker algorithm) can be tested with CD material (both binaural and normal stereo). In order to utilise all of the transforms discussed above, a total of fifty-six filters must be made available to the application, as there must be two versions of each filter: one sampled at 44.1 kHz and another sampled at 48 kHz. This is another reason why writing these to separate data files saves time and programming effort. To facilitate the above formats, a GUI was constructed as shown in Figure 6.14. All of the controls used are standard Windows controls, apart from the two rotary controls used for altering the mono source panning and B-format rotation. The code for the creation of the rotary controls will not be discussed here, but can be found in the Appendix.

Figure 6.14 Screen shot of simple audio processing application GUI.

In the audio subsystem class, there are two main tasks to be carried out:

• Initialisation/deinitialisation of filter structures and graphical oscilloscope.
• Process audio function.

In order to avoid storing fifty-six FIR filters in memory at once (and, for that matter, having to manage fifty-six FIR filter structures in the program code), only the filters currently available for use will be stored in memory. These are:

• 3 Allpass filters for the eight speaker Ambisonic decoder.
• 3 FIR filters for Ambi to two ear binaural processing.
• 6 FIR filters for Ambi to four ear binaural processing.
• 4 FIR filters for binaural to two speaker transaural processing.
• 4 FIR filters for binaural to four speaker transaural processing (8 used in this algorithm in total).

It is only the crosstalk cancellation filters that need to be updated in real time and so, in order to facilitate this, the GUI sets a flag to true whenever a filter needs changing (that is, when the transfilter and rear filter radio boxes are changed). The audio subsystem checks this flag at the start of every audio buffer and, if set, reloads the appropriate filter from disk (a short sketch of this mechanism is given below). A block diagram of the audio function for this application is shown in Figure 6.15.

Figure 6.15 Block diagram of the application's audio processing function.
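The flag mechanism mentioned above is simple enough to sketch; the fragment below is only an illustration of the pattern (the GUI handler name is hypothetical), although the UpdateFilter, ChooseFilter and SampleRate names match those that appear in the decoding code of Table 6.19.

// Illustrative sketch of the GUI-to-audio handshake described above.
bool UpdateFilter = false;          // set by the GUI, cleared by the audio code
int  SampleRate   = 44100;          // current wave device sample rate

void ChooseFilter(int sampleRate);  // reloads the crosstalk filters from disk

void OnTransFilterChanged()         // hypothetical GUI event handler
{
    UpdateFilter = true;            // only request the change here...
}

void ProcessAudioBuffer(float **Samples, int NoOfSamples)
{
    if (UpdateFilter)               // ...and act on it at a safe point, that is,
    {                               // at the start of an audio buffer
        ChooseFilter(SampleRate);
        UpdateFilter = false;
    }
    // ... decode and filter the buffer as shown in Table 6.19 ...
}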
The audio processing function is simplified because all of the various processing algorithms are carried out in separate objects/functions, making the coding a simpler task, as each function can be taken in isolation. So, for the final section of coding needed for this example application, the decoder type switch statement and code is shown in Table 6.19.

switch(Window->m_effect)
{
    case 0: //8 Speaker Ambisonics
        WAP->ProcessAudio(ABuf->W,1.33,1.15);
        XAP->ProcessAudio(ABuf->X,1.33,1.15);
        YAP->ProcessAudio(ABuf->Y,1.33,1.15);
        B2Speakers(Decode,ABuf,Samples,usNoOfChannels,8,0);
        break;
    case 1: //Ambisonics to Binaural
        B2Headphones(ABuf,Samples,usNoOfChannels);
        break;
    case 2: //Ambisonics to Binaural to Transaural x 2
        if(UpdateFilter)
        {
            ChooseFilter(SampleRate);
            UpdateFilter = false;
        }
        B2Headphones(ABuf,Samples,usNoOfChannels);
        B2Trans(ABuf,Samples[0],Samples[1],
                usNoOfChannels,h1fl,h2fl,h1fr,h2fr);
        break;
    case 3: //Ambisonics to Binaural x 2 to Transaural x 4
        if(UpdateFilter)
        {
            ChooseFilter(SampleRate);
            UpdateFilter = false;
        }
        if(UpdateRearFilter)
        {
            ChooseRearFilter(SampleRate);
            UpdateRearFilter = false;
        }
        B2Headphones4(ABuf,BBuf,Samples,usNoOfChannels);
        B2Trans(ABuf,Samples[0],Samples[1],
                usNoOfChannels,h1fl,h2fl,h1fr,h2fr);
        if(usNoOfChannels>=4)
            B2Trans(ABuf,Samples[2],Samples[3],
                    usNoOfChannels,h1rl,h2rl,h1rr,h2rr);
        break;
    case 4: //Live input to Transaural x 2
        if(UpdateFilter)
        {
            ChooseFilter(SampleRate);
            UpdateFilter = false;
        }
        B2Trans(ABuf,Samples[0],Samples[1],
                usNoOfChannels,h1fl,h2fl,h1fr,h2fr);
        break;
    default: //if none of the above
        B2Speakers(Decode,ABuf,Samples,usNoOfChannels,8,0);
        break;
}

Table 6.19 Decoding switch statement in the example application

To look at the code in its entirety, this example application is given in the Appendix.

6.5 Conclusions

Writing the application in this modular fashion makes the potentially complex audio processing function much easier to manage and change, if necessary, and has resulted in a large library of functions and classes that can be used to create a working multi-channel audio application very quickly. Due to the use of the fast convolution algorithm, and the utilisation of the Intel Signal Processing Library (although Intel have now discontinued this, and the Intel Integrated Performance Primitives must be used instead (Intel, 2003b)), the implemented surround sound system will run on Intel Pentium II processors and faster, even when decoding to eight or more speakers. Most of the Ambisonic algorithmic testing was carried out in Matlab and Simulink, but regarding sound quality, the software libraries described in this Chapter work well and without audio glitches.
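As an indication of how compact the individual processing routines become, the fragment below sketches the general shape of a B2Speakers style decode for a regular, equally spaced horizontal array, where each speaker feed is simply a weighted sum of the W, X and Y signals. The function and parameter names are illustrative; this is not a listing of the project's own B2Speakers function.

#include <cmath>

// Simplified sketch of a first order decode to a regular horizontal array.
// W, X and Y are the B-format buffers, Samples the per-speaker output
// buffers, and gW/gXY whichever decoder gains are in use (so that optimised
// coefficients, such as those found in Chapter 5, can be substituted).
void DecodeToRegularArray(const float *W, const float *X, const float *Y,
                          float **Samples, int NoOfSamples,
                          int NoOfSpeakers, float gW, float gXY)
{
    const float TwoPi = 6.2831853f;
    for (int s = 0; s < NoOfSpeakers; ++s)
    {
        float azim = TwoPi * (float)s / (float)NoOfSpeakers; // equally spaced
        float cx = std::cos(azim), cy = std::sin(azim);
        for (int n = 0; n < NoOfSamples; ++n)
            Samples[s][n] = gW * W[n] + gXY * (cx * X[n] + cy * Y[n]);
    }
}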
It must also be noted that using custom C software was the only way to test and evaluate the Transaural or Binaural decoders in real-time due to the lack of a real-time (that is, frame based) overlap add convolution function in Simulink, so this software was invaluable in the rapid evaluation and testing of the crosstalk cancellation filters described in Chapter 5. - 243 - Chapter 7 Chapter 7 - Conclusions 7.1 Introduction This thesis has identified the following problems with the current state of surround sound systems (as described in Section 3.4): 1. Although Gerzon and Barton (1992) suggested a number of optimisation equations for use with irregular speaker arrangements, the equations are difficult to solve, and so no further research seems to have been carried out in this area. 2. At least four speakers must be used to decode a horizontal 1st order signal, and six speakers must be used to decode a horizontal 2nd order system and although the conversion to binaural has been done by McKeag & McGrath (1996) initially, and later by Noisternig et al. (2003), none of this work takes into account the correct presentation of the lateralisation parameters which has been addressed in point 1, above. 3. Only a handful of software utilities for the encoding and decoding of Ambisonic material are available (McGriffy, 2002), and no psychoacoustically correct decoding software for irregular arrays exists. These problems have been addressed in this research as follows: 1. A method of solving the equations given by Gerzon and Barton (1992) has been demonstrated that simplifies the design of Ambisonic decoders for irregular speaker arrangements using the velocity and energy vector criterion as described by Gerzon & Barton (1992) which also corrects the problem of low and high frequency decoder discrepancies as shown in section 5.3. 2. Also, a new method of HRTF analysis has been developed in order to differentiate between decoders designed using the method described in point 1, above. This data has then been utilised directly in the design of multi-channel decoders. This form of decoder is not strictly Ambisonic, as it does not conform to the Ambisonic definition as described by Gerzon & Barton (1998) and described in section 3.3.1, but will allow for the further optimisation of the B-Format decoding - 244 - Chapter 7 process than is possible using the original velocity/energy vector theory (i.e. more frequency bands can be used). 3. The use of B-format and higher order Ambisonic encoded signals as a carrier format for Binaural and Transaural reproduction systems has been demonstrated. The optimisation of both Binaural and Transaural techniques through the use of inverse filtering has been formulated, with the transaural reproduction technique benefiting particularly from this technique. Also, a new Ambisonic to four speaker Transaural decode has been formulated and discussed, although sound quality issues have hindered this work, possibly due to the HRTF set used in this research, and so work in this area is still ongoing. 4. Software utilities have been implemented for both the design of decoders for irregular speaker arrays, and the replaying of the Ambisonic carrier signal over: a. Headphones b. Two or four speaker Transaural c. Multi-speaker, optimised, Ambisonic arrays. The details of these achievements are discussed below. 7.2 Ambisonics Algorithm development This project has concentrated on the decoding of a hierarchical based surround sound format based on the Ambisonic system. 
The traditional method of analysing and optimising Ambisonic decoders is through the use of the energy and velocity vector theories. The algorithmic development in this report has, for the most part, been centred on the use of HRTF data in order to analyse and optimise the performance of the Ambisonic decoders directly. This form of analysis was shown, in Chapter 5, to give results that backed up the original energy and velocity vector theory.

Figure 7.1 Recommended loudspeaker layout, as specified by the ITU.

That is, if an Ambisonic decoder was optimised using the energy and velocity vectors, then this result also gave a very good match when analysed using the HRTF method. A number of interesting observations were made from this experiment:

• Although a standard ITU five speaker arrangement was used (as shown in Figure 7.1) in the analysis and optimisation stages, the velocity vector analysis gave a perfect low frequency match for the decoder, as shown in Figure 7.2. This was surprising as there is such a large speaker 'hole' at the rear of the rig.
• However, the HRTF analysis showed some error in the reproduction of the rear of the sound field, which seems to be a more realistic result, as demonstrated in Figure 7.3.

Figure 7.2 Low frequency (in red) and high frequency (in green) analysis of an optimised Ambisonic decode for the ITU five speaker layout.

Figure 7.3 A graph showing a real source's (in red) and a low frequency decoded source's (in blue) inter-aural time differences (in samples).

Also, a number of benefits were found due to the inherent increased flexibility of the HRTF analysis technique when compared to the analysis using the energy and velocity vectors. Using the HRTF technique, the effect of head movements could be analysed in a quantitative manner. This can prove invaluable when trying to differentiate between a number of potentially optimal sets of decoder coefficients, and significant differences can be observed. For example, see Figure 7.4, which shows a comparison between two sets of optimised decoder coefficients (using energy and velocity vector theory) and their analytical performance under head rotation. One prominent feature of Figure 7.4 can be seen if the low frequency time difference plots for a source at 0° are observed. The second coefficient set's response to head rotation shows that the time difference stays at roughly zero samples no matter what direction the listener is facing, indicating that the source is tracking with the listener. However, the first coefficient set's low frequency graph shows that the time difference of a source at 0° changes in the same way as a real source's would, that is, the source does not track with the listener and more correct cues are presented.

Figure 7.4 HRTF simulation of head movement using two sets of decoder coefficients (Coefficient Set 1 and Coefficient Set 2).

Such observed variations between different decoders' analytical performance can give more indication as to how well a decoder will perform than previous techniques allow.
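For reference, the velocity and energy vector calculation that underlies this analysis can be written in a few lines; the sketch below mirrors the calculation performed in the decoder optimisation program listed in the Appendix, with gains[i] being the (real) gain of speaker i for the source direction under test and azim[i] its azimuth in radians.

#include <cmath>

struct Vector2 { float x, y; };

// Velocity vector: weighted by the signed speaker gains (low frequency model).
Vector2 VelocityVector(const float *gains, const float *azim, int n)
{
    float P = 0, vx = 0, vy = 0;
    for (int i = 0; i < n; ++i)
    {
        P  += gains[i];
        vx += gains[i] * std::cos(azim[i]);
        vy += gains[i] * std::sin(azim[i]);
    }
    Vector2 v = { 0, 0 };
    if (P != 0.0f) { v.x = vx / P; v.y = vy / P; }  // ideally length 1, pointing at the source
    return v;
}

// Energy vector: weighted by the squared speaker gains (high frequency model).
Vector2 EnergyVector(const float *gains, const float *azim, int n)
{
    float E = 0, ex = 0, ey = 0;
    for (int i = 0; i < n; ++i)
    {
        float g2 = gains[i] * gains[i];
        E  += g2;
        ex += g2 * std::cos(azim[i]);
        ey += g2 * std::sin(azim[i]);
    }
    Vector2 e = { 0, 0 };
    if (E != 0.0f) { e.x = ex / E; e.y = ey / E; }
    return e;
}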
Although the Vienna decoding optimisation technique (using the velocity and energy vectors) was proposed in 1992, very few (if any) Vienna decoders have been calculated and used, mainly due to both the mathematical complexity of deriving decoder coefficients using this method and the fact that Gerzon's paper gave results for a speaker layout very different from the ITU standard, which was proposed after this paper's publication. To this end, software based on a Tabu search algorithm was developed that, once the five speaker positions were entered, would calculate optimised decoders automatically. This heuristic mechanism has proved a valuable tool, and once the program was written to optimise decoders using the Vienna equations, it could easily be adapted to use the HRTF method, both with and without head-turning considerations.

A limited set of formal listening tests has been carried out on a number of decoders optimised using the two techniques described above, as a precursor to further research in this area. Two tests were carried out:

1. Perceived localisation of a panned, dry, source.
2. Decoder preference when listening to an excerpt of a reverberant recording.

Although a very small test base was used, decoders optimised using both energy/velocity vectors and HRTF data directly, via the Tabu search algorithm, were shown to outperform the reference decoder in both tests. The best performing decoder in test 1 was an expected result, given that decoder's observed performance in the HRTF analysis. However, the decoder that was chosen unanimously as the preferred choice when auditioning pre-recorded material was not as easy to predict. Reasons for this may be:

1. The most accurate decoder may not be the one that actually sounds best when replaying pre-recorded material, and this will be material dependent.
2. It was noticed that the two best (analytically speaking) performing optimised decoders exhibited a slightly uncomfortable, in-head, sound when auditioned in the sweet-spot, which was not apparent with the preferred decoder. This effect disappeared when the listener moved slightly off-centre.

This result suggests that when designing decoders artistically, rather than for spatial accuracy, other parameters may need to be taken into account or be made available to the user so that intuitive control of the decoder can be carried out in order to alter the spatial attributes of the presentation (such as spaciousness and perceived depth, for example). Overall the tests were encouraging and showed that the Ambisonic technique can reproduce phantom images both to the side of and behind the listener. However, a much larger test base should be used to further test the new decoders, along with more source positions, due to the reasonably subtle differences between the decoders used in this test (especially for test 1).

It has also been shown how this software can be adapted to optimise higher order decoders for irregular arrays, as described by Craven (2003), and two decoders for such a system (using 4th order circular harmonics) are shown in Figure 7.5: one suggested by Craven (2003) and another optimised using the Tabu search methodology described above.

Figure 7.5 Energy and velocity vector analysis of two 4th order, frequency independent decoders for an ITU five speaker array ('Decoder optimised using Tabu Search' and 'Decoder proposed by Craven (2003)').

The Tabu search decoder's optimal performance with respect to low frequency vector length and high/low frequency matching of source position can be seen clearly.
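A heavily simplified sketch of the search loop such a program might use is given below; the neighbourhood, step size and tabu handling shown are only illustrative (the real program is more involved), and Fitness() stands for whichever error measure is in use, either the velocity/energy vector criterion or the HRTF derived one.

#include <vector>

// Simplified Tabu search sketch: 'coeffs' are the decoder coefficients being
// optimised and Fitness() returns the error of a candidate set (lower = better).
std::vector<float> TabuSearch(std::vector<float> coeffs,
                              float (*Fitness)(const std::vector<float>&),
                              int iterations, float step, int tabuLife)
{
    std::vector<int> tabu(coeffs.size(), 0);     // moves still on the tabu list
    std::vector<float> best = coeffs;
    float bestErr = Fitness(best);

    for (int it = 0; it < iterations; ++it)
    {
        for (size_t c = 0; c < coeffs.size(); ++c)
            if (tabu[c] > 0) --tabu[c];          // age the tabu list

        int bestMove = -1, bestDir = 0;
        float bestMoveErr = 0;

        // Examine the neighbourhood: each non-tabu coefficient moved up or down.
        for (size_t c = 0; c < coeffs.size(); ++c)
        {
            if (tabu[c] > 0) continue;
            for (int dir = -1; dir <= 1; dir += 2)
            {
                std::vector<float> trial = coeffs;
                trial[c] += dir * step;
                float err = Fitness(trial);
                if (bestMove < 0 || err < bestMoveErr)
                {
                    bestMove = (int)c; bestDir = dir; bestMoveErr = err;
                }
            }
        }
        if (bestMove < 0) break;                 // everything tabu: give up

        coeffs[bestMove] += bestDir * step;      // always take the best neighbour,
        tabu[bestMove] = tabuLife;               // even uphill, then forbid undoing it
        if (bestMoveErr < bestErr) { best = coeffs; bestErr = bestMoveErr; }
    }
    return best;                                 // best coefficient set seen overall
}

The key property, compared with a simple hill climber, is that the search always moves to the best available neighbour, even when it is worse than the current point, while the tabu list stops it from immediately undoing that move; this is what allows it to escape local minima and to find multiple candidate decoder solutions.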
7.2.1 Further Work

This project has raised a number of questions and results that require future work:

1. Altering the coefficients of decoders (i.e. their virtual microphone patterns) can drastically alter how reverberant a recording is perceived to be (as well as altering other spatial attributes). This is probably related to the amount of anti-phase components being reproduced from the speakers, but further work is needed before the relationship between more complex spatial attributes and decoder coefficients can be formulated.
2. The uncomfortable, 'in-head' perception reported by the listening test subjects when listening to pre-recorded material requires further work, which could be coupled with a study of how optimising a decoder affects its off-centre performance.
3. Altering the optimisation criterion to take into account off-centre positions could be investigated, to determine whether the sweet area of the system can be increased.
4. A study of the higher order decoders, such as the one proposed by Craven (2003), or decoders optimised using the Tabu search method described in section 5.3.4, should be carried out in order to evaluate what effect the higher order components have, and whether an upper limit, with respect to harmonic order, can be judged.

7.3 Binaural and Transaural Algorithm Development

7.3.1 B-format to Binaural Conversion

The main optimisation method employed in the decoding technologies based on binaural techniques is that of inverse filtering. This is needed for the HRTF set used in this report due to the noticeable colouration of the sound perceived when these HRTFs are used. The inverse filtering technique works well in improving the quality of these filters, while maintaining their performance, as the differences between the ears remain the same and the pinna filtering is likely to be incorrect when compared to that of a listener in any case (in fact, the likelihood of the pinna filtering being the same is extremely slim, if not impossible). However, the B-format HRTFs created (see Figure 7.6) do give the impression of a more spatial headphone reproduction, when compared to listening in conventional stereo, even though these are the anechoic forms of the filters. This is especially true when listening to sounds recorded in reverberant fields, as the ear/brain system will now receive more coherent cues than when mixing the B-format down to its stereo equivalent (which is based on mid and side microphone signals and relies on the crosstalk between the ears, which is destroyed when using headphones – see section 3.2.2 on Blumlein Stereo for more details).

Two recordings have been obtained from the company Serendipity (2000), where recordings of the musicians were made in Lincoln Cathedral using both a SoundField microphone and a binaural, in-ear system, simultaneously. Although the binaural recording was not from the same position (it was carried out by Dallas Simpson, a binaural sound artist, who tends to move around often during recordings for artistic effect), a qualitative comparison of the spatial qualities of the two recordings could be made over headphones.

Figure 7.6 B-format HRTF filters used for conversion from B-format to binaural decoder.

This comparison confirmed that the B-format to binaural system seems to perform favourably when compared to the plain binaural system, although good out of head effects are still difficult to achieve with both recordings. This is not due to algorithmic errors, but to the fact that the ear/brain system is not receiving enough coherent cues, and it is interesting as the work by Lake (McKeag & McGrath, 1997) has shown that out of head images are possible using headphones alone. However, they do restrict themselves to recording the impulses of 'good' listening rooms for this purpose, with their large hall impulse responses seeming no more out-of-head than their smaller room impulses (Lake DSP, 1997).
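As a rough illustration of one way such B-format HRTFs can be arrived at, the sketch below combines the head related impulse responses of a set of virtual speakers into single W, X and Y filters for one ear, using the virtual speaker decoding gains; the function and parameter names are illustrative only, and the project's own filters were built from the specific HRTF set and virtual array discussed in Chapter 5.

#include <cmath>
#include <vector>

// Illustrative sketch: combine the HRIRs of N virtual speakers into a single
// set of W/X/Y filters for one ear. hrir[i] is speaker i's HRIR for this ear,
// azim[i] its azimuth in radians, gW and gXY the virtual decoder gains.
void MakeBFormatHrirs(const std::vector< std::vector<float> > &hrir,
                      const std::vector<float> &azim,
                      float gW, float gXY,
                      std::vector<float> &hW,
                      std::vector<float> &hX,
                      std::vector<float> &hY)
{
    const size_t len = hrir[0].size();   // assumes at least one virtual speaker
    hW.assign(len, 0.0f);
    hX.assign(len, 0.0f);
    hY.assign(len, 0.0f);
    for (size_t i = 0; i < hrir.size(); ++i)
        for (size_t n = 0; n < len; ++n)
        {
            hW[n] += gW * hrir[i][n];
            hX[n] += gXY * std::cos(azim[i]) * hrir[i][n];
            hY[n] += gXY * std::sin(azim[i]) * hrir[i][n];
        }
    // The ear signal is then W*hW + X*hX + Y*hY (three convolutions per ear),
    // however many virtual speakers were simulated.
}

For a left/right symmetric virtual array the other ear can reuse the same three filters with the Y term negated, which is consistent with the three FIR filters per decode listed for the example application in Chapter 6.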
7.3.2 Binaural to Two Speaker Transaural

Once the B-format to binaural transform has been executed, the resulting two channels can then be played over a transaural reproduction system, employing the filter design techniques outlined and discussed in Chapter 5. The inverse filtered crosstalk cancellation filters perform better when auditioning standard binaural material than when auditioning binauralised B-format material, with colouration of the sound being noticeable when replaying B-format in this way, although the colouration is not noticeable when auditioning either the B-format to binaural, or the binaural to crosstalk cancelled, material in isolation. As mentioned in Chapter 5, pinna errors seem to worsen the system's accuracy and, to this end, the Ambiophonics system employs a pinna-less dummy head in the calculation of the inverse filters for the crosstalk cancellation, and in the recording of the event itself (Glasgal, 2001).

7.3.3 Binaural to Four Speaker Transaural

The binaural to four speaker transaural system has an interesting effect. The testing of this system has mainly been on the front and rear pairs of a standard 5.1 setup, as this speaker array is readily available for quick testing (that is, speakers at +/- 30° and +/- 110°). The B-format to four speaker binaural filters are shown in Figure 7.7, where an overall level difference can be seen between the two sets of filters. This is due to the front decode containing the combined response of five speakers and the rear decode containing only the combined response of three, which is due to the virtual speakers at +/- 90° being assigned to the front hemisphere decoder (a regular eight speaker array was simulated). When carrying out A/B comparisons between the two speaker and four speaker systems (noting that the sound colouration problems mentioned above are still present), a number of points are noticeable:

• The four speaker crosstalk cancelled decode produces images further away from the listener.
• The four speaker decode also has a more open, surrounding sound (as one would expect from adding the rear speakers).
• The localisation seems slightly clearer and more precise (although this seems to be a little dependent on the type of material used in testing).

Figure 7.7 B-format HRTF filters used for conversion from B-format to binaural decoder.
Much of this is probably due to the increase in localisation cue consistency associated with splitting the front and rear portions of the decode and reproducing each from the correct portion of the listening room (that is, the rear speaker feeds come from behind and the front portion of the decode comes from in front), although the 'moving back' of the material is an interesting effect: it is not yet certain whether it is a 'moving back' of the sound stage or a more realistic sense of depth that is being perceived. It must also be noted that this effect only occurs when the rear speakers are engaged. That is, it is not noticed when just changing the front pair of speakers' filters from five to eight speaker virtual decodes, meaning that it is not due to the 'folding back' of the rear speakers into the frontal hemisphere in the two speaker, eight virtual speaker, decode.

It should also be noted that, because the Ambisonic system is designed so that the sum of the speaker outputs at the ears of the listener (in the centre of the array) produces the correct psychoacoustic cues (as far as is possible), it is particularly suited to the binaural/transaural playback system, as this should make the system less dependent on the quality of the actual speaker simulation. This is in contrast to the simulation of the five speakers of the 5.1 system over headphones (such as the Lake developed Dolby Headphones system (Lake DSP, 1997)).

One other promising feature of the four speaker crosstalk cancellation system is that if the speaker span described above is used (+/- 30° and +/- 110°), although the most 'correct' listening experience is found in the middle of the rig, the system still produces imaging outside of this area. This is in contrast to the single +/- 3° speaker placement which, although possessing very good imaging in the sweet area, has virtually no imaging off this line. This would make the four speaker setup more desirable for home use, where other listeners could still get a reasonable approximation to the sound field, but with the central listener experiencing an improved version. However, it must also be noted that, as mentioned in Chapter 5, the virtual imaging of the filters created for +/- 30° is not as accurate as that of the filters created for a smaller span (such as +/- 3°), although its frequency response does not lack (or boost, depending on the level of inverse filtering used) lower frequencies as much.

7.3.4 Further Work

A number of optimisations have been suggested for the crosstalk cancellation system, where much less work has been carried out when compared to standard binaural audio reproduction systems, mostly striving for the minimisation of the use of the regularisation parameter as described by Kirkeby et al. (1999) and Farina et al. (2001). This is because, although regularisation accounts for any ill-conditioning that the system may possess, it is at the expense of crosstalk cancellation accuracy. This can have the effect of the images pulling towards the speakers at these frequencies (Kirkeby et al., 1999). In this report a number of inverse filtering steps were taken, where single inversion was used to reduce the amount of regularisation needed, and double inversion was used to remove the need for regularisation completely. However, this has the effect of altering the frequency response of the crosstalk cancelled system quite noticeably when the speakers are set up in an optimum configuration (that is, closely spaced). Nevertheless, this is still not the whole picture.
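To make the role of the regularisation parameter concrete, the sketch below shows a generic, per frequency bin calculation of the two crosstalk cancellation filters for a symmetric two speaker arrangement, in the style described by Kirkeby et al. (1999); it is an illustration rather than a listing of the filters designed in Chapter 5. With beta set to zero at every bin it reduces to the plain inversion performed by the freqdip function given in the Appendix.

#include <complex>
#include <vector>

typedef std::complex<float> cpx;

// Generic sketch of regularised crosstalk cancellation filter design for a
// symmetric two speaker pair. C1 and C2 are the FFTs of the ipsilateral and
// contralateral HRTFs, beta is a frequency dependent regularisation weight
// (small in band, larger out of band).
void XTalkFilters(const std::vector<cpx> &C1, const std::vector<cpx> &C2,
                  const std::vector<float> &beta,
                  std::vector<cpx> &H1, std::vector<cpx> &H2)
{
    const size_t N = C1.size();
    H1.resize(N); H2.resize(N);
    for (size_t k = 0; k < N; ++k)
    {
        cpx S = C1[k] + C2[k];                        // sum (in phase) channel
        cpx D = C1[k] - C2[k];                        // difference channel
        cpx invS = std::conj(S) / (std::norm(S) + beta[k]);
        cpx invD = std::conj(D) / (std::norm(D) + beta[k]);
        H1[k] = 0.5f * (invS + invD);                 // direct path filter
        H2[k] = 0.5f * (invS - invD);                 // crosstalk path filter
    }
    // An inverse FFT, circular shift (modelling delay) and window would then
    // give the time domain filters.
}

Making beta(f) small inside the band where accurate cancellation is wanted, and larger outside it, is what trades cancellation accuracy against the ill-conditioning (and, for closely spaced speakers, the associated bass boost) discussed above.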
The single inverted filters show (mathematically speaking) that no bass boost is perceived by the listener, although it is noticed in reality, and the double inverse filtering takes away too much bass response. A filter part way between these two extremes is needed, and this is the next step in the development of the crosstalk cancellation filter structures. Also, much work is still needed in how it is that the listener actually perceives the sound stage of a crosstalk cancelled system as a number of interesting ‘features’ have been noted during informal listening tests. • When listening to straight binaural pieces (where the crosstalk cancellation system still works best), good distance perception is apparent, with sources able to appear closer and further away than the speakers actually • are. Room reflections can have an interesting effect on the playback. If the two speakers are against the wall, then the perceived material is, for the most part (see above), located in a semi-circle around the front of the listener. However, if the speakers are moved inwards, then the material is generally still perceived towards the back of the room. In this way, it is as if the room is superimposed onto the recorded material. - 256 - Chapter 7 These are two situations that need further investigation, as they may hold more clues as to our distance perception models, one attribute that can be difficult to synthesise in audio presentations. Overall, it is the original Ambisonic system that sounds the most natural, although much of this could be attributed to the filters used in the HRTF processing. With filters recorded in a non-anechoic room and a better speaker/microphone combination it may be possible to achieve a more out-ofhead experience, especially if accompanied with some form of head-tracking, where the rotation could be carried out using a standard B-format transformation, removing the need for complex dynamic filter changing in realtime (where careful interpolation is needed to eliminate audible artefacts when moving between the different HRTF filter structures) as recently demonstrated by Noisternig et al (2003). - 257 - References Chapter 8 - References Alexander, R.C. (1997) Chapter Three – The Audio Patents. Retrieved: May, 2003, from http://www.doramusic.com/chapterthree.htm, Focal Press. Atal, B.S. (1966) Apparent Sound Source Translator. US Patent 3236949. Bamford, J.S. (1995) An Analysis of Ambisonic Sound Systems of First and Second Order, Master of Science thesis, University of Waterloo, Ontario, Canada. Begault, D.R. (2000) 3-D Sound for Virtual Reality and Multimedia, Retrieved: March, 2003, from http://humanfactors.arc.nasa.gov/ihh/spatial/papers/pdfs_db/Begault_2000_3d_Sound_Mu ltimedia.pdf, NASA. Berg, J., Rumsey, R. (2001) Verification and Correlation of Attributes Used For Describing the Spatial Quality of Reproduced Sound. Proceedings of the 19th International AES Conference, Germany. p. 233 – 251. Berkhout, A.J. et al. (1992) Acoustic Control by Wave Field Synthesis. Journal of the AES, Vol. 93, Num. 5, p. 2765 – 2778. Berry, S. & Lowndes V. (2001) Deriving a Memetic Algorithm to Solve Heat Flow Problems. University of Derby Technical Report. Blauert, J. (1997) Spatial Hearing – The Psychophysics of Human Sound Localization, MIT Press, Cambridge. Blumlein, A. (1931) Improvements in and relating to Sound-transmission, Sound-recording and Sound-reproducing Systems, British Patent Application 394325. 
- 258 - References Borland Software Corporation (2003) C++ Builder Studio Main Product Page. Retrieved: August, 2003, from http://www.borland.com/cbuilder/index.html. Borwick, J. (1981) Could ‘Surround Sound’ Bounce Back. The Gramophone, February, p 1125-1126. Brown, C. P. & Duda, R. O. (1997) An Efficient HRTF Model for 3-D Sound, Retrieved: April, 2003, from http://interface.cipic.ucdavis.edu/PAPERS/Brown1997(Efficient3dHRTFModel s).pdf. CMedia (N.D.) An Introduction to Xear 3D™Sound Technology, Retrieved: July, 2004 from http://www.cmedia.com.tw/doc/Xear%203D.pdf Craven, P.G., Gerzon, M.A. (1977) Coincident Microphone Simulation Covering Three Dimensional Space and Yielding Various Directional Outputs, U.S. Patent no 4042779. Craven, P. (2003), Continuous Surround Panning for 5-speaker Reproduction, th AES 24 International Conference, Banff, Canada. Daniel, J. et al. (2003) Further Investigations of High Order Ambisonics and Wavefield Synthesis for Holophonic Sound Imaging. 114th AES Convention, Amsterdam. Preprint 5788 De Lancie, P. (1998) Meridian Lossless Packing:Enabling High-Resolution Surround on DVD-Audio. Retrieved: July, 2004 from http://www.meridianaudio.com/p_mlp_mix.htm. Dolby Labs (2002) A history of Dolby Labs. Retrieved: June, 2003, from http://www.dolby.com/company/is.ot.0009.History.08.html. Dolby Labs (2004) Dolby Digital – General. Retreived: July, 2004 from http://www.dolby.com/digital/diggenl.html. - 259 - References Duda (1993) Modeling Head Related Transfer Functions. Preprint for the 27th Asilomar Conference on Signals, Systems & Computers, Asilomar, October 31st – November 3rd. Farina, A. et al. (2001) Ambiophonic Principles for the Recording and Reproduction of Surround Sound for Music. Proceedings of the 19th AES International Conference of Surround Sound, Schloss Elmau, Germany, p. 2646. Farino A., Ugolotti E. (1998) Software Implementation of B-Format Encoding and Decoding. Preprints of the 104th International AES Convention, Amsterdam, 15 – 20 May. Farrah, K. (1979a) Soundfield Microphone – Design and development of microphone and control unit. Wireless World, October, p. 48-50. Farrar, K. (1979b) Soundfield Microphone. Parts 1 & 2. Wireless World, October & November. p. 48 – 50 & p. 99 – 103 Kramer, L. (N.D.) DTS: Brief History and Technical Overview. Retrieved: July, 2004 from http://www.dtsonline.com/media/uploads/pdfs/history,whitepapers,downloads. pdf. Furse, R. (n.d.) 3D Audio Links and Information. Retrieved: May, 2003, from http://www.muse.demon.co.uk/3daudio.html. Gardner B., Martin K. (1994) HRTF Measurements of a KEMAR DummyHead Microphone, Retrieved: May, 2003, from http://sound.media.mit.edu/KEMAR.html. Gerzon, M. A. (1974a) Sound Reproduction Systems. Patent No. 1494751. - 260 - References Gerzon, M. A. (1974b) What’s wrong with Quadraphonics. Retrieved: July, 2004 from http://www.audiosignal.co.uk/What's%20wrong%20with%20quadraphonics.ht ml Gerzon, M.A. (1977a) Sound Reproduction Systems. UK Patent No. 1494751. Gerzon, M. A. (1977b) Multi-system Ambisonic Decoder, parts 1 & 2. Wireless World, July & August. p. 43 – 47 & p. 63 – 73. Gerzon, M.A. (1985) Ambisonics in Multichannel Broadcasting and Video. Journal of the Audio Engineering Society, Vol. 33, No. 11, p. 851-871. Gerzon, M. A. & Barton, G. J. (1992) Ambisonic Decoders for HDTV. Proceedings of the 92nd International AES Convention, Vienna. 24 – 27 March. Preprint 3345. Gerzon, M.A. (1992a) Optimum Reproduction Matrices for Multispeaker Stereo. Journal of the AES, Vol. 40, No. 7/8, p. 
571 – 589. Gerzon M. (1992b) Psychoacoustic Decoders for Multispeaker Stereo and Surround Sound. Proceedings of the 93rd International AES Convention, San Francisco. October Preprint 3406 Gerzon, M.A. (1992c) General Methatheory of Auditory Localisation. 92nd International AES Convention, Vienna, 24 – 27 March Preprint 3306. Gerzon, M.A. (1994) Application of Blumlein Shuffling to Stereo Microphone Techniques. Journal of the AES, vol. 42, no. 6, p. 435-453. Gerzon, M.A, Barton, G.J. (1998) Surround Sound Apparatus. U.S. Patent No. 5,757,927 - 261 - References Glasgal, R. (2001) The Ambiophone - Derivation of a Recording Methodology Optimized for Ambiophonic Reproduction. Proceedings of the 19th AES International Conference, Germany, 21 – 24 June. p. 13-25. Glasgal, R. (2003a) The Blumlein Conspiracy. Retrieved: August, 2003, from http://www.ambiophonics.org/blumlein_conspiracy.htm. Glasgal, R. (2003b) AmbioPhonics – Chapter 4, Pinna Power. Retrieved: June, 2003, from http://www.ambiophonics.org/Ch_4_ambiophonics_2nd_edition.htm. Glasgal, R. (2003c) Ambiophonics - The Science of Domestic Concert Hall Design. Retrieved: May, 2003, from http://www.ambiophonics.org. Gulick, W.L. et al. (1989) Hearing – Physiological Acoustics, Neural Coding, and Psychoacoustics, Oxford University Press, New York. Huopaniemi, J. et al (1999) Objective and Subjective Evaluation of HeadRelated Transfer Function Filter Design. Journal of the Audio Engineers Society, Vol 47, No. 4, p218-239 Inanaga, K. et al. (1995) Headphone System with Out-of-Head Localisation Applying Dynamic HRTF. 98th International AES Convention, Paris, 25 – 28 February. Preprint 4011. Intel Corporation (2003a), Intel Corporation. Retrieved: June, 2003, from http://www.intel.com. Intel Corporation (2003b) Intel® Software Development Projects. Retreived: August, 2003, from http://www.intel.com/software/products/ipp/ipp30/index.htm. Ircam (2002) Carrouso. Retrieved: July, 2004, from http://www.ircam.fr/produits/technologies/CARROUSO-e.html - 262 - References Kahana, Y. et al (1997). Objective and Subjective Assessment of Systems for the Production of Virtual Acoustic Images for Multiple Listeners. 103rd AES Convention, New York, September. Preprint 4573 Kay, J. et al. (1998) Film Sound History – 40’s. Retrieved: August, 2003, from http://www.mtsu.edu/~smpte/forties.html. Kientzle, T. (1997) A Programmer’s Guide to Sound, Addison Wesley. New York. Kirkeby, O. et al. (1999) Analysis of Ill-Conditioning of Multi-Channel Deconvolution Problems. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New York. 17 – 20 October Kleiner, M. (1978) Problems in the Design and Use of ‘Dummy-Heads’. Acustica, Vol. 41, p. 183-193. Lake DSP (1997) Lake DSP Acoustic Explorer CD/CD-ROM v2, Lake DSP Pty. Ltd. Leese, M. (n.d.) Ambisonic Surround Sound. Retrieved: August, 2003, from http://members.tripod.com/martin_leese/Ambisonic/ Leitner et al (2000) Multi-Channel Sound Reproduction system for Binaural signals – The Ambisonic Approach. Proceedings of the COST G-6 Conference on Digital Audio Effects (DAFX-00., Verona, Italy, December, p. 277 – 280. Lopez, J.J., Gonzalez, A. (2001) PC Based Real-Time Multichannel Convolver for Ambiophonic Reproduction. Proceedings of the 19th International Conference of Surround Sound, Germany, 21 – 24 June. p. 47-53. - 263 - References Mackerson, P. et al. (1999) Binaural Room Scanning – A New Tool for Acoustic and Psychoacoustic Research. 
Retrieved: May, 2003, from http://www.irt.de/wittek/hauptmikrofon/theile/BRS_DAGA_1999_Paper.PDF. Malham, D. (1998) Spatial Hearing Mechanisms and Sound Reproduction. Retrieved: June,2003, from http://www.york.ac.uk/inst/mustech/3d_audio/ambis2.htm. Malham, D. (2002) Second and Third Order Ambisonics. Retrieved: August, 2003, from http://www.york.ac.uk/inst/mustech/3d_audio/secondor.html. Martin, G., et al. (2001) A Hybrid Model For Simulating Diffused First Reflections in Two-Dimensional Synthetic Acoustic Environments. Proceedings of the 19th International AES Conference, Germany. p. 339 – 355. Mason, R., et al (2000) Verbal and non-verbal elicitation techniques in the subjective assessment of spatial sound reproduction. Presented at 109th AES Convention, Los Angeles, 22-25 September. Preprint 5225. McGriffy, D (2002) Visual Virtual Microphone. Retrieved: August, 2003, from, http://mcgriffy.com/audio/ambisonic/vvmic/. McKeag, A., McGrath, D. (1996) Sound Field Format to Binaural Decoder with Head-Tracking. 6th Austrailian Regional Convention of the AES, Melbourne, Austrailia. 10 – 12 September. Preprint 4302. McKeag, A., McGrath, D.S. (1997) Using Auralisation Techniques to Render 5.1 Surround To Binaural and Playback. 102nd AES Convention in Munich, Germany, 22 – 25 March. preprint 4458 Microphone Techniques (n.d.). Retrieved: August, 2003, from http://www.mhsoft.nl/MicTips.asp, - 264 - References Microsoft Corporation (2003), Retrieved: June 2003, from http://www.microsoft.com/windows/. MIT Media Lab (2000) MPEG-4 Structured Audio (MP4 Structured Audio). Retrieved: August, 2003, from http://sound.media.mit.edu/mpeg4/. Moller, H. et al. (1996) Binaural Technique: Do We Need Individual Recordings? Journal of the AES, Vol. 44, No. 6, p. 451 – 468. Moller, H. et al. (1999) Evaluation of Artificial Heads in Listening Tests. J. Acoust. Soc. Am. 47(3), p. 83-100. Multi Media Projekt Verdi (2002) Design of the Listening Test. Retrieved: July, 2004 from http://www.stud.tu-ilmenau.de/~proverdi/indexen.html. Nelson, P.A. et al. (1997) Sound Fields for the Production of Virtual Acoustic Images. Journal of Sound and Vibration, Vol. 204(2), p. 386-396. Nielsen, S. (1991) Depth Perception – Finding a Design Goal for Sound Reproduction systems. 90th AES Convention, Paris. Preprint 3069. Noisternig, M. et al. (2003) A 3D Ambisonic Based Binaural Sound Reproduction System. Proceedings of the 24th International Conference on Multichannel Audio, Banff, Canada. Paper 1 Orduna, F. et al. (1995) Subjective Evaluation of a Virtual Source Emulation System. Proceedings of Active 95, Newport Beach, USA. P. 1271 – 1278. Paterson-Stephens I., Bateman A. (2001) The DSP Handbook, Algorithms, Applications and Design Techniques, Prentice Hall. Harlow Petzold, C. (1998) Programming Windows – The definitive guide to the Win32 API, Microsoft Press, New York. - 265 - References Poletti, M. (2000) A Unified Theory of Horizontal Holographic Sound Systems. Journal of the AES, Vol. 48, No. 12, p. 1155 – 1182. Pulkki, V. (1997) Virtual sound source positioning using vector base amplitude panning. Journal of the Audio Engineering Society, Vol. 45, No. 6 p. 456-466. Rossing T. (1990) The Science of Sound, Addison Wesley. Reading Rumsey, F., McCormick, T. (1994) Sound & Recording – an introduction, Focal Press. Oxford Ryan, C. and Furlong, D. (1995) Effects of headphone placement on headphone equalisation for binaural reproduction. 98th International Convention of the Audio Engineering Society, Paris, 25 – 28 February. 
preprint no. 4009. Savioja, L (1999) Air Absorption. Retrieved: July, 2004, from http://www.tml.hut.fi/~las/publications/thesis/Air_Absorption.html. Schillebeeckx, P. et al. (2001) Using Matlab/Simulink as an implementation tool for Multi-Channel Surround Sound. Proceedings of the 19th International AES conference on Surround Sound, Schloss Elmau, Germany, 21 – 25 June. p. 366-372. Serendipity (2000) SERENDIPITY- Audio, Music, Recording and Mastering Studio. Retrieved: August, 2003, from http://www.seripity.demon.co.uk/. Sibbald, A. (2000) Virtual Audio for Headphones. Retrieved: July 2004, from http://www.sensaura.com/whitepapers/pdfs/devpc007.pdf Sontacchi, A., Holdrich, R. (2003) Optimization Criteria For Distance Coding in 3D Sound Fields. 24th International AES Conference on Multichannel Audio, Banff. Paper 32. - 266 - References SoundField Ltd. (n.d. a) SP451 Surround Sound Processor. Retrieved: August, 2003, from http://www.soundfield.com/sp451.htm. SoundField Ltd. (n.d. b). Retrieved: August, 2003, from http://www.soundfield.com. Spikofski, G., Fruhmann, M. (2001) Optimization of Binaural Room Scanning (BRS): Considering inter-individual HRTF-characteristics. In: Proceedings of the AES 19th International Conference, Schloss Elmau, Germany 21 – 25 June. p.124-134. Steinberg, J., Snow, W. (1934) Auditory Perspective – Physical Factors. In: Electrical Engineering, January, p.12-17. Surround Sound Mailing List Archive (2001), Retrieved: June, 2003, from http://www.tonmeister.de/foren/surround/ssf_archiv/SSF_Diskussion_2001_12 _2.pdf, p. 5. Sydec Audio Engineering (2003), Retrieved: June 2003, from http://www.sydec.be. The MathWorks (2003), Retrieved: June 2003, from http://www.mathworks.com/. Theile, G. (2001) Multi-channel Natural Music Recording Based on Psychoacoustic Principles. Extended version of the paper presented at the AES 19th International Conference. Schloss Elmau, Germany, 21 – 25 June. Retrieved: May, 2003, from http://www.irt.de/IRT/FuE/as/multi-mr-ext.pdf. University of Erlangen-Nuremberg (N.D), Wave Field Synthesis and Analysis, Retrieved: July, 2004 from http://www.lnt.de/LMS/research/projects/WFS/index.php?lang=eng - 267 - References Verheijen, E.N.G. et al. (1995) Evaluation of Loudspeaker Arrays for Wave Field Synthesis in Audio Reproduction. 98th International AES Convention, Paris, 25 – 28 February. preprint 3974. Vermeulen, J. (n.d.) The Art of Optimising – Part 1. Retrieved: August, 2003, from http://www.cfxweb.net/modules.php?name=News&file=article&sid=630. Wiggins, B. et al. (2001) The analysis of multi-channel sound reproduction algorithms using HRTF data. 19th International AES Surround Sound Convention, Schloss Elmau, Germany, 21 – 24 June. p. 111-123. Wiggins, B. et al. (2003) The Design and Optimisation of Surround Sound Decoders Using Heuristic Methods. Proceedings of UKSim 2003, Conference of the UK Simulation Society p.106-114. Wightman, F.L. and Kistler, D.J. (1992). The Dominant Role of LowFrequency Interaural Time Differences in Sound Localization. Journal of the Acoustical Society of America 91(3), p. 1648-1661 Zacharov, N. et al. (1999) Round Robin Subjective Evaluation of Virtual Home Theatre Sound Systems At The AES 16th International Conference. Proceedings of the 16th AES International Conference, Finland. p. 544 – 553. Zwicker, E., Fastl, H. (1999) Psychoacoustics – Facts and Models, Springer. Berlin. - 268 - Appendix Chapter 9 - Appendix In this appendix, example code is given for selected programs used in this investigation. 
A list of all code is not given due to the extensive amount of C and Matlab code used during this research, but significant programs are given so as to aid in the reproduction of the programs that are not present. The Matlab script code is given in the first part of this appendix, followed by two programs written in C++ for the Windows operating system. 9.1 Matlab Code 9.1.1 Matlab Code Used to Show Phase differences created in Blumlein’s Stereo %Blumlien Stereo Phase differences %Showing amplitude differences at a %speaker converted to phase differences %at the ears of a listener N = 1024; fs = 1024; n=0:N; f = 2; %Create Left and Right Speaker Feeds %Along with phase shifted versions Left = sin(f*2*pi*n/fs); Leftd = sin(f*2*pi*n/fs - pi/2); Right = 0.3 * sin(f*2*pi*n/fs); Rightd = 0.3 * sin(f*2*pi*n/fs - pi/2); %Sum Example Signals arriving at Ears LeftEar = Left + Rightd; RightEar = Right + Leftd; %Plot Speaker Signals figure(1) clf; subplot(2,1,1) plot(Left); hold on plot(Right,'r'); legend('Left Speaker','Right Speaker'); ylabel('Amplitude'); xlabel('Samples'); axis([0 N -1.2 1.2 ]); %Plot Signals Arriving at Ears subplot(2,1,2) plot(LeftEar); hold on; plot(RightEar,'r'); legend('Left Ear','Right Ear'); - 269 - Appendix ylabel('Amplitude'); xlabel('Samples'); axis([0 N -1.2 1.2 ]); 9.1.2 Matlab Code Used to Demonstrate Simple Blumlein Spatial Equalisation %Example of Blumleins Spatial Equalisation %used to align auditory cues in Stereo angle=0:2*pi/127:2*pi; Sum Dif = sin(angle); = cos(angle); Left Right = (Sum - Dif)/1.13; = (Sum + Dif)/1.13; %Angle Offset used in spatial EQ offset = pi/16; %Derive Left and Right Speaker feeds for both %Low and High frequencies SumL = (sin(pi/4-offset)*Sum+cos(pi/4-offset)*Dif); SumH = (sin(pi/4)*Sum+cos(pi/4)*Dif); %Plot Mid and Side Signals figure(1) clf; polar(angle,abs(Sum)); hold on polar(angle,abs(Dif),'r'); legend('Mid','Side'); FSize = 16; Co = 0.4; text(Co,0,'+','FontSize',FSize); text(-Co,0,'-','FontSize',FSize+4); text(0,-Co,'+','FontSize',FSize); text(0,Co,'-','FontSize',FSize+4); %Plot M+S and M-S figure(2) clf; polar(angle,abs(Right)); hold on polar(angle,abs(Left),'r'); legend('Sum of MS','Difference of MS'); FSize = 16; Co = 0.5; text(0,Co,'+','FontSize',FSize); text(0,-Co,'-','FontSize',FSize+4); text(Co,0,'+','FontSize',FSize); text(-Co,0,'-','FontSize',FSize+4); %Plot Low and High Frequency Versions %of the Left and Right Speaker Feeds figure(3) clf; polar(angle,abs(SumL)); hold on; polar(angle,abs(SumH),'r'); - 270 - Appendix legend('Low Frequency Pickup','High Frequency Pickup'); 9.1.3 Matlab Code Used To Plot Spherical Harmonics %Plot 0th and 1st Order Spherical Harmonics %Reolution N=32; %Setup Angle Arrays Azim = 0:2*pi/(N-1):2*pi; Elev = -pi/2:pi/(N-1):pi/2; %Loop Used to create Matrices representing X,Y,Z and %Colour Values for W,X,Y and Z B-format signals a=1; b=1; for i=2:N for j=2:N r=1/sqrt(2); [WX(a ,b ),WY(a ,b ),WZ(a ,b )]= ... sph2cart(Azim(i-1),Elev(j-1),1/sqrt(2)); [WX(a+1,b ),WY(a+1,b ),WZ(a+1,b )]= ... sph2cart(Azim(i-1),Elev(j ),1/sqrt(2)); [WX(a+2,b ),WY(a+2,b ),WZ(a+2,b )]= ... sph2cart(Azim(i ),Elev(j ),1/sqrt(2)); [WX(a+3,b ),WY(a+3,b ),WZ(a+3,b )]= ... sph2cart(Azim(i ),Elev(j-1),1/sqrt(2)); [WX(a+4,b ),WY(a+4,b ),WZ(a+4,b )]= ... sph2cart(Azim(i-1),Elev(j-1),1/sqrt(2)); if(r>=0) WC(:,b)=[1;1;1;1;0]; else WC(:,b)=[0;0;0;0;0]; end r=cos(Azim(i-1))*cos(Elev(j-1)); [XX(a ,b ),XY(a ,b ),XZ(a ,b )]= ... sph2cart(Azim(i-1),Elev(j-1),abs(r)); r=cos(Azim(i-1))*cos(Elev(j)); [XX(a+1,b ),XY(a+1,b ),XZ(a+1,b )]= ... 
sph2cart(Azim(i-1),Elev(j ),abs(r)); r=cos(Azim(i ))*cos(Elev(j)); [XX(a+2,b ),XY(a+2,b ),XZ(a+2,b )]= ... sph2cart(Azim(i ),Elev(j ),abs(r)); r=cos(Azim(i ))*cos(Elev(j-1)); [XX(a+3,b ),XY(a+3,b ),XZ(a+3,b )]= ... sph2cart(Azim(i ),Elev(j-1),abs(r)); r=cos(Azim(i-1))*cos(Elev(j-1)); [XX(a+4,b ),XY(a+4,b ),XZ(a+4,b )]= ... sph2cart(Azim(i-1),Elev(j-1),abs(r)); if(r>=0) XC(:,b)=[1;1;1;1;0]; else XC(:,b)=[0;0;0;0;0]; end r=sin(Azim(i-1))*cos(Elev(j-1)); [YX(a ,b ),YY(a ,b ),YZ(a ,b )]= ... sph2cart(Azim(i-1),Elev(j-1),abs(r)); r=sin(Azim(i-1))*cos(Elev(j)); [YX(a+1,b ),YY(a+1,b ),YZ(a+1,b )]= ... sph2cart(Azim(i-1),Elev(j ),abs(r)); r=sin(Azim(i ))*cos(Elev(j)); - 271 - Appendix [YX(a+2,b ),YY(a+2,b ),YZ(a+2,b )]= ... sph2cart(Azim(i ),Elev(j ),abs(r)); r=sin(Azim(i ))*cos(Elev(j-1)); [YX(a+3,b ),YY(a+3,b ),YZ(a+3,b )]= ... sph2cart(Azim(i ),Elev(j-1),abs(r)); r=sin(Azim(i-1))*cos(Elev(j-1)); [YX(a+4,b ),YY(a+4,b ),YZ(a+4,b )]= ... sph2cart(Azim(i-1),Elev(j-1),abs(r)); if(r>=0) YC(:,b)=[1;1;1;1;0]; else YC(:,b)=[0;0;0;0;0]; end r=sin(Elev(j-1)); [ZX(a ,b ),ZY(a ,b ),ZZ(a ,b )]= ... sph2cart(Azim(i-1),Elev(j-1),abs(r)); r=sin(Elev(j)); [ZX(a+1,b ),ZY(a+1,b ),ZZ(a+1,b )]= ... sph2cart(Azim(i-1),Elev(j ),abs(r)); r=sin(Elev(j)); [ZX(a+2,b ),ZY(a+2,b ),ZZ(a+2,b )]= ... sph2cart(Azim(i ),Elev(j ),abs(r)); r=sin(Elev(j-1)); [ZX(a+3,b ),ZY(a+3,b ),ZZ(a+3,b )]= ... sph2cart(Azim(i ),Elev(j-1),abs(r)); r=sin(Elev(j-1)); [ZX(a+4,b ),ZY(a+4,b ),ZZ(a+4,b )]= ... sph2cart(Azim(i-1),Elev(j-1),abs(r)); if(r>=0) ZC(:,b)=[1;1;1;1;0]; else ZC(:,b)=[0;0;0;0;0]; end b=b+1; end end %Plot W figure(1) fill3(WX,WY,WZ,WC); light; lighting phong; shading interp; axis equal axis off; view(-40,30); axis([-1 1 -1 1 -1 1]); %Plot X figure(2) fill3(XX,XY,XZ,XC); light; lighting phong; shading interp; axis equal axis off; view(-40,30); axis([-1 1 -1 1 -1 1]); %Plot Y figure(3) fill3(YX,YY,YZ,YC); light; - 272 - Appendix lighting phong; shading interp; axis equal axis off; view(-40,30); axis([-1 1 -1 1 -1 1]); %Plot Z figure(4) fill3(ZX,ZY,ZZ,ZC); light; lighting phong; shading interp; axis equal axis off; view(-40,30); axis([-1 1 -1 1 -1 1]); 9.1.4 Code used to plot A-format capsule responses (in 2D) using oversampling. %scaling sc=1.5; %oversampling fsmult = 64; %number of capsules noofcaps = 4; %sampling frequency fs = 48000 * fsmult; h=figure(1) h1=figure(3) set(h,'DoubleBuffer','on'); set(h1,'DoubleBuffer','on'); i=0; %capsule spacing spacing = 0.012; %resolution N=360*32; n=0:2*pi/(N-1):2*pi; n=n'; AOffset = 2*pi/(2*noofcaps):2*pi/(noofcaps):2*pi; POffsetx = spacing * cos(AOffset); POffsety = spacing * sin(-AOffset); xplot = zeros(N,noofcaps); yplot = zeros(N,noofcaps); for a=1:noofcaps CPolar = 0.5*(2+cos(n+AOffset(a))); [xplot(:,a),yplot(:,a)] = pol2cart(n,CPolar); xplot(:,a) =xplot(:,a) + POffsetx(a); yplot(:,a) =yplot(:,a) + POffsety(a); end %For loop uncomment out next line and comment out %the SignalAngle = 5... for SignalAngle = 0:2*pi/32:2*pi; %SignalAngle = deg2rad(0); i=i+1; figure(1) - 273 - Appendix clf hold on; plot(xplot,yplot,'LineWidth',1.5); signalx = cos(SignalAngle) * 2; signaly = sin(SignalAngle) * 2; plot([signalx,0],[signaly,0]); axis equal; title('Polar Diagram of A-Format and signal direction'); GainIndex = round(SignalAngle*(N-1)/(2*pi))+1; pos = 1; for a=1:noofcaps if a > noofcaps/4 & a <= 3 * noofcaps / 4 pos = -1; else pos = 1; end plot(xplot(GainIndex,a),yplot(GainIndex,a),'p','LineWidth',3); Gain(a) = sqrt((xplot(GainIndex,a)-POffsetx(a))^2 ... 
+ (yplot(GainIndex,a)-POffsety(a))^2); Gain8(a) = (sqrt((xplot(GainIndex,a)-POffsetx(a))^2 ... + (yplot(GainIndex,a)-POffsety(a))^2)) * pos; end axis([-sc,sc,-sc,sc]); Delay = spacing - (spacing * Gain); SDelay = (Delay*fs/340) + (spacing*fs/340) + 1; FilterBank = zeros(round(2*spacing*fs/340) + 1,1); FilterBank8 = zeros(round(2*spacing*fs/340) + 1,1); for a=1:noofcaps FilterBank(round(SDelay(a))) = ... FilterBank(round(SDelay(a))) + Gain(a)/2; FilterBank8(round(SDelay(a))) = ... FilterBank8(round(SDelay(a))) + Gain8(a)*sqrt(2); CD(a) = Delay(a); CG(a) = Gain(a); end figure(3) clf; subplot(2,1,1) stem(FilterBank); ylim([-4 4]); hold on; stem(FilterBank8,'r'); title('Omni and Figure of 8 impulses (8 imp taken from X rep)'); subplot(2,1,2) invFB = inversefilt(FilterBank); f = 20*log10(abs(fft(FilterBank/noofcaps,512*fsmult))); g = 20*log10(abs(fft(FilterBank8/noofcaps,512*fsmult))); h = 1./f; x = 120; plot(0:24000/255:24000,f(1:512*fsmult/(2*fsmult))) text(x*24000/255,f(x),'\leftarrow Omni Rep', ... 'HorizontalAlignment','left'); hold on; plot(0:24000/255:24000,g(1:512*fsmult/(2*fsmult)),'r') text(x*24000/255,g(x),'Figure of 8 Rep \rightarrow', ... 'HorizontalAlignment','right'); - 274 - Appendix title('Omni and Figure of 8 responses'); ylim([-20 6]); xlim([0 24000]); xlabel('Frequency (Hz)'); ylabel('Amplitude (dB)'); pause(0.1); %remember to uncomment me too!! end figure(2) clf; Wx = (xplot(:,1) Wy = (yplot(:,1) Xx = (xplot(:,1) Xy = (yplot(:,1) Yx = (xplot(:,1) Yy = (yplot(:,1) + + + + - xplot(:,2) yplot(:,2) xplot(:,2) yplot(:,2) xplot(:,2) yplot(:,2) + + - xplot(:,3) yplot(:,3) xplot(:,3) yplot(:,3) xplot(:,3) yplot(:,3) + + + + xplot(:,4))/2; yplot(:,4))/2; xplot(:,4))*sqrt(2); yplot(:,4))*sqrt(2); xplot(:,4))*sqrt(2); yplot(:,4))*sqrt(2); plot(Wx,Wy); hold on plot(Xx,Xy,'m'); plot(-Xx,-Xy,'m'); plot(Yx,Yy,'r'); plot(-Yx,-Yy,'r'); axis equal; title('Reconstructed polar diagram of B Format'); x = 0.5; text(x,0,'+X'); text(-x,0,'-X'); text(0,x,'+Y'); text(0,-x,'-Y'); 9.1.5 Code Used to Create Free Field Crosstalk Cancellation Filters %Create matlab free field dipole filters %Speakers = +/- 30 deg %Distance = 1m %Mic spacing radius = 7 cm (head radius) %Filter Size N = 1024; %Mic Spacing Radius MSpacing = 0.07; %Speaker spacing +/- n degrees SSpacing = 30; %Sampling Frequency fs = 96000; %Speed of Sound in Air c = 342; %Middle of Head x & y co-ords (speaker is at origin, symmetry %assumed) x = sin(deg2rad(SSpacing)); y = cos(deg2rad(SSpacing)); %Left and Right Mic Coords xr = x - MSpacing; yr = y; xl = x + MSpacing; - 275 - Appendix yl = y; %Calculate Distances from origin (speaker) rdist = sqrt(xr*xr + yr*yr); ldist = sqrt(xl*xl + yl*yl); %Calculate Amplitude difference at mics using inverse square law ADif = 1-(ldist-rdist); %Convert distance to time using speed of sound rtime = rdist/c; ltime = ldist/c; timedif = ltime - rtime; %Convert time to number of samples sampdif = round(timedif * fs); %Create filters h1=zeros(1,N); count=1; for a=1:N if a==1 h1(a) = 1; count=count+2; elseif round(a/(sampdif*2))==a/(sampdif*2) h1(a+1) = ADif^count; count=count+2; end end ht = zeros(1,sampdif+1); ht(sampdif+1) = -ADif; h2=conv(h1,ht); %Plot Time Domain Representation figure(1) clf; a=stem(h1); hold on b=stem(h2,'r'); set(a,'LineWidth',2); set(b,'LineWidth',2); title(['x-talk filters at +/- ',num2str(SSpacing),' degrees']); legend('h1',' ','h2',' '); ylabel('Amplitude'); xlabel('Sample Number (at 96kHz, c = 342ms-1)'); axis([0 1024 -1.05 1.05]); %Plot Frequency Domain Respresentation figure(2) 
clf;
freq=0:fs/(N-1):fs;
plot(freq,20*log10(abs(fft(h1))),'LineWidth',2);
hold on
plot(freq,20*log10(abs(fft(h2,1024))),'r:','LineWidth',2);
xlim([0 fs/4]);
title(['Frequency Response at +/- ',num2str(SSpacing),' degrees']);
xlabel('Frequency (Hz)');
ylabel('Amplitude (dB)');
legend('h1','h2');

9.1.6 Code Used to Create Crosstalk Cancellation Filters Using HRTF Data and Inverse Filtering Techniques

pinna = 1;
d = 'd:\matlab\hrtf\ofull\elev0\';
ref = wavread([d, 'L0e175a.wav']);
refR = ref(:,pinna);
ref = wavread([d, 'L0e185a.wav']);
refL = ref(:,pinna);
hrtf = wavread([d, 'L0e175a.wav']);
hrtfR = hrtf(:,pinna);
hrtf = wavread([d, 'L0e185a.wav']);
hrtfL = hrtf(:,pinna);
len=4096;
temp=zeros(1,len);
offset=2048;
temp(offset:offset-1+length(hrtfL))=refL;
iL=inversefilt(temp);
win=hanning(len);
iL=iL.*win';
figure(5)
clf;
plot(iL);
hold on
plot(win);
L2 = conv(hrtfL,iL);
R2 = conv(hrtfR,iL);
win=hanning(length(L2));
L2=L2.*win';
R2=R2.*win';
figure(1)
clf;
plot(L2);
hold on
plot(R2,'r');
figure(2)
clf;
freqz(L2);
figure(3)
clf;
freqz(R2);
[h1,h2] = freqdip([L2'],[R2'],len,0,0);
h1inv = inversefilt(h1,0.0);
h1i = conv(h1,h1inv);
h2i = conv(h2,h1inv);
h1i = h1i((len-1024):(len+1023));
h2i = h2i((len-1024):(len+1023));
win = hanning(length(h1i));
h1i = h1i .* win;
h2i = h2i .* win;
figure(6)
plot([h1i,h2i]);
h1i48 = resample(h1i,48000,44100);
h2i48 = resample(h2i,48000,44100);
h148 = resample(h1,48000,44100);
h248 = resample(h2,48000,44100);
%Carry out test dipole simulation
%c = wavread('h0e030a.wav');
%c1 = c(:,2);
%c2 = c(:,1);
c1 = hrtfL;
c2 = hrtfR;
source=zeros(8191,2);
source(1,1)=1;
dipolesig=[conv(source(:,1),h1i)+conv(source(:,2),h2i),conv(source(:,2),h1i)+conv(source(:,1),h2i)];
leftspeakerl=conv(dipolesig(:,1),c1);
leftspeakerr=conv(dipolesig(:,1),c2);
rightspeakerl=conv(dipolesig(:,2),c2);
rightspeakerr=conv(dipolesig(:,2),c1);
stereoout=[leftspeakerl+rightspeakerl,leftspeakerr+rightspeakerr];
figure(7)
clf;
freqz(stereoout(:,1));
hold on
freqz(stereoout(:,2));

9.1.7 Matlab Code Used in FreqDip Function for the Generation of Crosstalk Cancellation Filters

function [h1,h2]=freqdip(tc1,tc2,FiltLength,inband,outband)
%[h1,h2]=freqdip(tc1,tc2,FiltLength,inband,outband)
% Frequency Domain XTalk Cancellation Filters
Lf = 500;
Hf = 20000;
if(nargin<3)
    FiltLength=2048;
    inband=0.0002;
    outband=1;
elseif(nargin<5)
    inband=0.0002;
    outband=1;
end
LowerFreq=round(FiltLength*Lf/22050);
UpperFreq=round(FiltLength*Hf/22050);
reg=ones(FiltLength,1);
reg(1:LowerFreq) = outband;
reg(LowerFreq:UpperFreq) = inband;
reg(UpperFreq:FiltLength)= outband;
regx=0:22051/FiltLength:22050;
figure(1)
clf
plot(regx,reg);
c1=tc1;
c2=tc2;
fc1=fft(c1,FiltLength);
fc2=fft(c2,FiltLength);
fnc2=fft(-c2,FiltLength);
Filt=(fc1.*fc1)-(fc2.*fc2);
FiltDenom=1./Filt;
fh1=fc1.*FiltDenom;
fh2=fnc2.*FiltDenom;
w = hanning(FiltLength);
h1=real(ifft(fh1,FiltLength)) .* w;
h2=real(ifft(fh2,FiltLength)) .* w;
figure(2)
clf;
plot(h1)
hold on
plot(h2,'r');
figure(3)
clf
freqz(h1,1,length(h1),44100)
hold on
freqz(h2,1,length(h2),44100)
%Carry out test dipole simulation
source=zeros(1024,2);
source(1,1)=1;
dipolesig=[conv(source(:,1),h1)+conv(source(:,2),h2),conv(source(:,2),h1)+conv(source(:,1),h2)];
leftspeakerl=conv(dipolesig(:,1),c1);
leftspeakerr=conv(dipolesig(:,1),c2);
rightspeakerl=conv(dipolesig(:,2),c2);
rightspeakerr=conv(dipolesig(:,2),c1);
stereoout=[leftspeakerl+rightspeakerl,leftspeakerr+rightspeakerr];
figure(4)
plot(stereoout);

9.1.8 Matlab Code Used To Generate Inverse Filters

function res = inversefilt(signal,mix)
%RES = INVERSEFILT(SIGNAL)
if(nargin==1)
    mix = 1;
end
fftsize=2^(ceil(log2(length(signal))));
fsignal=fft(signal,fftsize);
mag = abs(fsignal);
ang = angle(fsignal);
newmag = 1./mag;
newang = -ang;
newfsignal = newmag.*exp(i*newang);
newsignal = real(ifft(newfsignal,fftsize));
if(nargin==1)
    res = newsignal(1:length(signal));
else
    out = newsignal(1:length(signal));
    a = grpdelay(out,1,fftsize);
    b = round(sum(a)/fftsize);
    sig = zeros(size(out));
    sig(b) = 1;
    fo = fft(out);
    fm = fft(sig);
    fomag = abs(fo);
    fmmag = abs(fm);
    foang = angle(fo);
    fmang = angle(fm);
    newmag = (mix * fomag) + ((1-mix) * fmmag);
    newang = fmang;
    newfft = newmag.*exp(i*newang);
    fres = ifft(newfft,fftsize);
    res = real(fres);
    res = res(1:length(signal));
end

9.2 Windows C++ Code

9.2.1 Code Used for Heuristic Ambisonic Decoder Optimisations

//-------------------------------------------------------------------
//----------------------------MAIN.CPP------------------------------
//-------------------------------------------------------------------
#include <vcl.h>
#pragma hdrstop
#include "Main.h"
#include <math.h>
#include <fstream.h>
//-------------------------------------------------------------------
#pragma package(smart_init)
#pragma link "VolSlider"
#pragma link "RotorSlider"
#pragma link "LevelMeter"
#pragma resource "*.dfm"
TForm1 *Form1;
//-------------------------------------------------------------------
__fastcall TForm1::TForm1(TComponent* Owner)
    : TForm(Owner)
{
    LamL=LamH=1;
    OGainL=OGainH=1;
    SliderLength=32768;
    Bitmap = new Graphics::TBitmap;
    Bitmap2 = new Graphics::TBitmap;
    Bitmap->Height = Bevel1->Height-4;
    Bitmap->Width = Bevel1->Width-4;
    Bitmap2->Height = Bevel2->Height-4;
    Bitmap2->Width = Bevel2->Width-4;
    MaxX = Bitmap->Width/2;
    MaxY = Bitmap->Height/2;
    NoOfSpeakers = 5;
    SpeakPos[0] = 0;
    SpeakPos[1] = Deg2Rad(30);
    SpeakPos[2] = Deg2Rad(115);
    SpeakPos[3] = Deg2Rad(-115);
    SpeakPos[4] = Deg2Rad(-30);
    ListBox1->ItemIndex=0;
    ListBox1Click(this);
    WGain[0] = WGainH[0] = (double)VolSlider1->Position/SliderLength;
    WGain[1] = WGainH[1] = (double)VolSlider3->Position/SliderLength;
    WGain[2] = WGainH[2] = (double)VolSlider6->Position/SliderLength;
    XGain[0] = XGainH[0] = (double)VolSlider2->Position/SliderLength;
    XGain[1] = XGainH[1] = (double)VolSlider4->Position/SliderLength;
    XGain[2] = XGainH[2] = -(double)VolSlider7->Position/SliderLength;
    YGain[1] = YGainH[1] = (double)VolSlider5->Position/SliderLength;
    YGain[2] = YGainH[2] = (double)VolSlider8->Position/SliderLength;
    RadioGroup1->ItemIndex=1;
    VolSlider1Change(this);
    RadioGroup1->ItemIndex=0;
    VolSlider1Change(this);
}
//-------------------------------------------------------------------
double TForm1::Deg2Rad(double Deg)
{
    return (Deg*M_PI/180);
}
//-------------------------------------------------------------------
void TForm1::GPaint()
{
    long a,b,c,d;
    int SpRad = 5;
    Bitmap->Canvas->Pen->Style = psDot;
    Bitmap->Canvas->Pen->Color = clBlack;
    Bitmap->Canvas->Brush->Style = bsSolid;
    Bitmap->Canvas->Brush->Color = clWhite;
    Bitmap->Canvas->Rectangle(0,0,Bitmap->Width,Bitmap->Height);
    Bitmap->Canvas->Ellipse(0,0,Bitmap->Width,Bitmap->Height);
    Bitmap->Canvas->Pen->Style = psSolid;
    Bitmap->Canvas->Brush->Style = bsSolid;
    Bitmap->Canvas->Brush->Color = clBlue;
    for(int i=0;i<NoOfSpeakers;i++)
    {
        double x,y;
        int r = MaxY - 10;
        x = r * cos(SpeakPos[i]) + MaxX;
        y = r * sin(SpeakPos[i]) + MaxY;
        Bitmap->Canvas->Rectangle(x-SpRad,y-SpRad,x+SpRad,y+SpRad);
    }
    double r8 =
0.35355339059327376220042218105242; double r2 = 0.70710678118654752440084436210485; double MFitnessL=0,AFitnessL=0,OFitnessL=0,VFitnessL=0,Ang; - 282 - Appendix double MFitnessH=0,AFitnessH=0,OFitnessH=0,VFitnessH=0; for(int i=0;i<360;i++) { double Rad = Deg2Rad(i); WSig = 1/sqrt(2); XSig = cos(Rad); YSig = sin(Rad); WSigL = (0.5*(LamL+ILamL)*WSig) + (r8*(LamL-ILamL)*XSig); XSigL = (0.5*(LamL+ILamL)*XSig) + (r2*(LamL-ILamL)*WSig); YSigL = YSig; WSigH = (0.5*(LamH+ILamH)*WSig) + (r8*(LamH-ILamH)*XSig); XSigH = (0.5*(LamH+ILamH)*XSig) + (r2*(LamH-ILamH)*WSig); YSigH = YSig; SpGain[0] = (WGain[0]*WSigL + XGain[0]*XSigL); SpGain[1] = (WGain[1]*WSigL + XGain[1]*XSigL + YGain[1]*YSigL); SpGain[2] = (WGain[2]*WSigL + XGain[2]*XSigL + YGain[2]*YSigL); SpGain[3] = (WGain[2]*WSigL + XGain[2]*XSigL YGain[2]*YSigL); SpGain[4] = (WGain[1]*WSigL + XGain[1]*XSigL – YGain[1]*YSigL); SpGainH[0] = (WGainH[0]*WSigH + XGainH[0]*XSigH); SpGainH[1] = (WGainH[1]*WSigH + XGainH[1]*XSigH + YGainH[1]*YSigH); SpGainH[2] = (WGainH[2]*WSigH + XGainH[2]*XSigH + YGainH[2]*YSigH); SpGainH[3] = (WGainH[2]*WSigH + XGainH[2]*XSigH – YGainH[2]*YSigH); SpGainH[4] = (WGainH[1]*WSigH + XGainH[1]*XSigH – YGainH[1]*YSigH); P=P2=E=VecLowX=VecLowY=VecHighX=VecHighY=0; for(int j=0;j<NoOfSpeakers;j++) { P+=SpGain[j]; P2+=SpGainH[j]*SpGainH[j]; E+=pow(SpGainH[j],2); } VolLx[i]=(P*cos(Rad)*MaxX/5)+MaxX; VolLy[i]=(P*sin(Rad)*MaxY/5)+MaxY; VolHx[i]=(P2*cos(Rad)*MaxX/5)+MaxX; VolHy[i]=(P2*sin(Rad)*MaxY/5)+MaxY; if(i==0) { LFVol = P/NoOfSpeakers; HFVol = P2/NoOfSpeakers; } for(int j=0;j<NoOfSpeakers;j++) { VecLowX+=SpGain[j]*cos(SpeakPos[j]); VecLowY+=SpGain[j]*sin(SpeakPos[j]); VecHighX+=pow(SpGainH[j],2)*cos(SpeakPos[j]); VecHighY+=pow(SpGainH[j],2)*sin(SpeakPos[j]); } if(P && E) { - 283 - Appendix VecLowX/=P; VecLowY/=P; VecHighX/=E; VecHighY/=E; } VFitnessL+=(1-((LFVol*NoOfSpeakers)/P))* (1-((LFVol*NoOfSpeakers)/P));//*((LFVol*NoOfSpeakers)-P); if(P2) VFitnessH+=(1-((HFVol*NoOfSpeakers)/P2))* (1-((HFVol*NoOfSpeakers)/P2));//*((HFVol*NoOfSpeakers)-P2); MFitnessL+=pow(1sqrt((VecLowX*VecLowX)+(VecLowY*VecLowY)),2); MFitnessH+=pow(1sqrt((VecHighX*VecHighX)+(VecHighY*VecHighY)),2); Ang=Rad-atan2(VecLowY,VecLowX); if(Ang>M_PI) Ang-=(2*M_PI); if(Ang<-M_PI) Ang+=(2*M_PI); AFitnessL+=(Ang)*(Ang); if(VecHighY || VecHighX) Ang=Rad-atan2(VecHighY,VecHighX); if(Ang>M_PI) Ang-=(2*M_PI); if(Ang<-M_PI) Ang+=(2*M_PI); AFitnessH+=Ang*Ang; VecLowX*=MaxX; VecLowY*=MaxY; VecHighX*=MaxX; VecHighY*=MaxY; VecLowX+=MaxX; VecLowY+=MaxY; VecHighX+=MaxX; VecHighY+=MaxY; if(CheckBox1->Checked) { Bitmap->Canvas->Pen->Color = clRed; Bitmap->Canvas->Ellipse(VecLowX-2, VecLowY-2,VecLowX+2,VecLowY+2); } if(CheckBox2->Checked) { Bitmap->Canvas->Pen->Color = clGreen; Bitmap->Canvas->Ellipse(VecHighX-2, VecHighY-2,VecHighX+2,VecHighY+2); } if(i==0||i==11||i==22||i==45||i==90||i==135||i==180) { Bitmap->Canvas->Pen->Color = clBlack; Bitmap->Canvas->MoveTo(MaxX,MaxY); Bitmap->Canvas->LineTo((XSig+1)*MaxX, (YSig+1)*MaxY); if(CheckBox1->Checked) { Bitmap->Canvas->Pen->Color = clRed; Bitmap->Canvas->MoveTo(MaxX,MaxY); Bitmap->Canvas->LineTo(VecLowX, VecLowY); } if(CheckBox2->Checked) { Bitmap->Canvas->Pen->Color = clGreen; Bitmap->Canvas->MoveTo(MaxX,MaxY); Bitmap->Canvas->LineTo(VecHighX, - 284 - Appendix VecHighY); } } } if(CheckBox3->Checked) { int Div=5; Bitmap->Canvas->Pen->Color=clRed; Bitmap->Canvas->MoveTo((int)VolLx[359], (int)VolLy[359]); for(int a=0;a<360;a++) { Bitmap->Canvas->LineTo((int)VolLx[a], (int)VolLy[a]); } Bitmap->Canvas->MoveTo( 
(int)((VolLx[359]-MaxX)/Div)+MaxX, (int)((VolLy[359]-MaxY)/Div)+MaxY); for(int a=0;a<360;a++) { Bitmap->Canvas->LineTo( (int)((VolLx[a]-MaxX)/Div)+MaxX, (int)((VolLy[a]-MaxY)/Div)+MaxY); } Bitmap->Canvas->Pen->Color=clGreen; Bitmap->Canvas->MoveTo((int)VolHx[359], (int)VolHy[359]); for(int a=0;a<360;a++) { Bitmap->Canvas->LineTo((int)VolHx[a], (int)VolHy[a]); } } VFitnessL=sqrt(VFitnessL/360.0f); VFitnessH=sqrt(VFitnessH/360.0f); AFitnessL=sqrt(AFitnessL/360.0f); AFitnessH=sqrt(AFitnessH/360.0f); MFitnessL=sqrt(MFitnessL/360.0f); MFitnessH=sqrt(MFitnessH/360.0f); OFitnessL=VFitnessL + AFitnessL + MFitnessL; OFitnessH=VFitnessH + AFitnessH + MFitnessH; a = Bevel1->Left + 2; b = Bevel1->Top + 2; c = Bevel1->Width + a -2; d = Bevel1->Height + b -2; BitBlt(Form1->Canvas->Handle,a,b,c,d, Bitmap->Canvas->Handle,0,0,SRCCOPY); MFitL->Text=FloatToStrF(MFitnessL,ffFixed,5,5); MFitH->Text=FloatToStrF(MFitnessH,ffFixed,5,5); AFitL->Text=FloatToStrF(AFitnessL,ffFixed,5,5); AFitL2->Text=FloatToStrF(AFitnessL,ffFixed,5,5); AFitH->Text=FloatToStrF(AFitnessH,ffFixed,5,5); VFitL->Text=FloatToStrF(VFitnessL,ffFixed,5,5); VFitH->Text=FloatToStrF(VFitnessH,ffFixed,5,5); OFitL->Text=FloatToStrF(OFitnessL,ffFixed,5,5); OFitH->Text=FloatToStrF(OFitnessH,ffFixed,5,5); LFEdit->Text=FloatToStrF(LFVol,ffFixed,3,3); HFEdit->Text=FloatToStrF(HFVol,ffFixed,3,3); LevelMeter1->MeterReading=(int)(LFVol*75); LevelMeter2->MeterReading=(int)(HFVol*75); } //------------------------------------------------------------------- - 285 - Appendix void TForm1::RPaint() { long a,b,c,d; int skip = 9; Bitmap2->Canvas->Pen->Style = psDot; Bitmap2->Canvas->Pen->Color = clBlack; Bitmap2->Canvas->Brush->Style = bsSolid; Bitmap2->Canvas->Brush->Color = clWhite; Bitmap2->Canvas->Rectangle(0,0, Bitmap2->Width,Bitmap2->Height); for(int i=0;i<360;i+=skip) { if(RadioGroup1->ItemIndex==0) { Rep1[i] = 0.5 * (0.7071 * WGain[0] + cos(Deg2Rad(i))*XGain[0]); Rep2[i] = 0.5 * (0.7071 * WGain[1] + cos(Deg2Rad(i))*XGain[1] + sin(Deg2Rad(i))*YGain[1]); Rep3[i] = 0.5 * (0.7071 * WGain[2] + cos(Deg2Rad(i))*XGain[2] + sin(Deg2Rad(i))*YGain[2]); Rep4[i] = 0.5 * (0.7071 * WGain[2] + cos(Deg2Rad(i))*XGain[2] - sin(Deg2Rad(i))*YGain[2]); Rep5[i] = 0.5 * (0.7071 * WGain[1] + cos(Deg2Rad(i))*XGain[1] - sin(Deg2Rad(i))*YGain[1]); Rep1[i]<0?Rep1[i]=-Rep1[i]:Rep1[i]=Rep1[i]; Rep2[i]<0?Rep2[i]=-Rep2[i]:Rep2[i]=Rep2[i]; Rep3[i]<0?Rep3[i]=-Rep3[i]:Rep3[i]=Rep3[i]; Rep4[i]<0?Rep4[i]=-Rep4[i]:Rep4[i]=Rep4[i]; Rep5[i]<0?Rep5[i]=-Rep5[i]:Rep5[i]=Rep5[i]; } else { Rep1[i] = 0.5 * (0.7071 * WGainH[0] + cos(Deg2Rad(i))*XGainH[0]); Rep2[i] = 0.5 * (0.7071 * WGainH[1] + cos(Deg2Rad(i))*XGainH[1] + sin(Deg2Rad(i))*YGainH[1]); Rep3[i] = 0.5 * (0.7071 * WGainH[2] + cos(Deg2Rad(i))*XGainH[2] + sin(Deg2Rad(i))*YGainH[2]); Rep4[i] = 0.5 * (0.7071 * WGainH[2] + cos(Deg2Rad(i))*XGainH[2] - sin(Deg2Rad(i))*YGainH[2]); Rep5[i] = 0.5 * (0.7071 * WGainH[1] + cos(Deg2Rad(i))*XGainH[1] - sin(Deg2Rad(i))*YGainH[1]); Rep1[i]<0?Rep1[i]=-Rep1[i]:Rep1[i]=Rep1[i]; Rep2[i]<0?Rep2[i]=-Rep2[i]:Rep2[i]=Rep2[i]; Rep3[i]<0?Rep3[i]=-Rep3[i]:Rep3[i]=Rep3[i]; Rep4[i]<0?Rep4[i]=-Rep4[i]:Rep4[i]=Rep4[i]; Rep5[i]<0?Rep5[i]=-Rep5[i]:Rep5[i]=Rep5[i]; } } Bitmap2->Canvas->Pen->Width = 2; Bitmap2->Canvas->Pen->Style=psSolid; Bitmap2->Canvas->Pen->Color=clBlack; PlotPolar(Bitmap2,Rep1,skip); Bitmap2->Canvas->Pen->Color=clRed; PlotPolar(Bitmap2,Rep2,skip); Bitmap2->Canvas->Pen->Color=clBlue; PlotPolar(Bitmap2,Rep3,skip); Bitmap2->Canvas->Pen->Color=clPurple; PlotPolar(Bitmap2,Rep4,skip); 
Bitmap2->Canvas->Pen->Color=clTeal; PlotPolar(Bitmap2,Rep5,skip); a = Bevel2->Left + 2; - 286 - Appendix b = Bevel2->Top + 2; c = Bevel2->Width + a -2; d = Bevel2->Height + b -2; BitBlt(Form1->Canvas->Handle,a,b,c,d, Bitmap2->Canvas->Handle,0,0,SRCCOPY); } //------------------------------------------------------------------void __fastcall TForm1::Button1Click(TObject *Sender) { GPaint(); RPaint(); } //------------------------------------------------------------------void __fastcall TForm1::FormPaint(TObject *Sender) { GPaint(); RPaint(); } //------------------------------------------------------------------void __fastcall TForm1::VolSlider1Change(TObject *Sender) { if(RadioGroup1->ItemIndex==0) { OGainL = (double)VolSlider10->Position*2/SliderLength; WGain[0] = (double)OGainL*VolSlider1->Position/SliderLength; WGain[1] = (double)OGainL*VolSlider3->Position/SliderLength; WGain[2] = (double)OGainL*VolSlider6->Position/SliderLength; XGain[0] = (double)OGainL*VolSlider2->Position/SliderLength; XGain[1] = (double)OGainL*VolSlider4->Position/SliderLength; XGain[2] = -(double)OGainL*VolSlider7->Position/SliderLength; YGain[1] = (double)OGainL*VolSlider5->Position/SliderLength; YGain[2] = (double)OGainL*VolSlider8->Position/SliderLength; LamL = (double)VolSlider9->Position*2/SliderLength; if(LamL) ILamL=1/LamL; } else if(RadioGroup1->ItemIndex==1) { WGainH[0] = (double)OGainH*VolSlider1->Position/SliderLength; WGainH[1] = (double)OGainH*VolSlider3->Position/SliderLength; WGainH[2] = (double)OGainH*VolSlider6->Position/SliderLength; XGainH[0] = (double)OGainH*VolSlider2->Position/SliderLength; XGainH[1] = (double)OGainH*VolSlider4->Position/SliderLength; XGainH[2] = -(double)OGainH*VolSlider7->Position/SliderLength; - 287 - Appendix YGainH[1] = (double)OGainH*VolSlider5->Position/SliderLength; YGainH[2] = (double)OGainH*VolSlider8->Position/SliderLength; LamH = (double)VolSlider9->Position*2/SliderLength; if(LamH) ILamH=1/LamH; OGainH = (double)VolSlider10->Position*2/SliderLength; } else if(RadioGroup1->ItemIndex==2) { OGainH = OGainL = (double)VolSlider10->Position*2/SliderLength; WGainH[0] = WGain[0] = (double)OGainL*VolSlider1->Position/SliderLength; WGainH[1] = WGain[1] = (double)OGainL*VolSlider3->Position/SliderLength; WGainH[2] = WGain[2] = (double)OGainL*VolSlider6->Position/SliderLength; XGainH[0] = XGain[0] = (double)OGainL*VolSlider2->Position/SliderLength; XGainH[1] = XGain[1] = (double)OGainL*VolSlider4->Position/SliderLength; XGainH[2] = XGain[2] = (double)OGainL*VolSlider7->Position/SliderLength; YGainH[1] = YGain[1] = (double)OGainL*VolSlider5->Position/SliderLength; YGainH[2] = YGain[2] = (double)OGainL*VolSlider8->Position/SliderLength; LamH = LamL = (double)VolSlider9->Position*2/SliderLength; if(LamL) ILamL=1/LamL; if(LamH) ILamH=1/LamH; } UpdateEdits(); GPaint(); RPaint(); } //------------------------------------------------------------------void TForm1::UpdateEdits() { if(RadioGroup1->ItemIndex==0) { Edit1->Text=FloatToStrF(WGain[0], ffFixed,3,3); Edit3->Text=FloatToStrF(WGain[1], ffFixed,3,3); Edit6->Text=FloatToStrF(WGain[2], ffFixed,3,3); Edit2->Text=FloatToStrF(XGain[0], ffFixed,3,3); Edit4->Text=FloatToStrF(XGain[1], ffFixed,3,3); Edit7->Text=FloatToStrF(XGain[2], ffFixed,3,3); Edit5->Text=FloatToStrF(YGain[1], ffFixed,3,3); - 288 - Appendix Edit8->Text=FloatToStrF(YGain[2], ffFixed,3,3); Edit9->Text=FloatToStrF(LamL,ffFixed,3,3); Edit10->Text=FloatToStrF(OGainL,ffFixed,3,3); } else if(RadioGroup1->ItemIndex==1) { Edit1->Text=FloatToStrF(WGainH[0], ffFixed,3,3); 
Edit3->Text=FloatToStrF(WGainH[1], ffFixed,3,3); Edit6->Text=FloatToStrF(WGainH[2], ffFixed,3,3); Edit2->Text=FloatToStrF(XGainH[0], ffFixed,3,3); Edit4->Text=FloatToStrF(XGainH[1], ffFixed,3,3); Edit7->Text=FloatToStrF(XGainH[2], ffFixed,3,3); Edit5->Text=FloatToStrF(YGainH[1], ffFixed,3,3); Edit8->Text=FloatToStrF(YGainH[2], ffFixed,3,3); Edit9->Text=FloatToStrF(LamH,ffFixed,3,3); Edit10->Text=FloatToStrF(OGainH,ffFixed,3,3); } } //------------------------------------------------------------------void TForm1::UpdateNewEdits() { if(RadioGroup1->ItemIndex==0) { GEdit1->Text=FloatToStrF( (float)GainSlider1->Position/100,ffFixed,3,3); GEdit2->Text=FloatToStrF( (float)GainSlider2->Position/100,ffFixed,3,3); GEdit3->Text=FloatToStrF( (float)GainSlider3->Position/100,ffFixed,3,3); DEdit1->Text=FloatToStrF( (float)DSlider1->Position/100,ffFixed,3,3); DEdit2->Text=FloatToStrF( (float)DSlider2->Position/100,ffFixed,3,3); DEdit3->Text=FloatToStrF( (float)DSlider3->Position/100,ffFixed,3,3); AEdit1->Text=IntToStr( (int)ASlider1->DotPosition); AEdit2->Text=IntToStr( (int)ASlider2->DotPosition); AEdit3->Text=IntToStr( (int)ASlider3->DotPosition); } else if(RadioGroup1->ItemIndex==1) { GEdit1->Text=FloatToStrF( (float)GainSlider1->Position/100,ffFixed,3,3); GEdit2->Text=FloatToStrF( (float)GainSlider2->Position/100,ffFixed,3,3); GEdit3->Text=FloatToStrF( (float)GainSlider3->Position/100,ffFixed,3,3); DEdit1->Text=FloatToStrF( (float)DSlider1->Position/100,ffFixed,3,3); DEdit2->Text=FloatToStrF( - 289 - Appendix (float)DSlider2->Position/100,ffFixed,3,3); DEdit3->Text=FloatToStrF( (float)DSlider3->Position/100,ffFixed,3,3); AEdit1->Text=FloatToStrF( (float)ASlider1->DotPosition/100,ffFixed,3,3); AEdit2->Text=FloatToStrF( (float)ASlider2->DotPosition/100,ffFixed,3,3); AEdit3->Text=FloatToStrF( (float)ASlider3->DotPosition/100,ffFixed,3,3); } } //------------------------------------------------------------------void __fastcall TForm1::ListBox1Click(TObject *Sender) { if(ListBox1->ItemIndex==0) { VolSlider1->Position = 0.34190f*SliderLength; VolSlider3->Position = 0.26813f*SliderLength; VolSlider6->Position = 0.56092f*SliderLength; VolSlider2->Position = 0.23322f*SliderLength; VolSlider4->Position = 0.38191f*SliderLength; VolSlider7->Position = 0.49852f*SliderLength; VolSlider5->Position = 0.50527f*SliderLength; VolSlider8->Position = 0.45666f*SliderLength; VolSlider9->Position = 1*SliderLength/2; VolSlider10->Position = 1*SliderLength/2; VolSlider1Change(this); WGainH[0]=0.38324f; WGainH[1]=0.44022f; WGainH[2]=0.78238f; XGainH[0]=0.37228f; XGainH[1]=0.23386f; XGainH[2]=-0.55322f; YGainH[1]=0.54094f; YGainH[2]=0.42374f; LamH=1; ILamH=1/LamH; OGainH=1; } else if(ListBox1->ItemIndex==1) { RadioGroup1->ItemIndex=0; VolSlider1->Position = 0.58*SliderLength; VolSlider3->Position = 0.16*SliderLength; VolSlider6->Position = 1*SliderLength; VolSlider2->Position = 0.47*SliderLength; VolSlider4->Position = 0.53*SliderLength; VolSlider7->Position = 0.77*SliderLength; VolSlider5->Position = 0.55*SliderLength; VolSlider8->Position = 0.83*SliderLength; VolSlider9->Position = 1*SliderLength/2; VolSlider10->Position = 1*SliderLength/2; VolSlider1Change(this); WGainH[0]=0.260; WGainH[1]=0.320; WGainH[2]=1.000; XGainH[0]=0.200; XGainH[1]=0.280; XGainH[2]=-0.64; YGainH[1]=0.480; - 290 - Appendix YGainH[2]=0.340; LamH=1; ILamH=1/LamH; OGainH=1; } else if(ListBox1->ItemIndex==2) { RadioGroup1->ItemIndex=0; VolSlider1->Position = sqrt(2.0f)*SliderLength; VolSlider3->Position = sqrt(2.0f)*SliderLength; VolSlider6->Position = 
sqrt(2.0f)*SliderLength; VolSlider2->Position = cos(SpeakPos[0])*SliderLength; VolSlider4->Position = cos(Deg2Rad(45))*SliderLength; VolSlider7->Position = -cos(Deg2Rad(135)) *SliderLength; VolSlider5->Position = sin(Deg2Rad(45))*SliderLength; VolSlider8->Position = sin(Deg2Rad(135)) *SliderLength; VolSlider9->Position = 1*SliderLength/2; VolSlider10->Position = 1*SliderLength/2; VolSlider1Change(this); WGainH[0]=WGain[0]; WGainH[1]=WGain[1]; WGainH[2]=WGain[2]; XGainH[0]=XGain[0]; XGainH[1]=XGain[1]; XGainH[2]=XGain[2]; YGainH[1]=YGain[1]; YGainH[2]=YGain[2]; LamH=1; ILamH=1/LamH; OGainH=1; } else if(ListBox1->ItemIndex==3) { RadioGroup1->ItemIndex=0; VolSlider1->Position = 0.023*SliderLength; VolSlider3->Position = 0.4232*SliderLength; VolSlider6->Position = 0.9027*SliderLength; VolSlider2->Position = 0.2518*SliderLength; VolSlider4->Position = 0.6014*SliderLength; VolSlider7->Position = 0.7245*SliderLength; VolSlider5->Position = 0.2518*SliderLength; VolSlider8->Position = 0.9062*SliderLength; VolSlider9->Position = 1*SliderLength/2; VolSlider10->Position = 1*SliderLength/2; VolSlider1Change(this); WGainH[0]=0; WGainH[1]=0.6086; WGainH[2]=1.0290; XGainH[0]=0; XGainH[1]=0.4998; XGainH[2]=-0.2058; YGainH[1]=0.3861; YGainH[2]=0.2489; LamH=0.9270; ILamH=1/LamH; OGainH=1; } else if(ListBox1->ItemIndex==4) { RadioGroup1->ItemIndex=0; - 291 - Appendix VolSlider1->Position = 0.26*SliderLength; VolSlider3->Position = 0.34*SliderLength; VolSlider6->Position = 1*SliderLength; VolSlider2->Position = 0.247*SliderLength; VolSlider4->Position = 0.66*SliderLength; VolSlider7->Position = 0.78*SliderLength; VolSlider5->Position = 1*SliderLength; VolSlider8->Position = 0.587*SliderLength; VolSlider9->Position = 1*SliderLength/2; VolSlider10->Position = 1*SliderLength/2; VolSlider1Change(this); WGainH[0]=0.312; WGainH[1]=0.503; WGainH[2]=0.868; XGainH[0]=0.176; XGainH[1]=0.563; XGainH[2]=-0.41; YGainH[1]=0.517; YGainH[2]=0.510; LamH=1.030; ILamH=1/LamH; OGainH=1; } GPaint(); RPaint(); } //------------------------------------------------------------------void __fastcall TForm1::CheckBox1Click(TObject *Sender) { VolSlider1Change(this); } //------------------------------------------------------------------void TForm1::PlotPolar(Graphics::TBitmap *Bmap,double *Radius, int skip) { int t1,t2; t1=(int)(Radius[360-skip]*cos(Deg2Rad(360-skip))*MaxX)+MaxX; t2=(int)(Radius[360-skip]*sin(Deg2Rad(360-skip))*MaxY)+MaxY; Bmap->Canvas->MoveTo(t1,t2); for(int i=0;i<360;i+=skip) { t1=(int)(Radius[i]*cos(Deg2Rad(i))*MaxX)+MaxX; t2=(int)(Radius[i]*sin(Deg2Rad(i))*MaxY)+MaxY; Bmap->Canvas->LineTo(t1,t2); } } //------------------------------------------------------------------void __fastcall TForm1::RadioGroup1Click(TObject *Sender) { if(RadioGroup1->ItemIndex==0) { VolSlider1->Position = (int)(WGain[0]*SliderLength); VolSlider3->Position = (int)(WGain[1]*SliderLength); VolSlider6->Position = (int)(WGain[2]*SliderLength); VolSlider2->Position = (int)(XGain[0]*SliderLength); VolSlider4->Position = (int)(XGain[1]*SliderLength); VolSlider7->Position = (int)(-XGain[2]*SliderLength); VolSlider5->Position = (int)(YGain[1]*SliderLength); VolSlider8->Position = (int)(YGain[2]*SliderLength); VolSlider9->Position = (int)(LamL*SliderLength/2); VolSlider10->Position = (int)(OGainL*SliderLength/2); - 292 - Appendix } else if(RadioGroup1->ItemIndex==1) { VolSlider1->Position = (int)(WGainH[0]*SliderLength); VolSlider3->Position = (int)(WGainH[1]*SliderLength); VolSlider6->Position = (int)(WGainH[2]*SliderLength); VolSlider2->Position = 
(int)(XGainH[0]*SliderLength); VolSlider4->Position = (int)(XGainH[1]*SliderLength); VolSlider7->Position = (int)(-XGainH[2]*SliderLength); VolSlider5->Position = (int)(YGainH[1]*SliderLength); VolSlider8->Position = (int)(YGainH[2]*SliderLength); VolSlider9->Position = (int)(LamH*SliderLength/2); VolSlider10->Position = (int)(OGainH*SliderLength/2); } UpdateEdits(); RPaint(); } //------------------------------------------------------------------void __fastcall TForm1::GainSlider1Change(TObject *Sender) { if(RadioGroup1->ItemIndex==0) { WGain[0] = (double)((double)GainSlider1->Position/100 *(2-(double)DSlider1->Position/100)); WGain[1] = (double)((double)GainSlider2->Position/100 *(2-(double)DSlider2->Position/100)); WGain[2] = (double)((double)GainSlider3->Position/100 *(2-(double)DSlider3->Position/100)); XGain[0] = (double)((double)GainSlider1->Position/100 *((double)DSlider1->Position/100 * cos(Deg2Rad((double)ASlider1->DotPosition)))); XGain[1] = (double)((double)GainSlider2->Position/100 *((double)DSlider2->Position/100 * cos(Deg2Rad((double)ASlider2->DotPosition)))); XGain[2] = (double)((double)GainSlider3->Position/100 *((double)DSlider3->Position/100 * cos(Deg2Rad((double)ASlider3->DotPosition)))); YGain[1] = (double)((double)GainSlider2->Position/100 *((double)DSlider2->Position/100 * sin(Deg2Rad((double)ASlider2->DotPosition)))); YGain[2] = (double)((double)GainSlider3->Position/100 *((double)DSlider3->Position/100 * sin(Deg2Rad((double)ASlider3->DotPosition)))); } else if(RadioGroup1->ItemIndex==1) { WGainH[0] = (double)(GainSlider1->Position/100 *(2-DSlider1->Position/100)); WGainH[1] = (double)(GainSlider2->Position/100 *(2-DSlider2->Position/100)); WGainH[2] = (double)(GainSlider3->Position/100 *(2-DSlider3->Position/100)); XGainH[0] = (double)(GainSlider1->Position/100 *(DSlider1->Position/100 * cos(Deg2Rad((double)ASlider1->DotPosition)))); XGainH[1] = (double)(GainSlider2->Position/100 *(DSlider2->Position/100 * cos(Deg2Rad((double)ASlider1->DotPosition)))); XGainH[2] = (double)(GainSlider3->Position/100 *(DSlider3->Position/100 * cos(Deg2Rad((double)ASlider1->DotPosition)))); - 293 - Appendix YGainH[1] = (double)(GainSlider2->Position/100 *(DSlider2->Position/100 * sin(Deg2Rad((double)ASlider1->DotPosition)))); YGainH[2] = (double)(GainSlider3->Position/100 *(DSlider3->Position/100 * sin(Deg2Rad((double)ASlider1->DotPosition)))); } UpdateNewEdits(); GPaint(); RPaint(); } //------------------------------------------------------------------void __fastcall TForm1::RadioGroup2Click(TObject *Sender) { if(RadioGroup2->ItemIndex==0) { Panel1->Show(); Panel2->Hide(); } else if(RadioGroup2->ItemIndex==1) { Panel2->Show(); Panel1->Hide(); } } //------------------------------------------------------------------void __fastcall TForm1::Button2Click(TObject *Sender) { RadioGroup1->ItemIndex=0; double GainDif=HFVol/LFVol; VolSlider1->Position*=GainDif; VolSlider2->Position*=GainDif; VolSlider3->Position*=GainDif; VolSlider4->Position*=GainDif; VolSlider5->Position*=GainDif; VolSlider6->Position*=GainDif; VolSlider7->Position*=GainDif; VolSlider8->Position*=GainDif; VolSlider1Change(this); RPaint(); GPaint(); } //------------------------------------------------------------------void __fastcall TForm1::Button3Click(TObject *Sender) { RadioGroup1->ItemIndex=1; double GainDif=LFVol/HFVol; VolSlider1->Position*=GainDif; VolSlider2->Position*=GainDif; VolSlider3->Position*=GainDif; VolSlider4->Position*=GainDif; VolSlider5->Position*=GainDif; VolSlider6->Position*=GainDif; 
VolSlider7->Position*=GainDif; VolSlider8->Position*=GainDif; VolSlider1Change(this); RPaint(); GPaint(); } //------------------------------------------------------------------void __fastcall TForm1::Button4Click(TObject *Sender) - 294 - Appendix { Button4->Enabled=false; RadioGroup1->ItemIndex=0; Iterations = StrToInt(Edit12->Text); int ItCount = Iterations; MaxTabu = StrToInt(Edit13->Text); StepSize = StrToFloat(Edit14->Text); TempArray[0]=WGain[0]; TempArray[2]=WGain[1]; TempArray[5]=WGain[2]; TempArray[1]=XGain[0]; TempArray[3]=XGain[1]; TempArray[6]=-XGain[2]; TempArray[4]=YGain[1]; TempArray[7]=YGain[2]; TempArray[8]=LamL; TSearch = new Tabu(TempArray,SpeakPos,5); TSearch->StepSize = StepSize; TSearch->MMax = MaxTabu; for(int a=0;a<Iterations;a++) { TSearch->StartTabu(); WGain[0]=TSearch->CBest[0]; XGain[0]=TSearch->CBest[1]; WGain[1]=TSearch->CBest[2]; XGain[1]=TSearch->CBest[3]; YGain[1]=TSearch->CBest[4]; WGain[2]=TSearch->CBest[5]; XGain[2]=-TSearch->CBest[6]; YGain[2]=TSearch->CBest[7]; LamL=TSearch->CBest[8]; TEdit1->Text=FloatToStrF( TSearch->CBest[0],ffFixed,3,3); TEdit2->Text=FloatToStrF( TSearch->CBest[1],ffFixed,3,3); TEdit3->Text=FloatToStrF( TSearch->CBest[2],ffFixed,3,3); TEdit4->Text=FloatToStrF( TSearch->CBest[3],ffFixed,3,3); TEdit5->Text=FloatToStrF( TSearch->CBest[4],ffFixed,3,3); TEdit6->Text=FloatToStrF( TSearch->CBest[5],ffFixed,3,3); TEdit7->Text=FloatToStrF( -TSearch->CBest[6],ffFixed,3,3); TEdit8->Text=FloatToStrF( TSearch->CBest[7],ffFixed,3,3); TEdit9->Text=FloatToStrF( TSearch->CBest[8],ffFixed,3,3); TEditRes->Text=FloatToStrF( TSearch->ResBestLocal,ffFixed,5,5); Edit11->Text=FloatToStrF( TSearch->ResBestOverall,ffFixed,5,5); RadioGroup1Click(this); VolSlider1Change(this); Edit12->Text = IntToStr(--ItCount); Application->ProcessMessages(); } WGain[0]=TSearch->OBest[0]; XGain[0]=TSearch->OBest[1]; WGain[1]=TSearch->OBest[2]; XGain[1]=TSearch->OBest[3]; - 295 - Appendix YGain[1]=TSearch->OBest[4]; WGain[2]=TSearch->OBest[5]; XGain[2]=-TSearch->OBest[6]; YGain[2]=TSearch->OBest[7]; RadioGroup1Click(this); VolSlider1Change(this); Application->ProcessMessages(); delete TSearch; Button4->Enabled=true; Edit12->Text = IntToStr(Iterations); } //------------------------------------------------------------------void __fastcall TForm1::Button5Click(TObject *Sender) { Button5->Enabled=false; RadioGroup1->ItemIndex=1; Iterations = StrToInt(Edit12->Text); int ItCount = Iterations; MaxTabu = StrToInt(Edit13->Text); StepSize = StrToFloat(Edit14->Text); TempArray[0]=WGainH[0]; TempArray[2]=WGainH[1]; TempArray[5]=WGainH[2]; TempArray[1]=XGainH[0]; TempArray[3]=XGainH[1]; TempArray[6]=-XGainH[2]; TempArray[4]=YGainH[1]; TempArray[7]=YGainH[2]; TempArray[8]=LamH; TSearchH = new HighTabu(TempArray,SpeakPos,5); TSearchH->StepSize = StepSize; TSearchH->MMax = MaxTabu; for(int a=0;a<Iterations;a++) { TSearchH->StartTabu(); WGainH[0]=TSearchH->CBest[0]; XGainH[0]=TSearchH->CBest[1]; WGainH[1]=TSearchH->CBest[2]; XGainH[1]=TSearchH->CBest[3]; YGainH[1]=TSearchH->CBest[4]; WGainH[2]=TSearchH->CBest[5]; XGainH[2]=-TSearchH->CBest[6]; YGainH[2]=TSearchH->CBest[7]; LamH=TSearchH->CBest[8]; TEdit1->Text=FloatToStrF( TSearchH->CBest[0],ffFixed,3,3); TEdit2->Text=FloatToStrF( TSearchH->CBest[1],ffFixed,3,3); TEdit3->Text=FloatToStrF( TSearchH->CBest[2],ffFixed,3,3); TEdit4->Text=FloatToStrF( TSearchH->CBest[3],ffFixed,3,3); TEdit5->Text=FloatToStrF( TSearchH->CBest[4],ffFixed,3,3); TEdit6->Text=FloatToStrF( TSearchH->CBest[5],ffFixed,3,3); TEdit7->Text=FloatToStrF( 
-TSearchH->CBest[6],ffFixed,3,3); TEdit8->Text=FloatToStrF( TSearchH->CBest[7],ffFixed,3,3); TEdit9->Text=FloatToStrF( TSearchH->CBest[8],ffFixed,3,3); - 296 - Appendix TEditRes->Text=FloatToStrF( TSearchH->ResBestLocal,ffFixed,5,5); Edit11->Text=FloatToStrF( TSearchH->ResBestOverall,ffFixed,5,5); RadioGroup1Click(this); VolSlider1Change(this); Edit12->Text = IntToStr(--ItCount); Application->ProcessMessages(); } WGainH[0]=TSearchH->OBest[0]; XGainH[0]=TSearchH->OBest[1]; WGainH[1]=TSearchH->OBest[2]; XGainH[1]=TSearchH->OBest[3]; YGainH[1]=TSearchH->OBest[4]; WGainH[2]=TSearchH->OBest[5]; XGainH[2]=-TSearchH->OBest[6]; YGainH[2]=TSearchH->OBest[7]; RadioGroup1Click(this); VolSlider1Change(this); Application->ProcessMessages(); delete TSearchH; Button5->Enabled=true; Edit12->Text = IntToStr(Iterations); } //------------------------------------------------------------------#define Write(a) fwrite((FloatToStrF(a,ffFixed,5,5)).c_str(),1,5,File) #define WriteTxt(a) fwrite(a,1,sizeof(a)-1,File) #define NewLine fwrite("\n",1,1,File) void __fastcall TForm1::SaveButtonClick(TObject *Sender) { FILE *File; if(SaveDialog1->Execute()) { File = fopen(SaveDialog1->FileName.c_str(),"w"); WriteTxt("WLow-C\t");Write(WGain[0]);NewLine; WriteTxt("XLow-C\t");Write(XGain[0]);NewLine; WriteTxt("WLow-F\t");Write(WGain[1]);NewLine; WriteTxt("XLow-F\t");Write(XGain[1]);NewLine; WriteTxt("YLow-F\t");Write(YGain[1]);NewLine; WriteTxt("WLow-R\t");Write(WGain[2]);NewLine; WriteTxt("XLow-R\t");Write(XGain[2]);NewLine; WriteTxt("YLow-R\t");Write(YGain[2]);NewLine; NewLine; WriteTxt("WHigh-C\t");Write(WGainH[0]);NewLine; WriteTxt("XHigh-C\t");Write(XGainH[0]);NewLine; WriteTxt("WHigh-F\t");Write(WGainH[1]);NewLine; WriteTxt("XHigh-F\t");Write(XGainH[1]);NewLine; WriteTxt("YHigh-F\t");Write(YGainH[1]);NewLine; WriteTxt("WHigh-R\t");Write(WGainH[2]);NewLine; WriteTxt("XHigh-R\t");Write(XGainH[2]);NewLine; WriteTxt("YHigh-R\t");Write(YGainH[2]);NewLine; fclose(File); } } //------------------------------------------------------------------- - 297 - Appendix //------------------------------------------------------------------//-------------------------MAIN.H-----------------------------------//------------------------------------------------------------------#ifndef MainH #define MainH //------------------------------------------------------------------#include <Classes.hpp> #include <Controls.hpp> #include <StdCtrls.hpp> #include <Forms.hpp> #include <ExtCtrls.hpp> #include "VolSlider.h" #include "RotorSlider.h" #include "LevelMeter.h" #include "Tabu.h" #include "HighTabu.h" #include <Dialogs.hpp> //------------------------------------------------------------------class TForm1 : public TForm { // IDE-managed Components __published: TBevel *Bevel1; TButton *Button1; TListBox *ListBox1; TBevel *Bevel2; TRadioGroup *RadioGroup1; TGroupBox *GroupBox1; TCheckBox *CheckBox2; TCheckBox *CheckBox1; TListBox *ListBox2; TPanel *Panel1; TVolSlider *VolSlider1; TVolSlider *VolSlider2; TVolSlider *VolSlider3; TVolSlider *VolSlider4; TVolSlider *VolSlider5; TVolSlider *VolSlider6; TVolSlider *VolSlider7; TVolSlider *VolSlider8; TEdit *Edit1; TEdit *Edit2; TEdit *Edit3; TEdit *Edit4; TEdit *Edit5; TEdit *Edit6; TEdit *Edit7; TEdit *Edit8; TLabel *CW; TLabel *CX; TLabel *Label2; TLabel *Label3; TLabel *Label4; TLabel *Label5; TLabel *Label6; TLabel *Label7; TRadioGroup *RadioGroup2; TPanel *Panel2; TVolSlider *GainSlider1; TRotorSlider *ASlider1; TVolSlider *DSlider1; TEdit *GEdit1; TEdit *AEdit1; - 298 - Appendix TEdit *DEdit1; TLabel 
*Label1; TLabel *Label8; TVolSlider *GainSlider2; TEdit *GEdit2; TEdit *AEdit2; TRotorSlider *ASlider2; TVolSlider *DSlider2; TEdit *DEdit2; TLabel *Label9; TVolSlider *GainSlider3; TEdit *GEdit3; TEdit *AEdit3; TRotorSlider *ASlider3; TVolSlider *DSlider3; TEdit *DEdit3; TLevelMeter *LevelMeter1; TLevelMeter *LevelMeter2; TEdit *LFEdit; TEdit *HFEdit; TLabel *Label10; TLabel *Label11; TButton *Button2; TButton *Button3; TCheckBox *CheckBox3; TVolSlider *VolSlider9; TVolSlider *VolSlider10; TLabel *Label12; TLabel *Label13; TEdit *Edit9; TEdit *Edit10; TLabel *Label14; TLabel *Label15; TLabel *Label16; TEdit *MFitL; TEdit *AFitL; TEdit *VFitL; TLabel *Label17; TLabel *Label18; TLabel *Label19; TEdit *MFitH; TEdit *AFitH; TEdit *VFitH; TLabel *Label20; TLabel *Label21; TEdit *OFitL; TEdit *OFitH; TLabel *Label22; TLabel *Label23; TPanel *Panel3; TLabel *Label24; TEdit *TEdit1; TEdit *TEdit2; TEdit *TEdit3; TEdit *TEdit4; TEdit *TEdit5; TEdit *TEdit6; TEdit *TEdit7; TEdit *TEdit8; TEdit *TEdit9; TLabel *Label25; TEdit *TEditRes; - 299 - Appendix TButton *Button4; TEdit *Edit11; TLabel *Label26; TLabel *Label27; TButton *Button5; TEdit *Edit12; TLabel *Label28; TLabel *Label29; TEdit *Edit13; TLabel *Label30; TEdit *Edit14; TButton *SaveButton; TSaveDialog *SaveDialog1; TEdit *AFitL2; TLabel *Label31; void __fastcall Button1Click(TObject *Sender); void __fastcall FormPaint(TObject *Sender); void __fastcall VolSlider1Change(TObject *Sender); void __fastcall ListBox1Click(TObject *Sender); void __fastcall CheckBox1Click(TObject *Sender); void __fastcall RadioGroup1Click(TObject *Sender); void __fastcall GainSlider1Change(TObject *Sender); void __fastcall RadioGroup2Click(TObject *Sender); void __fastcall Button2Click(TObject *Sender); void __fastcall Button3Click(TObject *Sender); void __fastcall Button4Click(TObject *Sender); void __fastcall Button5Click(TObject *Sender); void __fastcall SaveButtonClick(TObject *Sender); private: // User declarations bool InUse; long MaxX, MaxY; Graphics::TBitmap *Bitmap,*Bitmap2; int NoOfSpeakers,SliderLength,Iterations; double SpeakPos[8],SpGain[8],SpGainH[8],WSig,XSig,YSig, WGain[3],XGain[3],YGain[3],WGainH[3],XGainH[3], YGainH[3],WSigH,WSigL,XSigH,XSigL,YSigH,YSigL; double P,P2,E,VecLowX,VecLowY,VecHighX,VecHighY, Rep1[360],Rep2[360],Rep3[360],Rep4[360],Rep5[360], LFVol,HFVol,VolLx[360],VolHx[360],VolLy[360], VolHy[360],LamL,ILamL,LamH,ILamH,OGainL,OGainH; double Deg2Rad(double Deg); void PlotPolar(Graphics::TBitmap *Bitmap,double *Radius, int skip); void UpdateEdits(); void UpdateNewEdits(); double TempArray[9],StepSize,MaxTabu; // User declarations public: __fastcall TForm1(TComponent* Owner); void GPaint(); void RPaint(); Tabu *TSearch; HighTabu *TSearchH; }; //------------------------------------------------------------------extern PACKAGE TForm1 *Form1; //------------------------------------------------------------------#endif - 300 - Appendix //------------------------------------------------------------------//---------------------------TABU.H---------------------------------//------------------------------------------------------------------#ifndef TabuH #define TabuH //------------------------------------------------------------------#include <math.h> class Tabu { private: double Current[32],SPosition[32],SGain[32],Vx[512],Vy[512], V2x[512],V2y[512]; double ResCurrent; double MFit,VFit,AFit,AFit2,P,VolScale,E; double NAngles,AStep; double W,X,Y,WSig,XSig,YSig; int NSpeakers,ResControl,CDir[32],ResCDir; public: double 
CBest[32],OBest[32],ResBestLocal,ResBestOverall; double StepSize; int MUp[32],MDown[32],MMax; Tabu(double *Array, double *SPos, int NPoints); ~Tabu(); void StartTabu(); double CalcArrays(); }; //------------------------------------------------------------------Tabu::Tabu(double *Array, double *SPos, int NPoints) { NAngles=90; StepSize=0.01; AStep=M_PI*2/NAngles; NSpeakers=NPoints; MMax=99999999; for(int a=0;a<(NPoints*2)-1;a++) { //Copy initial Startup array Current[a]=CBest[a]=OBest[a]=Array[a]; SPosition[a]=SPos[a]; MUp[a]=MDown[a]=0; } W=1/(sqrt(2.0f)); ResBestOverall=CalcArrays(); } //------------------------------------------------------------------Tabu::~Tabu() { } //------------------------------------------------------------------void Tabu::StartTabu() { double CMax; ResBestLocal=999999; for(int control=0;control<(NSpeakers*2)-2;control++) { if(control==(NSpeakers*2)-2) CMax=2.0f; else CMax=1.0f; for(int test=1;test<3;test++) - 301 - Appendix { if(!MUp[control] && test==1) { if(Current[control]>=CMax) { Current[control]=CMax; MUp[control]+=5; CDir[control]=0; } else { Current[control]+=StepSize; CDir[control]=1; } } else if(test==1) { CDir[control]=0; } if(!MDown[control] && test==2) { if(Current[control]<=0) { Current[control]=0; MDown[control]+=5; CDir[control]=0; } else { Current[control]-=StepSize; CDir[control]=-1; } } else if(test==2) { CDir[control]=0; } if(MUp[control]&&MDown[control]) { CDir[control]=0; } if(CDir[control]) { ResCurrent=CalcArrays(); } else { ResCurrent=999999; } if(ResCurrent<ResBestLocal) { ResCDir=CDir[control]; ResControl=control; for(int a=0;a<(NSpeakers*2)-1;a++) CBest[a]=Current[a]; ResBestLocal=ResCurrent; } Current[control]-=StepSize - 302 - Appendix *((double)CDir[control]); } if(MDown[control]>MMax) MDown[control]=MMax; if(MUp[control]>MMax) MUp[control]=MMax; if(MDown[control]) MDown[control]--; if(MUp[control]) MUp[control]--; } if(ResCDir==1) MDown[ResControl]+=5; if(ResCDir==-1) MUp[ResControl]+=5; for(int a=0;a<(NSpeakers*2)-1;a++) { Current[a]=CBest[a]; } if(ResBestLocal<ResBestOverall) { ResBestOverall=ResBestLocal; for(int a=0;a<(NSpeakers*2)-1;a++) OBest[a]=CBest[a]; } } //------------------------------------------------------------------double Tabu::CalcArrays() { if(!NSpeakers) Application->MessageBox("Stop1",NULL,NULL); double Ll=Current[8]; double w1=Current[0],x1=Current[1],y1=0; double w2=Current[2],x2=Current[3],y2=Current[4]; double w3=Current[5],x3=Current[6],y3=Current[7]; double iLl=1/Ll,P; int i=0; MFit=VFit=AFit=E=0; for(double Ang=0;Ang<2*M_PI;Ang+=AStep) { X=cos(Ang); Y=sin(Ang); WSig=(0.5*(Ll+iLl)*W) + ((1/sqrt(8))*(Ll-iLl)*X); XSig=(0.5*(Ll+iLl)*X) + ((1/sqrt(2))*(Ll-iLl)*W); YSig=Y; SGain[0]=(w1*WSig) SGain[1]=(w2*WSig) SGain[2]=(w3*WSig) SGain[3]=(w3*WSig) SGain[4]=(w2*WSig) + + + (x1*XSig) (x2*XSig) (x3*XSig) (x3*XSig) (x2*XSig) + + + - (y1*YSig); (y2*YSig); (y3*YSig); (y3*YSig); (y2*YSig); P=0;Vx[i]=0;Vy[i]=0;E=0;V2x[i]=0;V2y[i]=0; if(!NSpeakers) Application->MessageBox("Stop2",NULL,NULL); for(int a=0;a<NSpeakers;a++) { P+=SGain[a]; E+=SGain[a]*SGain[a]; } if(i==0) VolScale=P; for(int a=0;a<NSpeakers;a++) { Vx[i]+=SGain[a]*cos(SPosition[a]); Vy[i]+=SGain[a]*sin(SPosition[a]); V2x[i]+=SGain[a]*SGain[a]*cos(SPosition[a]); V2y[i]+=SGain[a]*SGain[a]*sin(SPosition[a]); - 303 - Appendix } if(P) { Vx[i]/=P; Vy[i]/=P; V2x[i]/=E; V2y[i]/=E; } VFit+=(1-(VolScale/P))*(1-(VolScale/P)); MFit+=pow(1-sqrt((Vx[i]*Vx[i])+(Vy[i]*Vy[i])),2); double tAng=Ang-atan2(Vy[i],Vx[i]); if(tAng>M_PI) tAng-=(2*M_PI); if(tAng<-M_PI) 
tAng+=(2*M_PI); double tAng2=Ang-atan2(V2y[i],V2x[i]); if(tAng2>M_PI) tAng2-=(2*M_PI); if(tAng2<-M_PI) tAng2+=(2*M_PI); AFit2+=tAng2*tAng2; i++; } VFit=sqrt(VFit/(double)NAngles); MFit=sqrt(MFit/(double)NAngles); AFit=sqrt(AFit/(double)NAngles); AFit2=sqrt(AFit2/(double)NAngles); return(AFit+(AFit2)+(MFit*4.0f/5.0f)+(VFit)); } #endif - 304 - Appendix //------------------------------------------------------------------//-------------------------HIGHTABU.H-------------------------------//------------------------------------------------------------------#ifndef HighTabuH #define HighTabuH #include <math.h> class HighTabu { private: double Current[32],SPosition[32],SGain[32],Vx[512],Vy[512]; double ResCurrent; double MFit,VFit,AFit,AFit2,P,VolScale,E; double NAngles,AStep; double W,X,Y,WSig,XSig,YSig; int NSpeakers,ResControl,CDir[32],ResCDir; public: double CBest[32],OBest[32],ResBestLocal,ResBestOverall; double StepSize; int MUp[32],MDown[32],MMax; HighTabu(double *Array, double *SPos, int NPoints); ~HighTabu(); void StartTabu(); double CalcArrays(); }; //------------------------------------------------------------------HighTabu::HighTabu(double *Array, double *SPos, int NPoints) { NAngles=90; StepSize=0.01; AStep=M_PI*2/NAngles; NSpeakers=NPoints; MMax=99999999; for(int a=0;a<(NPoints*2)-1;a++) { //Copy initial Startup array Current[a]=CBest[a]=OBest[a]=Array[a]; SPosition[a]=SPos[a]; MUp[a]=MDown[a]=0; } W=1/(sqrt(2.0f)); ResBestOverall=CalcArrays(); } //------------------------------------------------------------------HighTabu::~HighTabu() { } //------------------------------------------------------------------void HighTabu::StartTabu() { double CMax; ResBestLocal=999999; for(int control=0;control<(NSpeakers*2)-1;control++) { if(control==(NSpeakers*2)-2) CMax=2.0f; else CMax=1.0f; for(int test=1;test<3;test++) - 305 - Appendix { if(!MUp[control] && test==1) { if(Current[control]>=CMax) { Current[control]=CMax; MUp[control]+=5; CDir[control]=0; } else { Current[control]+=StepSize; CDir[control]=1; } } else if(test==1) { CDir[control]=0; } if(!MDown[control] && test==2) { if(Current[control]<=0) { Current[control]=0; MDown[control]+=5; CDir[control]=0; } else { Current[control]-=StepSize; CDir[control]=-1; } } else if(test==2) { CDir[control]=0; } if(MUp[control]&&MDown[control]) { CDir[control]=0; } if(CDir[control]) { ResCurrent=CalcArrays(); } else { ResCurrent=999999; } if(ResCurrent<ResBestLocal) { ResCDir=CDir[control]; ResControl=control; for(int a=0;a<(NSpeakers*2)-1;a++) CBest[a]=Current[a]; ResBestLocal=ResCurrent; } Current[control]-=StepSize* - 306 - Appendix ((double)CDir[control]); } if(MDown[control]>MMax) MDown[control]=MMax; if(MUp[control]>MMax) MUp[control]=MMax; if(MDown[control]) MDown[control]--; if(MUp[control]) MUp[control]--; } if(ResCDir==1) MDown[ResControl]+=5; if(ResCDir==-1) MUp[ResControl]+=5; for(int a=0;a<(NSpeakers*2)-1;a++) { Current[a]=CBest[a]; } if(ResBestLocal<ResBestOverall) { ResBestOverall=ResBestLocal; for(int a=0;a<(NSpeakers*2)-1;a++) OBest[a]=CBest[a]; } } //------------------------------------------------------------------double HighTabu::CalcArrays() { if(!NSpeakers) Application->MessageBox("Stop1",NULL,NULL); double Ll=Current[8]; double w1=Current[0],x1=Current[1],y1=0; double w2=Current[2],x2=Current[3],y2=Current[4]; double w3=Current[5],x3=Current[6],y3=Current[7]; double iLl=1/Ll,P; int i=0; MFit=VFit=AFit=0; for(double Ang=0;Ang<2*M_PI;Ang+=AStep) { X=cos(Ang); Y=sin(Ang); WSig=(0.5*(Ll+iLl)*W) + ((1/sqrt(8))*(Ll-iLl)*X); 
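// Note (descriptive comment, hedged): the statement above and the one that
// follows appear to form the standard first-order Ambisonic forward dominance
// transform, with Ll (lambda, taken from Current[8]) as the parameter being
// optimised:
//   W' = 0.5*(lambda + 1/lambda)*W + (1/sqrt(8))*(lambda - 1/lambda)*X
//   X' = 0.5*(lambda + 1/lambda)*X + (1/sqrt(2))*(lambda - 1/lambda)*W
// A lambda of 1 leaves the encoded test source unchanged; other values bias
// the virtual source distribution towards or away from the front of the array.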
XSig=(0.5*(Ll+iLl)*X) + ((1/sqrt(2))*(Ll-iLl)*W); YSig=Y; SGain[0]=(w1*WSig) SGain[1]=(w2*WSig) SGain[2]=(w3*WSig) SGain[3]=(w3*WSig) SGain[4]=(w2*WSig) + + + (x1*XSig) (x2*XSig) (x3*XSig) (x3*XSig) (x2*XSig) + + + - (y1*YSig); (y2*YSig); (y3*YSig); (y3*YSig); (y2*YSig); P=0;Vx[i]=0;Vy[i]=0,E=0; for(int a=0;a<NSpeakers;a++) { P+=SGain[a]*SGain[a]; E+=SGain[a]*SGain[a]; } if(i==0) VolScale=P; for(int a=0;a<NSpeakers;a++) { Vx[i]+=SGain[a]*SGain[a]*cos(SPosition[a]); Vy[i]+=SGain[a]*SGain[a]*sin(SPosition[a]); } if(E) { Vx[i]/=E; - 307 - Appendix Vy[i]/=E; } VFit+=(1-(VolScale/P))*(1-(VolScale/P)); MFit+=pow(1-sqrt((Vx[i]*Vx[i])+(Vy[i]*Vy[i])),2); double tAng=Ang-atan2(Vy[i],Vx[i]); if(tAng>M_PI) tAng-=(2*M_PI); if(tAng<-M_PI) tAng+=(2*M_PI); AFit+=tAng*tAng; i++; } VFit=sqrt(VFit/(double)NAngles); MFit=sqrt(MFit/(double)NAngles); AFit=sqrt(AFit/(double)NAngles); return(AFit+MFit/3+VFit/2); } #endif - 308 - Appendix 9.2.2 Windows C++ Code used in the Real-Time Audio System Software //------------------------------------------------------------------//---------------------------MAIN.CPP-------------------------------//------------------------------------------------------------------#include <vcl.h> #pragma hdrstop #include "Main.h" #include "WigSound2.h" //------------------------------------------------------------------#pragma package(smart_init) #pragma resource "*.dfm" TAmbiToAll *AmbiToAll; WigSound2 *WAudio; //------------------------------------------------------------------__fastcall TAmbiToAll::TAmbiToAll(TComponent* Owner) : TForm(Owner) { } //------------------------------------------------------------------void __fastcall TAmbiToAll::FormCreate(TObject *Sender) { WAudio = new WigSound2(this); //Gives this pointer to the form class Button2->Enabled=false; Button3->Enabled=false; Button4->Enabled=false; ScrollBar2Change(ScrollBar2); ScrollBar3Change(ScrollBar3); } //------------------------------------------------------------------void __fastcall TAmbiToAll::Button1Click(TObject *Sender) { unsigned short Buff=2049; int nchan = (NumChannels->ItemIndex+1)*2; m_volume = -ScrollBar2->Position/100.0f; if(SampleRate->ItemIndex==1) { WAudio->InitMem(nchan,Buff,48000); WAudio->SkipAudio(ScrollBar1->Position); WAudio->Initialise(nchan,48000,Buff,4,4); } else { WAudio->InitMem(nchan,Buff,44100); WAudio->SkipAudio(ScrollBar1->Position); WAudio->Initialise(nchan,44100,Buff,4,4); } WAudio->OpenDevice(1); Button1->Enabled=false; Button3->Enabled=true; Button4->Enabled=false; } //------------------------------------------------------------------void __fastcall TAmbiToAll::Button3Click(TObject *Sender) { WAudio->Pause(); Button2->Enabled=true; - 309 - Appendix Button3->Enabled=false; Button4->Enabled=true; } //------------------------------------------------------------------void __fastcall TAmbiToAll::Button2Click(TObject *Sender) { WAudio->SkipAudio(ScrollBar1->Position); WAudio->UnPause(); Button2->Enabled=false; Button3->Enabled=true; Button4->Enabled=false; } //------------------------------------------------------------------void __fastcall TAmbiToAll::Button4Click(TObject *Sender) { unsigned short Buff=2049; Button1->Enabled=true; Button2->Enabled=false; Button3->Enabled=false; Button4->Enabled=false; WAudio->CloseDevice(1); WAudio->UnInitMem(2,Buff); ScrollBar1->Position = 0; } //------------------------------------------------------------------void __fastcall TAmbiToAll::FormDestroy(TObject *Sender) { if(Button3->Enabled) { Button3Click(Button3); Sleep(400); } if(Button4->Enabled) { 
Button4Click(Button4); Sleep(400); } delete WAudio; } //------------------------------------------------------------------void __fastcall TAmbiToAll::WButClick(TObject *Sender) { TEdit *ptr = (TEdit *)Sender; char *cptr = ptr->Name.c_str(); bool result; if(cptr[0]!='c') result = OpenDialog1->Execute(); else result = true; if(result) { switch(cptr[0]) { case 'W': WFName = OpenDialog1->FileName; WEdit->Text = WFName; break; case 'X': XFName = OpenDialog1->FileName; XEdit->Text = XFName; break; case 'Y': - 310 - Appendix YFName = OpenDialog1->FileName; YEdit->Text = YFName; break; case 'Z': ZFName = OpenDialog1->FileName; ZEdit->Text = ZFName; case 'c': switch(cptr[1]) { case 'W': WFName = NULL; WEdit->Text = WFName; break; case 'X': XFName = NULL; XEdit->Text = XFName; break; case 'Y': YFName = NULL; YEdit->Text = YFName; break; case 'Z': ZFName = NULL; ZEdit->Text = ZFName; break; } break; } } } //------------------------------------------------------------------void TAmbiToAll::UpdateWaveTime(unsigned long WRead) { WaveRead = WRead; ScrollBar1->Position = (int)((float)(WaveRead)*200.0f/(float)(WaveSize)); } //------------------------------------------------------------------void __fastcall TAmbiToAll::RotorSlider1Change(TObject *Sender) { Label1->Caption = IntToStr((int)(360 – RotorSlider1->DotPosition + 0.5f)); RotAngle = -RotorSlider1->DotPosition*M_PI/180.0f; } //------------------------------------------------------------------void __fastcall TAmbiToAll::AmbiEffectClick(TObject *Sender) { m_effect = AmbiEffect->ItemIndex; } //------------------------------------------------------------------void __fastcall TAmbiToAll::RotorSlider2Change(TObject *Sender) { Label2->Caption = IntToStr((int)(360 - RotorSlider2>DotPosition+0.5f)); monopan = -RotorSlider2->DotPosition*M_PI/180.0f; } //------------------------------------------------------------------void __fastcall TAmbiToAll::TransFilterClick(TObject *Sender) { WAudio->UpdateFilter = true; } - 311 - Appendix //------------------------------------------------------------------void __fastcall TAmbiToAll::ScrollBar2Change(TObject *Sender) { float db; m_volume = -ScrollBar2->Position/100.0f; if(m_volume) { db = 20 * log10(m_volume); Label5->Caption = FloatToStrF(db,ffFixed,3,1) + "dB"; } else Label5->Caption = "-Inf"; } //------------------------------------------------------------------void __fastcall TAmbiToAll::RearFilterClick(TObject *Sender) { WAudio->UpdateRearFilter = true; } //------------------------------------------------------------------void __fastcall TAmbiToAll::ScrollBar3Change(TObject *Sender) { m_width = -ScrollBar3->Position/100.0f; Label6->Caption = FloatToStrF(m_width,ffFixed,4,2); } //------------------------------------------------------------------void __fastcall TAmbiToAll::RotorSlider3Change(TObject *Sender) { Label9->Caption = IntToStr( (int)(RotorSlider3->DotPosition - 90.0f + 0.5f)); TiltAngle = (RotorSlider3->DotPosition - 90.0f)*M_PI/180.0f; } //------------------------------------------------------------------- - 312 - Appendix //------------------------------------------------------------------//-----------------------------MAIN.H-------------------------------//------------------------------------------------------------------#ifndef MainH #define MainH //------------------------------------------------------------------#include <Classes.hpp> #include <Controls.hpp> #include <StdCtrls.hpp> #include <Forms.hpp> #include "RotorSlider.h" #include "LevelMeter2.h" #include "Oscilloscope.h" #include "GLGraph.h" 
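// Note (assumption): RotorSlider.h, LevelMeter2.h, Oscilloscope.h and
// GLGraph.h are taken to be the author's custom VCL components used by this
// form, supplied with the project source rather than part of the standard
// C++ Builder VCL.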
#include <ComCtrls.hpp> #include <ExtCtrls.hpp> #include <Dialogs.hpp> //------------------------------------------------------------------class TAmbiToAll : public TForm { __published: // IDE-managed Components TButton *Button1; TButton *Button2; TButton *Button3; TButton *Button4; TEdit *WEdit; TEdit *XEdit; TEdit *YEdit; TEdit *ZEdit; TButton *WBut; TButton *XBut; TButton *YBut; TButton *ZBut; TOpenDialog *OpenDialog1; TScrollBar *ScrollBar1; TButton *cW; TButton *cX; TButton *cY; TButton *cZ; TRotorSlider *RotorSlider1; TLabel *Label1; TOscilloscope *Oscilloscope1; TOscilloscope *Oscilloscope2; TRadioGroup *AmbiEffect; TRadioGroup *AmbiInput; TRotorSlider *RotorSlider2; TLabel *Label2; TLabel *Label3; TLabel *Label4; TRadioGroup *NumChannels; TRadioGroup *SampleRate; TRadioGroup *TransFilter; TRadioGroup *RearFilter; TScrollBar *ScrollBar2; TLabel *Label5; TScrollBar *ScrollBar3; TLabel *Label6; TLabel *Label7; TLabel *Label8; TRotorSlider *RotorSlider3; TLabel *Label9; TLabel *Label10; - 313 - Appendix void __fastcall Button1Click(TObject *Sender); void __fastcall Button3Click(TObject *Sender); void __fastcall Button2Click(TObject *Sender); void __fastcall Button4Click(TObject *Sender); void __fastcall FormCreate(TObject *Sender); void __fastcall FormDestroy(TObject *Sender); void __fastcall WButClick(TObject *Sender); void __fastcall RotorSlider1Change(TObject *Sender); void __fastcall AmbiEffectClick(TObject *Sender); void __fastcall RotorSlider2Change(TObject *Sender); void __fastcall TransFilterClick(TObject *Sender); void __fastcall ScrollBar2Change(TObject *Sender); void __fastcall RearFilterClick(TObject *Sender); void __fastcall ScrollBar3Change(TObject *Sender); void __fastcall RotorSlider3Change(TObject *Sender); private: // User declarations bool TWriting; // User declarations public: unsigned long WaveRead; unsigned long WaveSize; void UpdateWaveTime(unsigned long WRead); __fastcall TAmbiToAll(TComponent* Owner); AnsiString WFName, XFName, YFName, ZFName; short m_effect; float m_volume,m_width; float RotAngle,monopan,TiltAngle; }; //------------------------------------------------------------------extern PACKAGE TAmbiToAll *AmbiToAll; //------------------------------------------------------------------#endif - 314 - Appendix //------------------------------------------------------------------//--------------------------WIGSOUND.H------------------------------//------------------------------------------------------------------#ifndef WigSoundH #define WigSoundH #include <mmsystem.h> class WigSound { private: WAVEHDR *WaveHeadersOut,*WaveHeadersIn,*SampleBuffer; HWAVEOUT hWaveOut; HWAVEIN hWaveIn; MMRESULT Error; unsigned int NoOfBuffers,NoOfQueueBuffers; unsigned short NoOfChannels,BufferLengthPerChannel; friend void CALLBACK WaveOutCallback(HWAVEOUT hwo, UINT uMsg, WORD dwInstance,DWORD dwParam1, DWORD dwParam2); friend void CALLBACK WaveInCallback(HWAVEIN hwi, UINT uMsg, DWORD dwInstance,DWORD dwParam1, DWORD dwParam2); void ClearBufferFromFIFO(); void ProcessErrorIn(MMRESULT Error); void ProcessErrorOut(MMRESULT Error); protected: WAVEFORMATEX WaveFormat; public: WigSound(); void Initialise(unsigned short usNoOfChannels, unsigned long usSampleRate,unsigned short usBufferLengthPerChannel, unsigned int uiNoOfBuffers,unsigned int uiNoOfQueueBuffers); virtual void ProcessAudio(WAVEHDR *pWaveHeader, unsigned short usNoOfChannels, unsigned short usBufferLengthPerChannel); virtual void MonitorAudio(WAVEHDR *pWaveHeader, unsigned short usNoOfChannels, unsigned short 
usBufferLengthPerChannel); void ProcessAudioIn(WAVEHDR *pWaveHeader, unsigned short usNoOfChannels, unsigned short usBufferLengthPerChannel); void OpenDevice(UINT Device); void CloseDevice(UINT Device); void Pause(); void UnPause(); void WaveInFunc(WAVEHDR *pWaveHeader); void WaveOutFunc(WAVEHDR *pWaveHeader); bool Closing,Paused; WAVEHDR *ReadBuffer,*WriteBuffer; }; //------------------------------------------------------------------WigSound::WigSound() { } //------------------------------------------------------------------void WigSound::Initialise( unsigned short usNoOfChannels,unsigned long usSampleRate, unsigned short usBufferLengthPerChannel, unsigned int uiNoOfBuffers,unsigned int uiNoOfQueueBuffers) { WaveFormat.wFormatTag = WAVE_FORMAT_PCM; - 315 - Appendix WaveFormat.nChannels = usNoOfChannels; WaveFormat.nSamplesPerSec = usSampleRate; WaveFormat.wBitsPerSample = 16; WaveFormat.nBlockAlign = (unsigned short)(usNoOfChannels*16/8); WaveFormat.nAvgBytesPerSec = (unsigned long)(usSampleRate*WaveFormat.nBlockAlign); WaveFormat.cbSize = 0; NoOfBuffers NoOfQueueBuffers NoOfChannels BufferLengthPerChannel SampleBuffer WriteBuffer ReadBuffer WaveHeadersOut WaveHeadersIn Closing Paused = uiNoOfBuffers; = uiNoOfQueueBuffers; = usNoOfChannels; = usBufferLengthPerChannel; = new WAVEHDR[NoOfQueueBuffers]; = SampleBuffer; = SampleBuffer; = new WAVEHDR[NoOfBuffers]; = new WAVEHDR[NoOfBuffers]; = false; = true; for(UINT i=0;i<NoOfBuffers;i++) { WaveHeadersOut[i].dwBufferLength = usBufferLengthPerChannel*16*usNoOfChannels/8; WaveHeadersOut[i].lpData = new char[WaveHeadersOut[i].dwBufferLength]; memset(WaveHeadersOut[i].lpData,0,WaveHeadersOut[i].dwBufferLength); WaveHeadersOut[i].dwFlags=0; WaveHeadersOut[i].dwLoops=0; WaveHeadersIn[i].dwBufferLength = usBufferLengthPerChannel*16*usNoOfChannels/8; WaveHeadersIn[i].lpData = new char[WaveHeadersIn[i].dwBufferLength]; memset(WaveHeadersIn[i].lpData,0,WaveHeadersIn[i].dwBufferLength); WaveHeadersIn[i].dwFlags=0; WaveHeadersIn[i].dwLoops=0; } for(UINT i=0;i<NoOfQueueBuffers;i++) { SampleBuffer[i].dwBufferLength = usBufferLengthPerChannel*16*usNoOfChannels/8; SampleBuffer[i].lpData = new char[SampleBuffer[i].dwBufferLength]; memset(SampleBuffer[i].lpData,0,SampleBuffer[i].dwBufferLength); SampleBuffer[i].dwFlags = 0; SampleBuffer[i].dwLoops = 0; } } //------------------------------------------------------------------void WigSound::OpenDevice(UINT Device) { Device?Device--:Device=WAVE_MAPPER; Error = waveOutOpen(&hWaveOut,Device,&WaveFormat, (DWORD)WaveOutCallback, - 316 - Appendix if(Error) (DWORD)this,CALLBACK_FUNCTION); ProcessErrorOut(Error); Error = waveOutPause(hWaveOut); if(Error) ProcessErrorOut(Error); for(UINT i=0;i<NoOfBuffers;i++) { Error = waveOutPrepareHeader(hWaveOut, &WaveHeadersOut[i],sizeof(WaveHeadersOut[i])); if(Error) ProcessErrorOut(Error); Error = waveOutWrite(hWaveOut, &WaveHeadersOut[i],sizeof(WaveHeadersOut[i])); if(Error) ProcessErrorOut(Error); } Error = waveInOpen(&hWaveIn,Device,&WaveFormat, (DWORD)WaveInCallback, (DWORD)this,CALLBACK_FUNCTION); if(Error) ProcessErrorIn(Error); for(UINT i=0;i<NoOfBuffers;i++) { Error = waveInPrepareHeader(hWaveIn, &WaveHeadersIn[i], sizeof(WaveHeadersIn[i])); if(Error) ProcessErrorIn(Error); Error = waveInAddBuffer(hWaveIn,&WaveHeadersIn[i], sizeof(WaveHeadersIn[i])); if(Error) ProcessErrorIn(Error); } Error = waveOutRestart(hWaveOut); if(Error) ProcessErrorOut(Error); Error = waveInStart(hWaveIn); if(Error) ProcessErrorIn(Error); Paused=false; } 
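//-------------------------------------------------------------------
// Summary of the streaming scheme implemented by the routines below:
// waveOutOpen/waveInOpen are called with CALLBACK_FUNCTION, so WaveOutCallback
// receives WOM_DONE whenever an output buffer finishes playing and
// WaveInCallback receives WIM_DATA whenever an input buffer has been filled.
// Completed output buffers are refilled via ProcessAudio() and re-queued with
// waveOutWrite(); completed input buffers are copied into the SampleBuffer
// FIFO by ProcessAudioIn() and re-added with waveInAddBuffer(), so the set of
// NoOfBuffers input and output buffers is recycled continuously.
//-------------------------------------------------------------------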
//------------------------------------------------------------------void WigSound::CloseDevice(UINT Device) { Closing=true; Error = waveInReset(hWaveIn); if(Error) ProcessErrorIn(Error); Error = waveOutReset(hWaveOut); if(Error) ProcessErrorOut(Error); Sleep(300); for(UINT i=0;i<NoOfBuffers;i++) { Error = waveOutUnprepareHeader(hWaveOut, &WaveHeadersOut[i],sizeof(WaveHeadersOut[i])); if(Error) ProcessErrorOut(Error); if(WaveHeadersOut[i].lpData) delete [] WaveHeadersOut[i].lpData; Error = waveInUnprepareHeader(hWaveIn, &WaveHeadersIn[i], sizeof(WaveHeadersIn[i])); - 317 - Appendix if(Error) ProcessErrorIn(Error); if(WaveHeadersIn[i].lpData) delete [] WaveHeadersIn[i].lpData; } for(UINT i=0;i<NoOfQueueBuffers;i++) { if(SampleBuffer[i].lpData) delete [] SampleBuffer[i].lpData; } if(WaveHeadersOut) delete [] WaveHeadersOut; if(WaveHeadersIn) delete [] WaveHeadersIn; if(SampleBuffer) delete [] SampleBuffer; Error = waveInClose(hWaveIn); if(Error) ProcessErrorIn(Error); Error = waveOutClose(hWaveOut); if(Error) ProcessErrorOut(Error); } //------------------------------------------------------------------void WigSound::Pause() { Paused=true; } //------------------------------------------------------------------void WigSound::UnPause() { Paused=false; } //------------------------------------------------------------------void WigSound::ProcessErrorIn(MMRESULT Error) { char Text[256]; waveInGetErrorText(Error,Text,sizeof(Text)); MessageBox(NULL,Text,"Error",MB_OK); } //------------------------------------------------------------------void WigSound::ProcessErrorOut(MMRESULT Error) { char Text[256]; waveOutGetErrorText(Error,Text,sizeof(Text)); MessageBox(NULL,Text,"Error",MB_OK); } //------------------------------------------------------------------void WigSound::WaveInFunc(WAVEHDR *pWaveHeader) { ProcessAudioIn(pWaveHeader,NoOfChannels, BufferLengthPerChannel); Error = waveInAddBuffer(hWaveIn,pWaveHeader, sizeof(*pWaveHeader)); } //------------------------------------------------------------------void WigSound::WaveOutFunc(WAVEHDR *pWaveHeader) { ProcessAudio(pWaveHeader,NoOfChannels, BufferLengthPerChannel); ClearBufferFromFIFO(); Error = waveOutWrite(hWaveOut,pWaveHeader, sizeof(*pWaveHeader)); } - 318 - Appendix //------------------------------------------------------------------void CALLBACK WaveOutCallback(HWAVEOUT hwo, UINT uMsg, DWORD dwInstance,DWORD dwParam1, DWORD dwParam2) { WigSound *me = (WigSound *)dwInstance; switch(uMsg) { case WOM_DONE: { if(!me->Closing) me->WaveOutFunc((WAVEHDR *)dwParam1); break; } default: break; } } //------------------------------------------------------------------void CALLBACK WaveInCallback(HWAVEIN hwi, UINT uMsg, DWORD dwInstance, DWORD dwParam1, DWORD dwParam2) { WigSound *me = (WigSound *)dwInstance; switch(uMsg) { case WIM_DATA: { if(!me->Closing) me->WaveInFunc((WAVEHDR *)dwParam1); break; } default: break; } } //------------------------------------------------------------------void WigSound::ProcessAudio(WAVEHDR *pWaveHeader, unsigned short usNoOfChannels, unsigned short usBufferLengthPerChannel) { } //------------------------------------------------------------------void WigSound::MonitorAudio(WAVEHDR *pWaveHeader, unsigned short usNoOfChannels, unsigned short usBufferLengthPerChannel) { } //------------------------------------------------------------------void WigSound::ProcessAudioIn(WAVEHDR *pWaveHeader, unsigned short usNoOfChannels, unsigned short usBufferLengthPerChannel) { memcpy(WriteBuffer->lpData,pWaveHeader->lpData, 
           pWaveHeader->dwBufferLength);
    WriteBuffer++;
    if(WriteBuffer>&SampleBuffer[NoOfQueueBuffers-1])
        WriteBuffer=&SampleBuffer[NoOfQueueBuffers-1];
    MonitorAudio(pWaveHeader,usNoOfChannels,
                 usBufferLengthPerChannel);
}
//-------------------------------------------------------------------
void WigSound::ClearBufferFromFIFO()
{
    for(UINT i=0;i<NoOfQueueBuffers-1;i++)
    {
        memcpy(SampleBuffer[i].lpData,
               SampleBuffer[i+1].lpData,
               SampleBuffer[i].dwBufferLength);
    }
    if(WriteBuffer>SampleBuffer)
        WriteBuffer--;
}
//-------------------------------------------------------------------
#endif

//-------------------------------------------------------------------
//-------------------------WIGSOUND2.H-------------------------------
//-------------------------------------------------------------------
#ifndef WigSoundH2
#define WigSoundH2

#include <fstream.h>
#include "WigSound.h"
#include "WigAmbi.h"
#include "WaveFile.h"
#include "FastConv.h"
#include "AllPass.h"
#include "Main.h"

#define BLEN     4096
#define FFTORDER 12
#define FFTSIZE  4096

class WigSound2 : public WigSound
{
private:
    float **Samples,**Decode,*SElev,*SAzim,*mono;
    bool bSkip;
    long SkipOffset;
    AmbiBuffer *ABuf,*BBuf;
    int NoOfSpeakers,SampleRate;
    AnsiString DIR;
    //For 2 ears
    FastFilter *WF,*XF,*YF,*ZF;
    FastFilter *WF2D,*XF2D,*YF2D;
    //For 4 ears
    FastFilter *WFf,*WFr,*XFf,*XFr,*YFf,*YFr;
    //For Front...
    FastFilter *h1fl,*h2fl,*h1fr,*h2fr;
    // and Back X-Talk Cancellation Filters
    FastFilter *h1rl,*h2rl,*h1rr,*h2rr;
    //AllPass Filters for cheap Ambisonics decoder
    AllPass *WAP,*XAP,*YAP;
    void LoadFilters(int SRate);
    void UnloadFilters();
    void ChooseFilter(int SRate);
    void ChooseRearFilter(int SRate);
    void B2Headphones(AmbiBuffer *Signal, float **Samples,
                      int NoOfChannels);
    void B2Headphones2D(AmbiBuffer *Signal, float **Samples,
                        int NoOfChannels);
    void B2Headphones4(AmbiBuffer *Signal, AmbiBuffer *Signal2,
                       float **Samples,int NoOfChannels);
    void B2Trans(AmbiBuffer *Signal,float *Left,float *Right,
                 int NoOfChannels,FastFilter *h1, FastFilter *h2,
                 FastFilter *h1r, FastFilter *h2r);
public:
    WigSound2(TAmbiToAll *Sender);
    ~WigSound2();
    void InitMem(unsigned short usNoOfChannels,
                 unsigned short usBufferLengthPerChannel, int SRate);
    void UnInitMem(unsigned short usNoOfChannels,
                   unsigned short usBufferLengthPerChannel);
    void ProcessAudio(WAVEHDR *pWaveHeader,
                      unsigned short usNoOfChannels,
                      unsigned short usBufferLengthPerChannel);
    void MonitorAudio(WAVEHDR *pWaveHeader,
                      unsigned short usNoOfChannels,
                      unsigned short usBufferLengthPerChannel);
    void SkipAudio(int Offset);
    WigFile WFile,XFile,YFile,ZFile;
    TAmbiToAll *Window;
    bool UpdateFilter,UpdateRearFilter;
};
//-------------------------------------------------------------------
WigSound2::WigSound2(TAmbiToAll *Sender)
{
    Window = Sender;
    NoOfSpeakers=8;
    SkipOffset = 0;
    bSkip = false;
    UpdateFilter = false;
    DIR = GetCurrentDir();
    DIR+="\\";
}

WigSound2::~WigSound2()
{
}

void WigSound2::LoadFilters(int SRate)
{
    AnsiString wname,xname,yname,zname;
    ZF=NULL;
    if(SRate==48000)
    {
        wname = DIR + "Wh481024.dat";
        xname = DIR + "Xh481024.dat";
        yname = DIR + "Yh481024.dat";
        zname = DIR + "Zh481024.dat";
        WF = new FastFilter(FFTORDER,&wname,1024);
        XF = new FastFilter(FFTORDER,&xname,1024);
        YF = new FastFilter(FFTORDER,&yname,1024,1);
        ZF = new FastFilter(FFTORDER,&zname,1024);
        wname = DIR + "Wh4810242D.dat";
        xname = DIR + "Xh4810242D.dat";
        yname = DIR + "Yh4810242D.dat";
        WF2D = new FastFilter(FFTORDER,&wname,1024);
        XF2D = new FastFilter(FFTORDER,&xname,1024);
        YF2D = new
FastFilter(FFTORDER,&yname,1024,1); wname = DIR + "WhFront1024.dat"; xname = DIR + "XhFront1024.dat"; yname = DIR + "YhFront1024.dat"; WFf = new FastFilter(FFTORDER,&wname,1024); XFf = new FastFilter(FFTORDER,&xname,1024); YFf = new FastFilter(FFTORDER,&yname,1024,1); wname = DIR + "WhRear1024.dat"; xname = DIR + "XhRear1024.dat"; yname = DIR + "YhRear1024.dat"; WFr = new FastFilter(FFTORDER,&wname,1024); XFr = new FastFilter(FFTORDER,&xname,1024); YFr = new FastFilter(FFTORDER,&yname,1024,1); wname = DIR + "h1348.dat"; xname = DIR + "h2348.dat"; h1fl = new FastFilter(FFTORDER,&wname,2048); - 322 - Appendix h2fl = new FastFilter(FFTORDER,&xname,2048); h1fr = new FastFilter(FFTORDER,&wname,2048); h2fr = new FastFilter(FFTORDER,&xname,2048); } else { wname = DIR + "Wh1024.dat"; xname = DIR + "Xh1024.dat"; yname = DIR + "Yh1024.dat"; zname = DIR + "Zh1024.dat"; WF = new FastFilter(FFTORDER,&wname,1024); XF = new FastFilter(FFTORDER,&xname,1024); YF = new FastFilter(FFTORDER,&yname,1024,1); ZF = new FastFilter(FFTORDER,&zname,1024); wname = DIR + "Wh1024.dat"; xname = DIR + "Xh1024.dat"; yname = DIR + "Yh1024.dat"; WF2D = new FastFilter(FFTORDER,&wname,1024); XF2D = new FastFilter(FFTORDER,&xname,1024); YF2D = new FastFilter(FFTORDER,&yname,1024,1); wname = DIR + "WhFront1024.dat"; xname = DIR + "XhFront1024.dat"; yname = DIR + "YhFront1024.dat"; WFf = new FastFilter(FFTORDER,&wname,1024); XFf = new FastFilter(FFTORDER,&xname,1024); YFf = new FastFilter(FFTORDER,&yname,1024,1); wname = DIR + "WhRear1024.dat"; xname = DIR + "XhRear1024.dat"; yname = DIR + "YhRear1024.dat"; WFr = new FastFilter(FFTORDER,&wname,1024); XFr = new FastFilter(FFTORDER,&xname,1024); YFr = new FastFilter(FFTORDER,&yname,1024,1); wname = DIR + "h13.dat"; xname = DIR + "h23.dat"; h1fl = new FastFilter(FFTORDER,&wname,2048); h2fl = new FastFilter(FFTORDER,&xname,2048); h1fr = new FastFilter(FFTORDER,&wname,2048); h2fr = new FastFilter(FFTORDER,&xname,2048); } } void WigSound2::UnloadFilters() { delete WF; delete XF; delete YF; delete ZF; delete WF2D; delete XF2D; delete YF2D; delete WFf; delete XFf; delete YFf; delete WFr; delete XFr; delete YFr; delete h1fl; delete h2fl; delete h1fr; delete h2fr; } void WigSound2::InitMem( unsigned short usNoOfChannels, unsigned short usBufferLengthPerChannel, - 323 - Appendix int SRate) { SampleRate = SRate; Samples = AllocSampleBuffer(usNoOfChannels, usBufferLengthPerChannel); ABuf = AmbiAllocate(usBufferLengthPerChannel,0,1); //BBuf used for 4-ear algorithms BBuf = AmbiAllocate(usBufferLengthPerChannel,0,1); SElev = new float[NoOfSpeakers]; SAzim = new float[NoOfSpeakers]; mono = new float[usBufferLengthPerChannel]; for(int i=0;i<NoOfSpeakers;i++) { SElev[i]=0; SAzim[i]=(M_PI/(float)NoOfSpeakers)+ i*2*M_PI/(float)NoOfSpeakers; } Decode=AllocDecodeArray(NoOfSpeakers,0); DecoderCalc(SAzim,SElev,NoOfSpeakers,0,sqrt(2),Decode); WFile.WaveFile(Window->WFName.c_str()); XFile.WaveFile(Window->XFName.c_str()); YFile.WaveFile(Window->YFName.c_str()); ZFile.WaveFile(Window->ZFName.c_str()); Window->WaveSize = WFile.GetWaveSize(); WAP = new AllPass(usBufferLengthPerChannel); XAP = new AllPass(usBufferLengthPerChannel); YAP = new AllPass(usBufferLengthPerChannel); WAP->SetCutOff(500.0f,(float)SRate); XAP->SetCutOff(500.0f,(float)SRate); YAP->SetCutOff(500.0f,(float)SRate); Application->GetNamePath(); LoadFilters(SRate); Window->Oscilloscope1->Prepare(); Window->Oscilloscope2->Prepare(); UpdateFilter = UpdateRearFilter = true; } void WigSound2::UnInitMem( unsigned short usNoOfChannels, unsigned 
short usBufferLengthPerChannel) { Window->Oscilloscope1->Unprepare(); Window->Oscilloscope2->Unprepare(); UnloadFilters(); delete WAP; delete XAP; delete YAP; WFile.CloseWaveFile(); XFile.CloseWaveFile(); YFile.CloseWaveFile(); ZFile.CloseWaveFile(); FreeSampleBuffer(Samples,usNoOfChannels); delete[] mono; delete[] SAzim; delete[] SElev; FreeDecodeArray(Decode,0); AmbiFree(ABuf); AmbiFree(BBuf); } - 324 - Appendix void WigSound2::MonitorAudio(WAVEHDR *pWaveHeader, unsigned short usNoOfChannels, unsigned short usBufferLengthPerChannel) { //Input Callback //Not Much Here as using Wave Files as input. } void WigSound2::ProcessAudio(WAVEHDR *pWaveHeader, unsigned short usNoOfChannels, unsigned short usBufferLengthPerChannel) { short *inPtr = (short *)ReadBuffer->lpData; short *outPtr = (short *)pWaveHeader->lpData; float yn; //Output Callback if(!Paused) { if(bSkip) { bSkip = false; //Scale Offset from 0->200 to 0->WaveSize SkipOffset = (long)(((double)SkipOffset/200.0)* (double)WFile.GetWaveSize()); //Guarantee an even number (as offset is in bytes) //and wave file data is in shorts SkipOffset = SkipOffset/2; SkipOffset = SkipOffset*2; //Offset all files WFile.SkipIntoFile(SkipOffset); XFile.SkipIntoFile(SkipOffset); YFile.SkipIntoFile(SkipOffset); ZFile.SkipIntoFile(SkipOffset); } switch(Window->AmbiInput->ItemIndex) { case 0: //Wave File WFile.GetWaveSamples(ABuf->W,ABuf->Length); XFile.GetWaveSamples(ABuf->X,ABuf->Length); YFile.GetWaveSamples(ABuf->Y,ABuf->Length); ZFile.GetWaveSamples(ABuf->Z,ABuf->Length); Window->UpdateWaveTime(WFile.GetWaveRead()); break; case 1: //Mono in to be panned WFile.GetWaveSamples(mono,ABuf->Length); Window->UpdateWaveTime(WFile.GetWaveRead()); Mono2B(mono,ABuf,Window->monopan,0.0f); break; case 2: //Live in DeInterlace(ReadBuffer, Samples,usNoOfChannels); break; } BTilt(ABuf,Window->TiltAngle); BRotate(ABuf,Window->RotAngle); const float vol = Window->m_volume; - 325 - Appendix switch(Window->m_effect) { case 0: WAP->ProcessAudio(ABuf->W,1.33,1.15); XAP->ProcessAudio(ABuf->X,1.33,1.15); YAP->ProcessAudio(ABuf->Y,1.33,1.15); B2Speakers(Decode,ABuf,Samples, usNoOfChannels,8,0); break; case 1: B2Headphones(ABuf,Samples, usNoOfChannels); break; case 2: B2Headphones2D(ABuf,Samples, usNoOfChannels); break; case 3: if(UpdateFilter) { ChooseFilter(SampleRate); UpdateFilter = false; } B2Headphones(ABuf,Samples, usNoOfChannels); B2Trans(ABuf,Samples[0],Samples[1], usNoOfChannels,h1fl,h2fl,h1fr,h2fr); break; case 4: if(UpdateFilter) { ChooseFilter(SampleRate); UpdateFilter = false; } if(UpdateRearFilter) { ChooseRearFilter(SampleRate); UpdateRearFilter = false; } B2Headphones4(ABuf,BBuf, Samples,usNoOfChannels); B2Trans(ABuf,Samples[0],Samples[1], usNoOfChannels,h1fl,h2fl,h1fr,h2fr); if(usNoOfChannels>=4) B2Trans(ABuf,Samples[2], Samples[3], usNoOfChannels,h1rl,h2rl, h1rr,h2rr); break; case 5: if(UpdateFilter) { ChooseFilter(SampleRate); UpdateFilter = false; } B2Trans(ABuf,Samples[0],Samples[1], usNoOfChannels,h1fl,h2fl,h1fr,h2fr); break; default: B2Speakers(Decode,ABuf,Samples, - 326 - Appendix usNoOfChannels,8,0); break; } //Do Volume for(int i=0;i<usBufferLengthPerChannel;i++) { for(int j=0;j<usNoOfChannels;j++) { Samples[j][i]*= vol; } } Window->Oscilloscope1->SampleArray = Samples[0]; Window->Oscilloscope2->SampleArray = Samples[1]; Window->Oscilloscope1->UpdateGraph(); Window->Oscilloscope2->UpdateGraph(); ReInterlace(pWaveHeader,Samples,usNoOfChannels); } else { memset(pWaveHeader->lpData,0, pWaveHeader->dwBufferLength); } } void WigSound2::SkipAudio(int 
Offset)
{
    SkipOffset = (unsigned long)Offset;
    bSkip = true;
}

void WigSound2::B2Headphones(AmbiBuffer *Signal,
                             float **Samples,int NoOfChannels)
{
    const int Len = Signal->Length;
    const float Wid = Window->m_width;
    if(Window->m_effect==1 || Window->m_effect==2)
    {
        WF->OverAddFir(Signal->W,Wid);
        XF->OverAddFir(Signal->X,Wid);
        YF->OverAddFir(Signal->Y,Wid);
        if(ZF)
            ZF->OverAddFir(Signal->Z,Wid);
    }
    else
    {
        WF->OverAddFir(Signal->W);
        XF->OverAddFir(Signal->X);
        YF->OverAddFir(Signal->Y);
        if(ZF)
            ZF->OverAddFir(Signal->Z);
    }
    for(int i=0;i<Len;i++)
    {
        Samples[0][i] = 0.5*(Signal->W[i] + Signal->X[i]
                             + Signal->Y[i] + Signal->Z[i]);
        Samples[1][i] = 0.5*(Signal->W[i] + Signal->X[i]
                             - Signal->Y[i] + Signal->Z[i]);
    }
    for(int i=2;i<NoOfChannels;i++)
    {
        for(int j=0;j<Len;j++)
            Samples[i][j] = 0.0f;
    }
}

void WigSound2::B2Headphones4(AmbiBuffer *Signal, AmbiBuffer *Signal2,
                              float **Samples,int NoOfChannels)
{
    const int Len = Signal->Length;
    if(NoOfChannels>=4)
    {
        memcpy(Signal2->W,Signal->W,sizeof(float)*Len);
        memcpy(Signal2->X,Signal->X,sizeof(float)*Len);
        memcpy(Signal2->Y,Signal->Y,sizeof(float)*Len);
        WFf->OverAddFir(Signal->W);
        XFf->OverAddFir(Signal->X);
        YFf->OverAddFir(Signal->Y);
        WFr->OverAddFir(Signal2->W);
        XFr->OverAddFir(Signal2->X);
        YFr->OverAddFir(Signal2->Y);
        for(int i=0;i<Len;i++)
        {
            Samples[0][i] = Signal->W[i]  + Signal->X[i]  + Signal->Y[i];
            Samples[1][i] = Signal->W[i]  + Signal->X[i]  - Signal->Y[i];
            Samples[2][i] = Signal2->W[i] + Signal2->X[i] + Signal2->Y[i];
            Samples[3][i] = Signal2->W[i] + Signal2->X[i] - Signal2->Y[i];
        }
        for(int i=4;i<NoOfChannels;i++)
        {
            for(int j=0;j<Len;j++)
                Samples[i][j] = 0.0f;
        }
    }
}

void WigSound2::B2Headphones2D(AmbiBuffer *Signal,
                               float **Samples,int NoOfChannels)
{
    const int Len = Signal->Length;
    const float Wid = Window->m_width;
    if(Window->m_effect==1 || Window->m_effect==2)
    {
        WF2D->OverAddFir(Signal->W,Wid);
        XF2D->OverAddFir(Signal->X,Wid);
        YF2D->OverAddFir(Signal->Y,Wid);
    }
    else
    {
        WF2D->OverAddFir(Signal->W);
        XF2D->OverAddFir(Signal->X);
        YF2D->OverAddFir(Signal->Y);
    }
    for(int i=0;i<Len;i++)
    {
        Samples[0][i] = Signal->W[i]
                      + Signal->X[i]
                      + Signal->Y[i];
        Samples[1][i] = Signal->W[i]
                      + Signal->X[i]
                      - Signal->Y[i];
    }
    for(int i=2;i<NoOfChannels;i++)
    {
        for(int j=0;j<Len;j++)
            Samples[i][j] = 0.0f;
    }
}

void WigSound2::B2Trans(AmbiBuffer *Signal,float *Left,
                        float *Right,int NoOfChannels,
                        FastFilter *h1l, FastFilter *h2l,
                        FastFilter *h1r, FastFilter *h2r)
{
    const int Len = Signal->Length;
    const float Width = Window->m_width;
    float *tL = new float[Signal->Length];
    float *tR = new float[Signal->Length];
    memcpy(tL,Left,sizeof(float)*Len);
    memcpy(tR,Right,sizeof(float)*Len);
    h1l->OverAddFir(Left);
    h2l->OverAddFir(tL);
    h1r->OverAddFir(Right);
    h2r->OverAddFir(tR);
    for(int i=0;i<Len;i++)
    {
        Left[i]  = Left[i]  + (Width * tR[i]);
        Right[i] = Right[i] + (Width * tL[i]);
    }
    delete[] tL;
    delete[] tR;
}

void WigSound2::ChooseFilter(int SRate)
{
    AnsiString h1name,h2name;
    if(SRate==44100)
    {
        switch(Window->TransFilter->ItemIndex)
        {
            case 0:
                h1name = DIR + "h13.dat";
                h2name = DIR + "h23.dat";
                break;
            case 1:
                h1name = DIR + "h15.dat";
                h2name = DIR + "h25.dat";
                break;
            case 2:
                h1name = DIR + "h110.dat";
                h2name = DIR + "h210.dat";
                break;
            case 3:
                h1name = DIR + "h120.dat";
                h2name = DIR + "h220.dat";
                break;
            case 4:
                h1name = DIR + "h130.dat";
                h2name = DIR + "h230.dat";
                break;
            case 5:
                h1name = DIR + "h13b.dat";
                h2name = DIR + "h23b.dat";
                break;
        }
    }
    else if(SRate==48000)
    {
        switch(Window->TransFilter->ItemIndex)
        {
            case 0:
                h1name = DIR + "h1348.dat";
                h2name =
DIR + "h2348.dat"; break; case 1: h1name = DIR + "h1548.dat"; h2name = DIR + "h2548.dat"; break; case 2: h1name = DIR + "h11048.dat"; h2name = DIR + "h21048.dat"; break; case 3: h1name = DIR + "h12048.dat"; h2name = DIR + "h22048.dat"; break; case 4: h1name = DIR + "h13048.dat"; h2name = DIR + "h23048.dat"; break; case 5: h1name = DIR + "h13b48.dat"; h2name = DIR + "h23b48.dat"; break; } } delete h1fl; delete h2fl; delete h1fr; delete h2fr; h1fl = new FastFilter(FFTORDER,&h1name,2048); h2fl = new FastFilter(FFTORDER,&h2name,2048); h1fr = new FastFilter(FFTORDER,&h1name,2048); h2fr = new FastFilter(FFTORDER,&h2name,2048); } void WigSound2::ChooseRearFilter(int SRate) { AnsiString h1name,h2name; if(SRate==44100) { switch(Window->RearFilter->ItemIndex) { case 0: h1name = DIR + "h1175.dat"; h2name = DIR + "h2175.dat"; break; case 1: h1name = DIR + "h1170.dat"; h2name = DIR + "h2170.dat"; break; - 330 - Appendix case 2: h1name = DIR + "h1160.dat"; h2name = DIR + "h2160.dat"; break; case 3: h1name = DIR + "h1150.dat"; h2name = DIR + "h2150.dat"; break; case 4: h1name = DIR + "h1110.dat"; h2name = DIR + "h2110.dat"; break; } } else if(SRate==48000) { switch(Window->RearFilter->ItemIndex) { case 0: h1name = DIR + "h117548.dat"; h2name = DIR + "h217548.dat"; break; case 1: h1name = DIR + "h117048.dat"; h2name = DIR + "h217048.dat"; break; case 2: h1name = DIR + "h116048.dat"; h2name = DIR + "h216048.dat"; break; case 3: h1name = DIR + "h115048.dat"; h2name = DIR + "h215048.dat"; break; case 4: h1name = DIR + "h111048.dat"; h2name = DIR + "h211048.dat"; break; } } h1rl = new FastFilter(FFTORDER,&h1name,2048); h2rl = new FastFilter(FFTORDER,&h2name,2048); h1rr = new FastFilter(FFTORDER,&h1name,2048); h2rr = new FastFilter(FFTORDER,&h2name,2048); } #endif - 331 - Appendix //------------------------------------------------------------------//--------------------------ALLPASS.H-------------------------------//------------------------------------------------------------------#ifndef HALLPASS #define HALLPASS #include <math.h> //----------------------------------------------------------------//----------------------------------------------------------------class AllPass { private: float fs,fc,alpha,*Buffer; float ff,fb,in,out; const int BufLen; void DoAllPass(float *signal, int iLen, float aval); public: AllPass(int iLen); ~AllPass(); void SetCutOff(float fcut, float fsam); void ProcessAudio(float *signal, float dBLP, float dBHP, bool dummy); void ProcessAudio(float *signal, float LinLP, float LinHP); }; //----------------------------------------------------------------//----------------------------------------------------------------AllPass::AllPass(int iLen) : BufLen(iLen) { //Constructor - Set Default Cutoff, incase user doesn't ;-) SetCutOff(700.0f,44100.0f); ff=fb=in=out=0.0f; Buffer = new float[BufLen]; } AllPass::~AllPass() { delete[] Buffer; } inline void AllPass::SetCutOff(float fcut,float fsam) { fs = fsam; fc = fcut; float fcnorm = fc/fs; float w = 2*M_PI*fcnorm; float cw = cos(w); alpha = ((2-sqrt(pow(-2,2) - 4 * cw * cw)))/(2*cw); } //----------------------------------------------------------------inline void AllPass::DoAllPass(float *signal, int iLen, float aval) { float a,b; a = ff; b = fb; for(int i=0;i<iLen;i++) { out = (aval * signal[i]) - ff + (aval * fb); fb = out; ff = signal[i]; signal[i] = out; } } - 332 - Appendix //----------------------------------------------------------------void AllPass::ProcessAudio(float *signal, float dBLP, float dBHP , bool dummy) { float 
LinLP,LinHP,HP,LP; LinLP = pow(10,dBLP/20); LinHP = pow(10,dBHP/20); memcpy(Buffer,signal,sizeof(float) * BufLen); DoAllPass(Buffer,BufLen,alpha); for(int i=0;i<BufLen;i++) { HP = 0.5 * (signal[i] + Buffer[i]); LP = 0.5 * (signal[i] - Buffer[i]); signal[i] = LP * LinLP + HP * LinHP; } } //----------------------------------------------------------------void AllPass::ProcessAudio(float *signal, float LinLP, float LinHP) { float HP,LP; memcpy(Buffer,signal,sizeof(float) * BufLen); DoAllPass(Buffer,BufLen,alpha); for(int i=0;i<BufLen;i++) { HP = 0.5 * (signal[i] + Buffer[i]); LP = 0.5 * (signal[i] - Buffer[i]); signal[i] = (LP * LinLP) + (HP * LinHP); } } //----------------------------------------------------------------#endif - 333 - Appendix //------------------------------------------------------------------//---------------------------FASTFILTER.H---------------------------//------------------------------------------------------------------#ifndef HFASTCONV #define HFASTCONV #ifndef nsp_UsesTransform extern "C" { #define nsp_UsesTransform #include "nsp.h" } #endif #include <math.h> #include <fstream.h> class FastFilter { private: int order,fftsize,siglen,implen; float *OldArray,*Signal,*tconv,*h; SCplx *fh,*fSig,*fconv; public: FastFilter(int FFTOrder,AnsiString *FName,int FLength); FastFilter(int FFTOrder,AnsiString *FName, int FLength,bool inv); void ReLoadFilter(AnsiString *FName,int FLength); ~FastFilter(); void OverAddFir(float *signal); void OverAddFir(float *signal,float g); }; //------------------------------------------------------------------FastFilter::FastFilter(int FFTOrder,AnsiString *FName,int FLength) { order = FFTOrder; fftsize = pow(2,order); siglen = (fftsize/2) + 1; implen = fftsize/2; OldArray = new float[fftsize]; Signal = new float[fftsize]; tconv = new float[fftsize]; h = new float[fftsize]; fh = new SCplx[fftsize]; fSig = new SCplx[fftsize]; fconv = new SCplx[fftsize]; ReLoadFilter(FName,FLength); nspsRealFftNip(NULL,NULL,order,NSP_Init); nspsRealFftNip(h,fh,order,NSP_Forw); } //------------------------------------------------------------------FastFilter::FastFilter(int FFTOrder,AnsiString *FName,int FLength,bool inv) { order = FFTOrder; fftsize = pow(2,order); siglen = (fftsize/2) + 1; implen = fftsize/2; - 334 - Appendix OldArray = new float[fftsize]; Signal = new float[fftsize]; tconv = new float[fftsize]; h = new float[fftsize]; fh = new SCplx[fftsize]; fSig = new SCplx[fftsize]; fconv = new SCplx[fftsize]; ReLoadFilter(FName,FLength); for(int i=0;i<FLength;i++) { h[i] = -h[i]; } nspsRealFftNip(NULL,NULL,order,NSP_Init); nspsRealFftNip(h,fh,order,NSP_Forw); } //------------------------------------------------------------------FastFilter::~FastFilter() { delete[] tconv; delete[] OldArray; delete[] Signal; delete[] h; delete[] fh; delete[] fSig; delete[] fconv; } //------------------------------------------------------------------void FastFilter::ReLoadFilter(AnsiString *FName,int FLength) { FILE *f; int c; memset(OldArray,0,sizeof(float)*fftsize); memset(Signal,0,sizeof(float)*fftsize); memset(tconv,0,sizeof(float)*fftsize); memset(h,0,sizeof(float)*fftsize); memset(fh,0,sizeof(SCplx)*fftsize); memset(fSig,0,sizeof(SCplx)*fftsize); memset(fconv,0,sizeof(SCplx)*fftsize); f = fopen(FName->c_str(),"rb"); if(f) { c = fread(h,sizeof(float),FLength,f); if(c!=FLength) MessageBox(NULL,FName->c_str(), "Wrong Filter Length",NULL); fclose(f); } else MessageBox(NULL,FName->c_str(),"Couldn't Open File",NULL); } 
//-------------------------------------------------------------------
void FastFilter::OverAddFir(float *signal)
{
    static unsigned int i,j=0,k;
    memcpy(Signal,signal,siglen*sizeof(float));

    //FFT Real Input Signal
    nspsRealFftNip(Signal,fSig,order,NSP_Forw);

    //Do processing in unrolled loop to maximise pipeline usage
    for(i=0;i<implen;i+=4)
    {
        fconv[i].re   = (fh[i].re   * fSig[i].re)   - (fh[i].im   * fSig[i].im);
        fconv[i].im   = (fh[i].re   * fSig[i].im)   + (fh[i].im   * fSig[i].re);
        fconv[i+1].re = (fh[i+1].re * fSig[i+1].re) - (fh[i+1].im * fSig[i+1].im);
        fconv[i+1].im = (fh[i+1].re * fSig[i+1].im) + (fh[i+1].im * fSig[i+1].re);
        fconv[i+2].re = (fh[i+2].re * fSig[i+2].re) - (fh[i+2].im * fSig[i+2].im);
        fconv[i+2].im = (fh[i+2].re * fSig[i+2].im) + (fh[i+2].im * fSig[i+2].re);
        fconv[i+3].re = (fh[i+3].re * fSig[i+3].re) - (fh[i+3].im * fSig[i+3].im);
        fconv[i+3].im = (fh[i+3].re * fSig[i+3].im) + (fh[i+3].im * fSig[i+3].re);
    }
    fconv[i+1].re = (fh[i+1].re * fSig[i+1].re) - (fh[i+1].im * fSig[i+1].im);
    fconv[i+1].im = (fh[i+1].re * fSig[i+1].im) + (fh[i+1].im * fSig[i+1].re);

    //do inverse FFT
    nspsCcsFftNip(fconv,tconv,order,NSP_Inv);

    //Do overlap add
    for(i=0;i<siglen;i++)
        signal[i]=(tconv[i]+OldArray[i]);

    //update storage of 'old' samples
    for(i=siglen,k=0;i<siglen+implen-1;i++,k++)
    {
        OldArray[k]=tconv[i];
        OldArray[i]=0;
    }
}
//-------------------------------------------------------------------
void FastFilter::OverAddFir(float *signal, float g)
{
    static unsigned int i,j=0,k;
    memcpy(Signal,signal,siglen*sizeof(float));

    //FFT Real Input Signal
    nspsRealFftNip(Signal,fSig,order,NSP_Forw);

    //Do processing in unrolled loop to maximise pipeline usage
    for(i=0;i<implen;i+=4)
    {
        fconv[i].re   = (fh[i].re   * fSig[i].re)   - (fh[i].im   * fSig[i].im);
        fconv[i].im   = (fh[i].re   * fSig[i].im)   + (fh[i].im   * fSig[i].re);
        fconv[i+1].re = (fh[i+1].re * fSig[i+1].re) - (fh[i+1].im * fSig[i+1].im);
        fconv[i+1].im = (fh[i+1].re * fSig[i+1].im) + (fh[i+1].im * fSig[i+1].re);
        fconv[i+2].re = (fh[i+2].re * fSig[i+2].re) - (fh[i+2].im * fSig[i+2].im);
        fconv[i+2].im = (fh[i+2].re * fSig[i+2].im) + (fh[i+2].im * fSig[i+2].re);
        fconv[i+3].re = (fh[i+3].re * fSig[i+3].re) - (fh[i+3].im * fSig[i+3].im);
        fconv[i+3].im = (fh[i+3].re * fSig[i+3].im) + (fh[i+3].im * fSig[i+3].re);
    }
    fconv[i+1].re = (fh[i+1].re * fSig[i+1].re) - (fh[i+1].im * fSig[i+1].im);
    fconv[i+1].im = (fh[i+1].re * fSig[i+1].im) + (fh[i+1].im * fSig[i+1].re);

    //do inverse FFT
    nspsCcsFftNip(fconv,tconv,order,NSP_Inv);

    //Do overlap add
    for(i=0;i<siglen;i++)
        signal[i]=((1.0f - g) * signal[i]) + (g * (tconv[i]+OldArray[i]));

    //update storage of 'old' samples
    for(i=siglen,k=0;i<siglen+implen-1;i++,k++)
    {
        OldArray[k]=tconv[i];
        OldArray[i]=0;
    }
}
//-------------------------------------------------------------------
#endif

//-------------------------------------------------------------------
//----------------------------WIGFILE.H------------------------------
//-------------------------------------------------------------------
#ifndef WaveFileH
#define WaveFileH

#include <windows.h>
#include <mmsystem.h>

class WigFile
{
private:
    HMMIO FileHandle;
    MMCKINFO FileInfo,CkInfo,CkSubInfo;
    MMIOINFO IoInfo;
    long WaveSize,WavRead,InitialOffset;
    //char FileBuffer[16384];
public:
    WigFile();
    ~WigFile();
    void WaveFile(char *FileName);
    void GetWaveSamples(float *samples, UINT length);
    void SkipIntoFile(long Skip);
    void CloseWaveFile();
    unsigned long GetWaveSize() {return(WaveSize);};
    unsigned long GetWaveRead() {return(WavRead);};
    PCMWAVEFORMAT WaveFormat;
};
//------------------------------------------------------------------//Function Declarations---------------------------------------------//------------------------------------------------------------------WigFile::WigFile() { } //------------------------------------------------------------------WigFile::~WigFile() { } //------------------------------------------------------------------void WigFile::WaveFile(char *FileName) { FileHandle = mmioOpen(FileName,NULL, MMIO_READ|MMIO_ALLOCBUF); if(FileHandle==NULL){ return; } CkInfo.fccType=mmioFOURCC('W','A','V','E'); if(mmioDescend(FileHandle,&CkInfo, NULL,MMIO_FINDRIFF)) { mmioClose(FileHandle,0); ShowMessage("Invalid WaveFormat for file: " + *FileName); } CkSubInfo.ckid = mmioFOURCC('f','m','t',' '); if(mmioDescend(FileHandle,&CkSubInfo, &CkInfo,MMIO_FINDCHUNK)) { mmioClose(FileHandle,0); ShowMessage("Invalid Format Chunk for file: " - 338 - Appendix + *FileName); } unsigned long n = CkSubInfo.cksize; mmioRead(FileHandle,(LPSTR)&WaveFormat,n); if(WaveFormat.wf.wFormatTag!=WAVE_FORMAT_PCM) { mmioClose(FileHandle,0); ShowMessage(*FileName + " is not a Wave File!"); } mmioAscend(FileHandle,&CkSubInfo,0); CkSubInfo.ckid = mmioFOURCC('d','a','t','a'); if(mmioDescend(FileHandle,&CkSubInfo, &CkInfo,MMIO_FINDCHUNK)) { mmioClose(FileHandle,0); ShowMessage("Could not descend into data chunk: " + *FileName); } WavRead = 0; WaveSize = CkSubInfo.cksize; InitialOffset = CkSubInfo.dwDataOffset; } //------------------------------------------------------------------void WigFile::GetWaveSamples(float *samples, UINT length) { long c1; short *buf = new short[length]; //Offset file reading by Pos bytes if(FileHandle) { c1 = mmioRead(FileHandle,(char *)buf,length * 2); //Increase wavefile position counter if(c1<=0) WavRead=WaveSize; else WavRead+=c1; if(WavRead<WaveSize) { for(int i=0;i<c1/2;i++) { samples[i] = (float)(buf[i]); } for(int i=c1/2;i<length;i++) { samples[i] = 0.0f; } } if(c1<=0) { if(FileHandle) { mmioClose(FileHandle,0); FileHandle = NULL; } } } else { for(int i=0;i<length;i++) { - 339 - Appendix samples[i] = 0.0f; } } delete[] buf; } //------------------------------------------------------------------void WigFile::SkipIntoFile(long Skip) { long res = mmioSeek(FileHandle,Skip + InitialOffset,SEEK_SET); WavRead = res - InitialOffset; } void WigFile::CloseWaveFile() { if(FileHandle) mmioClose(FileHandle,0); FileHandle=NULL; } #endif - 340 - Appendix //------------------------------------------------------------------//---------------------------WIGAMBI.H------------------------------//------------------------------------------------------------------#ifndef WigAmbiH #define WigAmbiH #include <math.h> #include <mmsystem.h> #ifndef nsp_UsesTransform extern "C" { #define nsp_UsesTransform #include "nsp.h" } #endif struct AmbiBuffer { float *W,*X,*Y,*Z,*R,*S,*T,*U,*V; int Length; bool Order; }; void DeInterlace(WAVEHDR *,float **,int NoOfChannels); void ReInterlace(WAVEHDR *,float **,int NoOfChannels); void BGain(AmbiBuffer *,float Gain); void BRotate(AmbiBuffer *,float RadAngle); void BTilt(AmbiBuffer *,float RadAngle); void Mono2B(float *Mono,AmbiBuffer *,float RadAzim, float RadElev); void BPlusB(AmbiBuffer *,AmbiBuffer *); void AssignChannel(AmbiBuffer *,float *,char); AmbiBuffer * AmbiAllocate(int Length,bool Order,bool WithChannels); void AmbiFree(AmbiBuffer *); float ** AllocDecodeArray(int NoOfSpeakers,bool Order); float ** AllocSampleBuffer(int Channels,int BufferLength); void FreeDecodeArray(float **,bool Order); void FreeSampleBuffer(float **,int 
Channels); void DecoderCalc(float *Azim,float *Elev,int NoOfSpeakers,bool Order, float WGain,float **Gains); void B2Speakers(float **SGains,AmbiBuffer *Ambi, float **Samples, int NoOfChannels,int NoOfSpeakers,bool Order); float MaxSample(float *Samples,int BufferLength); void MaxSample(WAVEHDR *,float *,int BufferLength,int NoOfChannels); //---------------------------------------------------------------float MaxSample(float *Samples,int BufferLength) { float Max=0; for(int i=0;i<BufferLength;i++) if(Max<Samples[i]) Max=Samples[i]; return (Max); } //---------------------------------------------------------------void MaxSample(WAVEHDR *pWaveHeader,float *Max,int BufferLength, int NoOfChannels) { for(int i=0;i<NoOfChannels;i++) Max[i]=0; short *Data=(short *)pWaveHeader->lpData; for(int i=0;i<BufferLength;i++) { for(int j=0;j<NoOfChannels;j++) { - 341 - Appendix if(Max[j]<(float)Data[j]) Max[j]=(float)Data[j]; } Data+=NoOfChannels; } } //---------------------------------------------------------------void DeInterlace(WAVEHDR *WaveBuffer,float **Samples, int NoOfChannels) { //Sort out channels short *Buffer = (short *)WaveBuffer->lpData; int count=0; for(unsigned int i=0; i<WaveBuffer->dwBufferLength/(2*NoOfChannels);i++) { for(int j=0;j<NoOfChannels;j++) { Samples[j][i]=Buffer[count++]; } } } //---------------------------------------------------------------void ReInterlace(WAVEHDR *WaveBuffer,float **Samples, int NoOfChannels) { //Sort out channels short *Buffer = (short *)WaveBuffer->lpData; int count=0; for(unsigned int i=0; i<WaveBuffer->dwBufferLength/(2*NoOfChannels);i++) { for(int j=0;j<NoOfChannels;j++) { Buffer[count++]=(short)Samples[j][i]; } } } //---------------------------------------------------------------void BRotate(AmbiBuffer *a,float RadAngle) { float x,y; float s = sin(RadAngle); float c = cos(RadAngle); for(int i=0;i<a->Length;) { x = a->X[i] * c + a->Y[i] * s; y = a->Y[i] * c + a->X[i] * s; a->X[i] = x; a->Y[i] = y; i++; } } void BTilt(AmbiBuffer *a,float RadAngle) { float x,z; float s = sin(RadAngle); float c = cos(RadAngle); for(int i=0;i<a->Length;) { x = a->X[i] * c - a->Z[i] * s; z = a->Z[i] * c + a->X[i] * s; - 342 - Appendix a->X[i] = x; a->Z[i] = z; i++; } } void BGain(AmbiBuffer *Ambi, float Gain) { if(Ambi->Order) { for(int i=0;i<Ambi->Length;i++) { Ambi->W[i]*=Gain; Ambi->X[i]*=Gain; Ambi->Y[i]*=Gain; Ambi->Z[i]*=Gain; Ambi->R[i]*=Gain; Ambi->S[i]*=Gain; Ambi->T[i]*=Gain; Ambi->U[i]*=Gain; Ambi->V[i]*=Gain; } } else { for(int i=0;i<Ambi->Length;i++) { Ambi->W[i]*=Gain; Ambi->X[i]*=Gain; Ambi->Y[i]*=Gain; Ambi->Z[i]*=Gain; } } } //---------------------------------------------------------------void Mono2B(float *Mono,AmbiBuffer *Ambi,float RadAzim, float RadElev) { float SinA=sin(RadAzim); float CosA=cos(RadAzim); float SinE=sin(RadElev); float CosE=cos(RadElev); float Sin2E=sin(2*RadElev); float Sin2A=sin(2*RadAzim); float Cos2A=cos(2*RadAzim); float Sample,Gain[9]; Gain[0] = 0.70710678119f; Gain[1] = CosA * CosE; Gain[2] = SinA * CosE; Gain[3] = SinE; if(Ambi->Order) { Gain[4] = 1.5f*SinE*SinE-0.5f; Gain[5] = CosA*Sin2E; Gain[6] = SinA*Sin2E; Gain[7] = Cos2A*CosE*CosE; Gain[8] = Sin2A*CosE*CosE; for(int i=0;i<Ambi->Length;i++) { Sample=Mono[i]; Ambi->W[i]=Sample*Gain[0]; - 343 - Appendix Ambi->X[i]=Sample*Gain[1]; Ambi->Y[i]=Sample*Gain[2]; Ambi->Z[i]=Sample*Gain[3]; Ambi->R[i]=Sample*Gain[4]; Ambi->S[i]=Sample*Gain[5]; Ambi->T[i]=Sample*Gain[6]; Ambi->U[i]=Sample*Gain[7]; Ambi->V[i]=Sample*Gain[8]; } } else { for(int i=0;i<Ambi->Length;i++) { Sample=Mono[i]; 
Ambi->W[i]=Sample*Gain[0]; Ambi->X[i]=Sample*Gain[1]; Ambi->Y[i]=Sample*Gain[2]; Ambi->Z[i]=Sample*Gain[3]; } } } //---------------------------------------------------------------void BPlusB(AmbiBuffer *Ambi1,AmbiBuffer *Ambi2) { if(Ambi1->Order && Ambi2->Order) { for(int i=0;i<Ambi1->Length;i++) { Ambi2->W[i]+=Ambi1->W[i]; Ambi2->X[i]+=Ambi1->X[i]; Ambi2->Y[i]+=Ambi1->Y[i]; Ambi2->Z[i]+=Ambi1->Z[i]; Ambi2->R[i]+=Ambi1->R[i]; Ambi2->S[i]+=Ambi1->S[i]; Ambi2->T[i]+=Ambi1->T[i]; Ambi2->U[i]+=Ambi1->U[i]; Ambi2->V[i]+=Ambi1->V[i]; } } else { for(int i=0;i<Ambi1->Length;i++) { Ambi2->W[i]+=Ambi1->W[i]; Ambi2->X[i]+=Ambi1->X[i]; Ambi2->Y[i]+=Ambi1->Y[i]; Ambi2->Z[i]+=Ambi1->Z[i]; } } } //---------------------------------------------------------------AmbiBuffer * AmbiAllocate(int Length,bool Order,bool WithChannels) { AmbiBuffer *Ambi; Ambi = new AmbiBuffer; if(WithChannels) { Ambi->W = new float[Length]; memset(Ambi->W,0,sizeof(float)*Length); Ambi->X = new float[Length]; - 344 - Appendix memset(Ambi->X,0,sizeof(float)*Length); Ambi->Y = new float[Length]; memset(Ambi->Y,0,sizeof(float)*Length); Ambi->Z = new float[Length]; memset(Ambi->Z,0,sizeof(float)*Length); if(Order) { Ambi->R = new float[Length]; Ambi->S = new float[Length]; Ambi->T = new float[Length]; Ambi->U = new float[Length]; Ambi->V = new float[Length]; } } Ambi->Length=Length; Ambi->Order=Order; return(Ambi); } //---------------------------------------------------------------void AmbiFree(AmbiBuffer *Ambi) { if(Ambi->W) delete [] Ambi->W; if(Ambi->X) delete [] Ambi->X; if(Ambi->Y) delete [] Ambi->Y; if(Ambi->Z) delete [] Ambi->Z; if(Ambi->R && Ambi->Order) delete [] Ambi->R; if(Ambi->S && Ambi->Order) delete [] Ambi->S; if(Ambi->T && Ambi->Order) delete [] Ambi->T; if(Ambi->U && Ambi->Order) delete [] Ambi->U; if(Ambi->V && Ambi->Order) delete [] Ambi->V; delete Ambi; } //---------------------------------------------------------------void AssignChannel(AmbiBuffer *Ambi,float *Samples,char Channel) { switch (Channel) { case 'W': Ambi->W=Samples; break; case 'X': Ambi->X=Samples; break; case 'Y': Ambi->Y=Samples; break; case 'Z': Ambi->Z=Samples; break; case 'R': Ambi->R=Samples; break; case 'S': Ambi->S=Samples; break; case 'T': Ambi->T=Samples; break; case 'U': Ambi->U=Samples; break; - 345 - Appendix case 'V': Ambi->V=Samples; break; default: break; } } //---------------------------------------------------------------float ** AllocSampleBuffer(int Channels, int BufferLength) { float **Samples; int Rows,Cols; Rows=Channels; Cols = BufferLength; Samples = new float*[Rows]; for (int i=0;i<Rows;i++) Samples[i] = new float[Cols]; return(Samples); } //---------------------------------------------------------------void FreeSampleBuffer(float **Samples,int Channels) { int Rows; Rows = Channels; for (int i = 0; i < Rows; i++) delete[] Samples[i]; delete[] Samples; } //---------------------------------------------------------------float ** AllocDecodeArray(int NoOfSpeakers,bool Order) { float **Gains; int Rows,Cols; Order?Rows=9:Rows=4; Cols = NoOfSpeakers; Gains = new float*[Rows]; for (int i=0;i<Rows;i++) Gains[i] = new float[Cols]; return (Gains); } //---------------------------------------------------------------void FreeDecodeArray(float **Gains,bool Order) { int Rows; Order?Rows=9:Rows=4; for (int i = 0; i < Rows; i++) delete[] Gains[i]; delete[] Gains; } //---------------------------------------------------------------void DecoderCalc(float *Azim,float *Elev,int NoOfSpeakers,bool Order, float WGain, float **Gains) { float 
SinA,CosA,SinE,CosE,Sin2E,Sin2A,Cos2A;
    if(Order)
    {
        //Create 2 dimensional coefs array
        for(int i=0;i<NoOfSpeakers;i++)
        {
            SinA=sin(Azim[i]);
            CosA=cos(Azim[i]);
            SinE=sin(Elev[i]);
            CosE=cos(Elev[i]);
            Sin2E=sin(2*Elev[i]);
            Sin2A=sin(2*Azim[i]);
            Cos2A=cos(2*Azim[i]);

            Gains[0][i] = 0.5*(WGain);
            Gains[1][i] = 0.5*(CosA * CosE);
            Gains[2][i] = 0.5*(SinA * CosE);
            Gains[3][i] = 0.5*(SinE);
            Gains[4][i] = 0.5*(1.5f*SinE*SinE-0.5f);
            Gains[5][i] = 0.5*(CosA*Sin2E);
            Gains[6][i] = 0.5*(SinA*Sin2E);
            Gains[7][i] = 0.5*(Cos2A*CosE*CosE);
            Gains[8][i] = 0.5*(Sin2A*CosE*CosE);
        }
    }
    else
    {
        for(int i=0;i<NoOfSpeakers;i++)
        {
            SinA=sin(Azim[i]);
            CosA=cos(Azim[i]);
            SinE=sin(Elev[i]);
            CosE=cos(Elev[i]);

            Gains[0][i] = 0.5*(WGain);
            Gains[1][i] = 0.5*(CosA * CosE);
            Gains[2][i] = 0.5*(SinA * CosE);
            Gains[3][i] = 0.5*(SinE);
        }
    }
}
//----------------------------------------------------------------
void B2Speakers(float **SGains,AmbiBuffer *Ambi,
                float **Samples,int NoOfChannels,
                int NoOfSpeakers,bool Order)
{
    for(int i=0;i<Ambi->Length;i++)
    {
        for(int j=0;j<NoOfSpeakers && j<NoOfChannels;j++)
        {
            if(Order)
            {
                Samples[j][i]=Ambi->W[i]*SGains[0][j]
                             +Ambi->X[i]*SGains[1][j]
                             +Ambi->Y[i]*SGains[2][j]
                             +Ambi->Z[i]*SGains[3][j]
                             +Ambi->R[i]*SGains[4][j]
                             +Ambi->S[i]*SGains[5][j]
                             +Ambi->T[i]*SGains[6][j]
                             +Ambi->U[i]*SGains[7][j]
                             +Ambi->V[i]*SGains[8][j];
            }
            else
            {
                Samples[j][i]=Ambi->W[i]*SGains[0][j]
                             +Ambi->X[i]*SGains[1][j]
                             +Ambi->Y[i]*SGains[2][j]
                             +Ambi->Z[i]*SGains[3][j];
            }
        }
    }
}
#endif
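The following is a minimal usage sketch, not part of the application listed above, showing how the WigAmbi.h helpers fit together: a mono buffer is panned to first-order B-format with Mono2B, the sound field is rotated with BRotate, and the block is decoded to speaker feeds with DecoderCalc and B2Speakers. The 512-sample block length is arbitrary; the eight-speaker horizontal ring and the W gain of sqrt(2) mirror the defaults set up in WigSound2::InitMem, and the headers above are assumed to be on the include path.

//-------------------------------------------------------------------
// Illustrative sketch only (not original thesis code): pan, rotate and
// decode one block of first-order B-format using the WigAmbi.h helpers.
//-------------------------------------------------------------------
#include <math.h>
#include <string.h>
#include "WigAmbi.h"

int main()
{
    const int Len = 512;            // illustrative block length
    const int NoOfSpeakers = 8;     // horizontal ring, as in InitMem

    //Mono test signal (a single impulse)
    float *mono = new float[Len];
    memset(mono,0,sizeof(float)*Len);
    mono[0] = 1.0f;

    //Encode to first-order B-format at 45 degrees azimuth, 0 elevation,
    //then rotate the whole sound field by a further 30 degrees
    AmbiBuffer *ABuf = AmbiAllocate(Len,0,1);
    Mono2B(mono,ABuf,(float)(M_PI/4.0),0.0f);
    BRotate(ABuf,(float)(M_PI/6.0));

    //Speaker positions and first-order decoder gains (layout as in InitMem)
    float SAzim[NoOfSpeakers],SElev[NoOfSpeakers];
    for(int i=0;i<NoOfSpeakers;i++)
    {
        SElev[i]=0.0f;
        SAzim[i]=(float)(M_PI/NoOfSpeakers + i*2.0*M_PI/NoOfSpeakers);
    }
    float **Decode = AllocDecodeArray(NoOfSpeakers,0);
    DecoderCalc(SAzim,SElev,NoOfSpeakers,0,(float)sqrt(2.0),Decode);

    //Decode the B-format block to the eight speaker feeds
    float **Samples = AllocSampleBuffer(NoOfSpeakers,Len);
    B2Speakers(Decode,ABuf,Samples,NoOfSpeakers,NoOfSpeakers,0);

    //Tidy up
    FreeSampleBuffer(Samples,NoOfSpeakers);
    FreeDecodeArray(Decode,0);
    AmbiFree(ABuf);
    delete[] mono;
    return 0;
}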