A Methodology for Virtualizing Complex Sound Sources into 6DoF Recordings
Jeffrey M. Clark
Ball State University
jmclark6@bsu.edu
ABSTRACT
Recording sound sources for spatially adaptive, immersive
environments can present a problem in cases where the listener is able to move into close enough proximity to the
sound source that the source is no longer a point source.
This paper presents a recording and encoding methodology for representing these complex sound sources in six-degree-of-freedom immersive environments in a way that allows them to retain their perceptual size based on the listener's distance.
This paper also suggests a method for using statistical directivity data to retain a sound source’s frequency-domain
signature based on its directivity characteristics and its rotation relative to the listener. It also suggests a method
for calculating the damping of a sound based on a listener's distance that accounts for the perceptual size of the source as it approaches, or diverges from, appearing as a point source. The damping and directivity methods suggested can be applied to simple sound sources as well, allowing the methodology to be applied to all the virtualized sound objects in a scene using the same approach.
1. INTRODUCTION
1.1 Complex Sound Sources
Recording a sound source using a single microphone and
then spatializing it encodes it as a point source within the
listener’s perceptual soundfield. In many use cases, this
is not a problem: the sound source is simple in nature, the listener cannot get close enough to the source to expose the discrepancy between the perceived and ideal perceptual size of the sound source, or the added level of auditory detail is undesired [1]. However, in some cases the user is able to gain proximity and the illusion is threatened.
Depending on the use case, the usual application of a
stereo or binaural recording technique [2] allows a larger
sound source to retain some of its detail through phantom
imaging. In other cases, stereo phantom imaging is not sufficient, and a more robust technique is desired for sound sources of this complexity.
An example of a complex sound source case would be a
harp in the center of a scene. As the listener approaches
the instrument in the scene, the perceptual size of the instrument grows. Due to the size of the instrument and the complex nature of its parts, it is not possible to set up a stereo pair of microphones that captures the required level of detail in the soundboard, strings, and column while also allowing the listener to discern the appropriate localization of those details on the instrument.

Copyright: © 2020 Jeffrey M. Clark et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License 3.0 Unported, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
1.2 Conceptual Framework
Emitter level weighting can be used for simple stereo imaging, as well as for more complex use cases in 3D
[3]. The weighting of source arrays has also been used
within immersive audio to assist in highlighting detail for
a listener [4]. Building on this idea, the methodology in
this paper considers a virtual sound source (referred to as
an ‘’instrument“) as a collection of related emitters, which
allows for more spatial detail of the sound source to be
represented within the virtual scene. For a simple sound
source, a single emitter may be sufficient; however, more
complex sources may require more emitters. The emitters
of an instrument are used to define the acoustic perception
of the instrument.
Procedurally, interpolation between the emitters provides
the spatialization in the near field as the perceptual size of
the instrument becomes larger than a point source. The
listener’s distance from the instrument provides the damping level as the listener leaves the near-field. Finally, the
rotation of the instrument and its facing relative to the listener determines the frequency filtering based on the instrument’s statistical directivity data.
2. METHODOLOGY
2.1 Recording
Emitter recording should be done in a controlled environment of sufficient size and/or acoustic treatment so that
reflections from the room do not interfere in the recording process. Because the spatial image of the instrument
is constructed from the composite of the recorded emitter points, care should be taken in microphone choice to
prevent the characteristics of one microphone from overly
influencing critical frequency bands and disrupting the spatial image.
As the instrument is theoretically freely rotatable in three dimensions relative to the listener, it is possible for the listener to position themselves in a way that allows emitters to form a line and phase interferences to manifest.
Thus, it is imperative that the positions of microphones
be chosen with the potential for destructive phase interference under consideration. The placement of microphones
should also take into consideration coverage of the instrument (to prevent perceptual gaps) and points of acoustic
interest.
To facilitate perceptually coherent interpolation between the recorded emitters, it is necessary to calibrate the microphones so that their sensitivity and gain staging are matched. If using different microphone models or pre-amplifiers, it is not sufficient to simply match the gains. If the data for the recording equipment is known and reliable, then the gain settings can be derived mathematically. Otherwise, it is advisable to set the gain staging using a noise generator placed equidistant from each microphone so that the recorded input levels are the same.

Figure 1. Processing block example (signal input → emitter gain scale → distance attenuation → directivity filtering → spatial encoding → to soundfield)

2.2 Emitter Interpolation

At near-field distances, interpolation and localization of the emitters will determine the quality of the instrument's sound within the listener's soundfield. The concept of an instrument as a collection of emitters lends itself well to the application of a modified version of Distance-Based Amplitude Panning (DBAP) [5]. DBAP also has the benefit of equalizing the levels of the emitters as the distance between the listener and the emitters increases.

Following the DBAP algorithm, first the distance between the listener and each emitter must be calculated. The radial distance ri of the ith emitter, which has coordinates (xi, yi, zi), from the listener L is found using the usual distance equation. It is additionally preferable to add a radius to the listener, rL. This prevents a single emitter from overloading the entire perceptual image of the instrument at very close distances, and also prevents a potential division-by-zero error. Initial testing has shown that setting rL to a distance approximating the interaural distance yields acceptable perceptual results and serves as a good point of departure for further refinement.

ri = sqrt((xi − xL)^2 + (yi − yL)^2 + (zi − zL)^2 + rL^2)    (1)

For the coefficient a, the distance-ratio damping variable R needs to be defined. Since the methodology assumes that the reverberant field is being handled separately, it can be assumed that R = 6.

a = 10^(−R/20)    (2)

The k coefficient can be calculated using the emitter distances and a.

k = 2a / sqrt( Σ_{i=0}^{N} ri^(−2) )    (3)

This allows the calculation of the volume scalar for each emitter, vi.

vi = k / (2 ri a)    (4)

As can be seen, if rL = 0 and L occupies the same position as emitter i, then ri = 0, causing vi to return an error. In this case the error will need to be accountedted for in code.

2.3 Distance Damping

The distance between the listener and the instrument, rI, can be taken as either the distance to the closest emitter or some appropriate point on the virtual visual asset representing the instrument. This distance can be found using the typical distance equation. The instrument's perceptual size sI can be found by finding the widest angle between any two emitters. If emitters i and j are the two widest emitters, then sI can be found:

sI = |αi − αj|    (5)

Distance damping in a free field is usually given as [6]:

Lp = LW − 20 log10 (r / r0)    (6)

where LW is the sound power level measured near the source (usually at 1 m), r is the listener's distance, and r0 is the reference distance at which LW was measured. The audio recording encodes the LW and r0 information; thus a damping factor for the instrument, vI, in decibels can be found as:

vI = −20 log10 rI    (7)
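As a minimal sketch of the emitter gain stage (Equations 1–4) — not the paper's implementation; the function name, variable names, and the default rL of 0.17 m are assumptions — the computation might look like:

```python
import math

def emitter_gains(emitters, listener, r_listener=0.17, R=6.0):
    """Per-emitter volume scalars vi (Eqs. 1-4).

    emitters: list of (x, y, z) positions; listener: (x, y, z).
    r_listener is rL; a value near the interaural distance keeps a
    single emitter from dominating at very close range and guards the
    division in Eq. 4 when the listener sits exactly on an emitter.
    """
    lx, ly, lz = listener
    # Eq. 1: radial distances, padded by the listener radius rL
    r = [math.sqrt((x - lx) ** 2 + (y - ly) ** 2 + (z - lz) ** 2
                   + r_listener ** 2) for (x, y, z) in emitters]
    a = 10.0 ** (-R / 20.0)                             # Eq. 2
    k = 2.0 * a / math.sqrt(sum(ri ** -2 for ri in r))  # Eq. 3
    return [k / (2.0 * ri * a) for ri in r]             # Eq. 4
```

Note that for a single emitter the scalar reduces to unity regardless of distance, and with rL > 0 the division-by-zero case noted above cannot occur.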
To account for the instrument's divergence from being a point source in the near field, the vI equation can be rewritten using sI to interpolate between a −3 dB per distance doubling in the near field and a −6 dB per distance doubling as the perceptual size approaches a point source:

vI = −(20 − 10·(sI/π)) log10 rI    (8)

It should be noted that Equation 8 assumes that sI is given in radians. If the program in use returns degrees, then sI/π should be replaced with sI/180.
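As an illustrative sketch of Equation 8 (the function name is an assumption; sI is taken in radians):

```python
import math

def size_aware_damping_db(r_instrument, s_instrument):
    """Eq. 8: damping in dB interpolating between -3 dB per distance
    doubling while the source is perceptually large (s_instrument near
    pi) and -6 dB per doubling as it approaches a point source
    (s_instrument near 0). s_instrument is in radians; divide by 180
    instead of pi when working in degrees."""
    return -(20.0 - 10.0 * s_instrument / math.pi) * math.log10(r_instrument)
```

At rI = 2 this yields roughly −6 dB for a point source (sI = 0) and −3 dB for a source spanning π radians.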
2.4 Directivity Factor
The statistical directivity factor for an instrument can be
incorporated into the calculation. This information can be
taken from existing measurements, or new measurements
can be made and used [7]. Applying the directivity index DI is additive with the distance damping [6]; however, it may be desirable for some complex instruments to apply the directivity index only as the listener leaves the near field, in order to preserve the detail of the emitters.
vI = −(20 − 10·(sI/π)) log10 rI + DI·(sI/π)    (9)
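The direction-dependent lookup of DI can be sketched with bilinear interpolation over a table of measured indices. The table layout, the regular step sizes, and the function name below are assumptions for illustration, not specifics from the paper:

```python
def directivity_index(table, az_step, el_step, az, el):
    """Bilinearly interpolate a directivity-index table.

    table[i][j] holds DI (dB) measured at azimuth i*az_step and
    elevation j*el_step (degrees); az and el are assumed >= 0 and
    within the table's range.
    """
    i, j = az / az_step, el / el_step
    i0, j0 = int(i), int(j)
    i1 = min(i0 + 1, len(table) - 1)
    j1 = min(j0 + 1, len(table[0]) - 1)
    fi, fj = i - i0, j - j0
    # Interpolate along elevation at the two bracketing azimuths,
    # then along azimuth between those results.
    top = table[i0][j0] * (1 - fj) + table[i0][j1] * fj
    bot = table[i1][j0] * (1 - fj) + table[i1][j1] * fj
    return top * (1 - fi) + bot * fi
```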
2.5 Frequency Dependence
A more accurate representation of an instrument’s directivity index can be made by applying it to multiple frequency
bands. Including frequency dependence also allows for the inclusion of air absorption by distance for the sound source. Air absorption can be applied by multiplying the radial distance between the listener and the instrument, rI, by the absorption coefficient of air, A. For a specified frequency f, the frequency-specific volume scalar (in decibels) vIf can be found:
vIf = −(20 − 10·(sI/π)) log10 rI + DIf·(sI/π) + rI·Af    (10)

(Be sure that the absorption coefficient used matches the units used for distance.)

Frequency dependence can be applied through the use of a filter bank. The center frequencies of each filter can be set based on the directivity indices of the instrument. The index for any given direction can be calculated for any frequency through bilinear interpolation on a table of indices sorted by azimuth and elevation of rotation.

It is possible to use the filter bank to perform all of the gain adjustments, if applied to each emitter as vif (noting that vi is converted to decibels as 20 log10 vi):

vif = vi − (20 − 10·(sI/π)) log10 rI + DIf·(sI/π) + rI·Af    (11)

Alternatively, vI and vi can be applied as scalars to each emitter, with the filter bank only used to apply the frequency-dependent DIf and rI·Af.

3. PROOF OF CONCEPT

As a proof of concept, three recordings of two different instruments were made using techniques taken from this method. As examples of complex instruments, a piano and a harp were chosen for recording. Each instrument was encoded in a virtual scene that allowed the listener spatial freedom to move around the instrument in the virtual space. An additional virtual scene was created for the piano putting the listener "inside" of the piano, allowing the listener to walk around on the soundboard of the instrument.

The visual scenes were created in Unity. Fmod was used for audio handling, with C# scripting in Unity used to automate parameters passed to Fmod according to the outlined methodology. Spatialization was done using Google Resonance and the Resonance-Fmod integration. Testing was done using both HTC Vive and Oculus Quest head-mounted displays, with external studio headphones.

3.1 Harp

The harp recording was done using an array of true, small-diaphragm omnidirectional microphones. Due to the complexity of the harp's acoustic radiation, Meyer does not provide statistical directivity information [6], and a larger search of the literature was not able to produce broadband measurements. An attempt was made to gain the desired directivity data from the harp using an array of microphones; however, the recording space available was too small and room reflections rendered the captured directivity data unusable.

The main recording microphones were placed near the top and bottom of the column, towards the end of the arch near the top of the soundboard, at the center of the soundboard, and one microphone towards the center of the strings. After testing, it was determined that a higher microphone density would be desirable for future recordings, with additional coverage along the arch, soundboard, and strings. The microphones were placed 6 in from the harp and were gain-matched digitally.
Figure 2. Emitter locations on harp
The emitters were placed in the virtual harp based on the
real microphone locations. The recording was reviewed by multiple trained listeners and the feedback was positive.
One listener, a harpist, commented that the experience of
moving around the instrument mirrored their expectations
gained from interactions with real instruments. The spatial
imaging of the sounds on the harp was well-received, although several perceptual holes and weaknesses were discovered that could be revised in future recordings. The
sensation of “walking away” from the instrument was preserved, although the lack of a reverberant field was commented on by one reviewer.
3.2 Piano
The piano recording was made using a mesh of high-end, true omnidirectional condenser microphones. The microphones were placed 8 in from the strings. Due to the larger
number of microphones used, multiple models of microphone and preamplifier were used, and the calibration process had to be carefully considered. The microphones were
placed around the piano, balancing points of interest within
each area against the desire to have adequate and roughly
even coverage.
The calibration was done digitally, by recording white
noise from a generator into each microphone and setting
the gain digitally (as a matter of logistics). White noise
was used based on the predicted variation in the frequency
response between microphone and preamplifier models.
Due to differences in proportion between the piano used
for the recording and the virtual piano model, the emitter
locations were placed perceptually using an iterative listening process. The final emitter array became more regular
than the original microphone placement.
Figure 3. View inside piano, showing emitter locations

During listening sessions, the effect of the piano at close proximity was effective. Listeners were able to put their heads inside of the piano's lid and hear the resonances throughout the soundboard. Feedback from listeners was positive. However, the recorded material lacked virtuosity and variety of texture, and further listening with different material may yield different results.

Additionally, a modified version of this algorithm was used with the piano recording. The piano was scaled, and the listener was moved to the inside of the instrument, giving the sensation of walking around the soundboard. The emitters were raised slightly during listening, and the radius of the listener, rL, was scaled proportionately to the piano.

While the listener remained inside of the emitter array, the effect was convincing. As the listener moved towards the edges of the mesh, perceptual gaps appeared. This was compensated for by dynamically setting the angular "size" of each emitter within the soundfield based on the angular and radial distance of the other emitters around it. An angular gating method was implemented that would damp the output of emitters that were occluded by closer emitters, based on the angular resolution of the ambisonic layer. These improvements aided, but did not eliminate, the perceptual gaps. Furthermore, the lack of any sound beyond the emitters was apparent as the listener approached or passed the edge of the emitter array.

It was theorized that an additional set of microphones placed at the edges of the piano would aid in extending the boundary past the listener's natural area of motion. This was not implemented or tested due to time constraints.

4. CONCLUSIONS

Preliminary testing for this methodology has been encouraging. While this method is not light-weight, it does have perceptual benefits for critical listening applications within spatially adaptive immersive recordings. Implementation in virtual reality environments can be done using commercially available tools. Future planning for this methodology is focused on the creation of highly detailed audio recordings where the listener is able to freely translate their position relative to the instruments within the virtual space. Additionally, collaborations with composers are being developed to create new immersive musical experiences that explore the placement and movement of sound sources within immersive environments.

Further testing needs to be done with a wider variety of sound sources and materials, and a series of best practices for varying instruments and conditions needs to be developed. Also, the current implementations lacked any significant reverberation and room modeling. Experimentation needs to be done to identify best practices for creating aesthetically desirable reverberant fields as a part of the immersive experience.

Acknowledgments
I would like to thank harpist Annie King and pianist Kelsea
Batson for offering their time and talent to play for the
recordings. I would also like to thank Dr. Michael Pounds
for his technical advice and feedback throughout the process, and Jeff Seitz for offering feedback regarding microphone selection and placement and his aid in securing the
equipment and spaces needed for the recording sessions.
5. REFERENCES
[1] W.-P. Brinkman, A. Hoekstra, and R. van Egmond, "The effect of 3D audio and other audio techniques on virtual reality experience," Studies in Health Technology and Informatics, vol. 219, pp. 44–48, 2015.
[2] W. Zhang, P. Samarasinghe, H. Chen, and T. Abhayapala, “Surround by sound: A review of spatial audio
recording and reproduction,” Applied Sciences, vol. 7,
no. 5, 2017.
[3] T. Kimura and H. Ando, “3d audio system using multiple vertical panning for large-screen multiview 3d
video display,” ITE Transactions on Media Technology
and Applications, vol. 2, no. 1, pp. 33–45, 2014.
[4] J. Janer, E. Gomez, A. Martorell, M. Miron, and
B. de Wit, “Immersive orchestras: Audio processing
for orchestral music vr content,” in 8th International
Conference on Games and Virtual Worlds for Serious
Applications, 2016, Conference Proceedings.
[5] T. Lossius, P. Baltazar, and T. de la Hogue, “Dbap
- distance-based amplitude panning,” in ICMC, 2009,
pp. 489–492.
[6] J. Meyer, Acoustics and the Performance of Music,
5th ed., ser. Modern Acoustics and Signal Processing.
Springer, 2009.
[7] J. Pätynen and T. Lokki, "Directivities of symphony orchestra instruments," Acta Acustica united with Acustica, vol. 96, pp. 138–167, 2010.