A Methodology for Virtualizing Complex Sound Sources into 6DoF Recordings
Jeffrey M. Clark
Ball State University
jmclark6@bsu.edu
ABSTRACT
Recording sound sources for spatially adaptive, immersive
environments can present a problem in cases where the listener is able to move into close enough proximity to the
sound source that the source is no longer a point source.
This paper presents a recording and encoding methodology for representing these complex sound sources in six-degree-of-freedom immersive environments in a way that allows them to retain their perceptual size based on the listener's distance.
This paper also suggests a method for using statistical directivity data to retain a sound source’s frequency-domain
signature based on its directivity characteristics and its rotation relative to the listener. It also suggests a method
for calculating the damping of a sound based on a listener's distance that accounts for the perceptual size of the source as it approaches, or diverges from, appearing as a point source. The damping and directivity methods suggested can be applied to simple sound sources as well, allowing the methodology to be applied to all the virtualized sound objects in a scene using the same approach.
1. INTRODUCTION
1.1 Complex Sound Sources
Recording a sound source using a single microphone and
then spatializing it encodes it as a point source within the
listener’s perceptual soundfield. In many use cases, this
is not a problem: the sound source is simple in nature, the listener cannot get close enough to the source to expose the discrepancy between the perceived and ideal perceptual size of the sound source, or the added level of auditory detail is undesired [1]. However, in some cases the user is able to gain proximity and the illusion is threatened.
Depending on the use case, the usual application of a
stereo or binaural recording technique [2] allows a larger
sound source to retain some of its detail through phantom
imaging. In other cases, stereo phantom imaging is not sufficient, and a more robust technique is desired for sound sources of this complexity.
An example of a complex sound source case would be a
harp in the center of a scene. As the listener approaches
the instrument in the scene, the perceptual size of the instrument grows. Due to the size of the instrument and the complex nature of its parts, it is not possible to set up a stereo pair of microphones that captures the required level of detail in the soundboard, strings, and column while also allowing the listener to discern the appropriate localization of those details on the instrument.

Copyright: © 2020 Jeffrey M. Clark et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License 3.0 Unported, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
1.2 Conceptual Framework
Emitter level weighting can be used for simple stereo imaging, as well as for more complex use cases in 3D
[3]. The weighting of source arrays has also been used
within immersive audio to assist in highlighting detail for
a listener [4]. Building on this idea, the methodology in
this paper considers a virtual sound source (referred to as
an ‘’instrument“) as a collection of related emitters, which
allows for more spatial detail of the sound source to be
represented within the virtual scene. For a simple sound
source, a single emitter may be sufficient; however, more
complex sources may require more emitters. The emitters
of an instrument are used to define the acoustic perception
of the instrument.
Procedurally, interpolation between the emitters provides
the spatialization in the near field as the perceptual size of
the instrument becomes larger than a point source. The
listener’s distance from the instrument provides the damping level as the listener leaves the near-field. Finally, the
rotation of the instrument and its facing relative to the listener determines the frequency filtering based on the instrument’s statistical directivity data.
2. METHODOLOGY
2.1 Recording
Emitter recording should be done in a controlled environment of sufficient size and/or acoustic treatment so that
reflections from the room do not interfere in the recording process. Because the spatial image of the instrument
is constructed from the composite of the recorded emitter points, care should be taken in microphone choice to
prevent the characteristics of one microphone from overly
influencing critical frequency bands and disrupting the spatial image.
As the instrument is theoretically freely rotatable in three dimensions relative to the listener, it is possible for the listener to position themselves in a way that allows emitters to form a line and phase interferences to manifest.
Thus, it is imperative that the positions of microphones
be chosen with the potential for destructive phase interference under consideration. The placement of microphones
should also take into consideration coverage of the instrument (to prevent perceptual gaps) and points of acoustic
interest.
To facilitate perceptually coherent interpolation between the recorded emitters, it is necessary to calibrate the microphones so that their sensitivity and gain staging are matched. If using different microphone models or pre-amplifiers, it is not sufficient to simply match the gains. If the data for the recording equipment is known and reliable, then the gain settings can be derived mathematically. Otherwise, it is advisable to set the gain staging using a noise generator placed equidistant from each microphone so that the recorded input levels are the same.

Figure 1. Processing block example (signal input → emitter gain scale → distance attenuation → directivity filtering → spatial encoding → to soundfield)

2.2 Emitter Interpolation

At near-field distances, interpolation and localization of the emitters will determine the quality of the instrument's sound within the listener's soundfield. The concept of an instrument as a collection of emitters lends itself well to the application of a modified version of Distance-Based Amplitude Panning (DBAP) [5]. DBAP also has the benefit of equalizing the levels of the emitters as the distance between the listener and the emitters increases.

Following the DBAP algorithm, first the distance between the listener and each emitter must be calculated. The radial distance ri of the ith emitter, which has coordinates (xi, yi, zi), from the listener L is found using the usual distance equation. It is additionally preferable to add a radius to the listener, rL. This prevents a single emitter from overloading the entire perceptual image of the instrument at very close distances, and also prevents a potential division-by-zero error. Initial testing has shown that setting rL to a distance approximating the interaural distance yields acceptable perceptual results and serves as a good point of departure for further refinement.

ri = sqrt((xi − xL)^2 + (yi − yL)^2 + (zi − zL)^2 + rL^2)    (1)

For the coefficient a, the distance-ratio damping variable R needs to be defined. Since the methodology assumes that the reverberant field is being handled separately, it can be assumed that R = 6.

a = 10^(−R/20)    (2)

The k coefficient can be calculated using the emitter distances and a.

k = 2a / sqrt( Σ_{i=0}^{N} ri^(−2) )    (3)

This allows the calculation of the volume scalar for each emitter, vi.

vi = k / (2 ri a)    (4)

As can be seen, if rL = 0 and L occupies the same position as emitter i, then ri = 0, causing vi to return an error. In this case the error will need to be accountedted for in code.

2.3 Distance Damping

The distance between the listener and the instrument, rI, can be taken as either the distance to the closest emitter or some appropriate point on the virtual visual asset representing the instrument. This distance can be found using the typical distance equation. The instrument's perceptual size sI can be found by finding the widest angle between any two emitters. If emitters i and j are the two widest emitters, then sI can be found:

sI = |αi − αj|    (5)

Distance damping in a free field is usually given as [6]:

Lp = LW − 20 log10 (r / r0)    (6)

where LW is the sound power level measured near the source (usually at 1 m), r is the listener's distance, and r0 is the reference distance at which LW was measured. The audio recording encodes the LW and r0 information; thus a damping factor for the instrument, vI, in decibels can be found as:

vI = −20 log10 rI    (7)
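As a minimal sketch of the emitter gain stage (Equations 1–4) — not the paper's implementation; the function name, variable names, and the default rL of 0.17 m are assumptions — the computation might look like:

```python
import math

def emitter_gains(emitters, listener, r_listener=0.17, R=6.0):
    """Per-emitter volume scalars vi (Eqs. 1-4).

    emitters: list of (x, y, z) positions; listener: (x, y, z).
    r_listener is rL; a value near the interaural distance keeps a
    single emitter from dominating at very close range and guards the
    division in Eq. 4 when the listener sits exactly on an emitter.
    """
    lx, ly, lz = listener
    # Eq. 1: radial distances, padded by the listener radius rL
    r = [math.sqrt((x - lx) ** 2 + (y - ly) ** 2 + (z - lz) ** 2
                   + r_listener ** 2) for (x, y, z) in emitters]
    a = 10.0 ** (-R / 20.0)                             # Eq. 2
    k = 2.0 * a / math.sqrt(sum(ri ** -2 for ri in r))  # Eq. 3
    return [k / (2.0 * ri * a) for ri in r]             # Eq. 4
```

Note that for a single emitter the scalar reduces to unity regardless of distance, and with rL > 0 the division-by-zero case noted above cannot occur.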
To account for the instrument's divergence from being a point source in the near field, the vI equation can be rewritten using sI to interpolate between a −3 dB per distance doubling in the near field and a −6 dB per distance doubling as the perceptual size approaches a point source:

vI = −(20 − 10·(sI/π)) log10 rI    (8)

It should be noted that Equation 8 assumes that sI is given in radians. If the program in use returns degrees, then sI/π should be replaced with sI/180.
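As an illustrative sketch of Equation 8 (the function name is an assumption; sI is taken in radians):

```python
import math

def size_aware_damping_db(r_instrument, s_instrument):
    """Eq. 8: damping in dB interpolating between -3 dB per distance
    doubling while the source is perceptually large (s_instrument near
    pi) and -6 dB per doubling as it approaches a point source
    (s_instrument near 0). s_instrument is in radians; divide by 180
    instead of pi when working in degrees."""
    return -(20.0 - 10.0 * s_instrument / math.pi) * math.log10(r_instrument)
```

At rI = 2 this yields roughly −6 dB for a point source (sI = 0) and −3 dB for a source spanning π radians.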
2.4 Directivity Factor
The statistical directivity factor for an instrument can be
incorporated into the calculation. This information can be
taken from existing measurements, or new measurements
can be made and used [7]. Applying the directivity index DI is additive with the distance damping [6]; however, it may be desirable for some complex instruments to apply the directivity index only as the listener leaves the near field, in order to preserve the detail of the emitters.
vI = −(20 − 10·(sI/π)) log10 rI + DI·(sI/π)    (9)
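The direction-dependent lookup of DI can be sketched with bilinear interpolation over a table of measured indices. The table layout, the regular step sizes, and the function name below are assumptions for illustration, not specifics from the paper:

```python
def directivity_index(table, az_step, el_step, az, el):
    """Bilinearly interpolate a directivity-index table.

    table[i][j] holds DI (dB) measured at azimuth i*az_step and
    elevation j*el_step (degrees); az and el are assumed >= 0 and
    within the table's range.
    """
    i, j = az / az_step, el / el_step
    i0, j0 = int(i), int(j)
    i1 = min(i0 + 1, len(table) - 1)
    j1 = min(j0 + 1, len(table[0]) - 1)
    fi, fj = i - i0, j - j0
    # Interpolate along elevation at the two bracketing azimuths,
    # then along azimuth between those results.
    top = table[i0][j0] * (1 - fj) + table[i0][j1] * fj
    bot = table[i1][j0] * (1 - fj) + table[i1][j1] * fj
    return top * (1 - fi) + bot * fi
```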
2.5 Frequency Dependence
A more accurate representation of an instrument’s directivity index can be made by applying it to multiple frequency
bands. Including frequency dependence also allows for the inclusion of air absorption by distance for the sound source. Air absorption can be applied by multiplying the radial distance between the listener and the instrument, rI, by the absorption coefficient of air, A. For a specified frequency f, the frequency-specific volume scalar (in decibels) vIf can be found:
vIf = −(20 − 10·(sI/π)) log10 rI + DIf·(sI/π) + rI·Af    (10)

(Be sure that the absorption coefficient used matches the units used for distance.)

Frequency dependence can be applied through the use of a filter bank. The center frequencies of each filter can be set based on the directivity indices of the instrument. The index for any given direction can be calculated for any frequency through bilinear interpolation on a table of indices sorted by azimuth and elevation of rotation.

It is possible to use the filter bank to perform all of the gain adjustments, if applied to each emitter as vif (noting that vi is converted to decibels as 20 log10 vi):

vif = vi − (20 − 10·(sI/π)) log10 rI + DIf·(sI/π) + rI·Af    (11)

Alternatively, vI and vi can be applied as scalars to each emitter, with the filter bank only used to apply the frequency-dependent DIf and rI·Af.

3. PROOF OF CONCEPT

As a proof of concept, three recordings of two different instruments were made using techniques taken from this method. As examples of complex instruments, a piano and a harp were chosen for recording. Each instrument was encoded in a virtual scene that allowed the listener spatial freedom to move around the instrument in the virtual space. An additional virtual scene was created for the piano putting the listener "inside" of the piano, allowing the listener to walk around on the soundboard of the instrument.

The visual scenes were created in Unity. Fmod was used for audio handling, with C# scripting in Unity used to automate parameters passed to Fmod according to the outlined methodology. Spatialization was done using Google Resonance and the Resonance-Fmod integration. Testing was done using both HTC Vive and Oculus Quest head-mounted displays, with external studio headphones.

3.1 Harp

The harp recording was done using an array of true, small-diaphragm omnidirectional microphones. Due to the complexity of the harp's acoustic radiation, Meyer does not provide statistical directivity information [6], and a larger search of the literature was not able to produce broadband measurements. An attempt was made to gain the desired directivity data from the harp using an array of microphones; however, the recording space available was too small and room reflections rendered the captured directivity data unusable.

The main recording microphones were placed near the top and bottom of the column, towards the end of the arch near the top of the soundboard, at the center of the soundboard, and one microphone towards the center of the strings. After testing, it was determined that a higher microphone density would be desirable for future recordings, with additional coverage along the arch, soundboard, and strings. The microphones were placed 6 in from the harp and were gain-matched digitally.
Figure 2. Emitter locations on harp
The emitters were placed in the virtual harp based on the
real microphone locations. The recording was reviewed by multiple trained listeners and the feedback was positive.
One listener, a harpist, commented that the experience of
moving around the instrument mirrored their expectations
gained from interactions with real instruments. The spatial
imaging of the sounds on the harp was well-received, although several perceptual holes and weaknesses were discovered that could be revised in future recordings. The
sensation of “walking away” from the instrument was preserved, although the lack of a reverberant field was commented on by one reviewer.
3.2 Piano
The piano recording was made using a mesh of high-end, true omnidirectional condenser microphones. The microphones were placed 8 in from the strings. Due to the larger
number of microphones used, multiple models of microphone and preamplifier were used, and the calibration process had to be carefully considered. The microphones were
placed around the piano, balancing points of interest within
each area against the desire to have adequate and roughly
even coverage.
The calibration was done digitally, by recording white
noise from a generator into each microphone and setting
the gain digitally (as a matter of logistics). White noise
was used based on the predicted variation in the frequency
response between microphone and preamplifier models.
Due to differences in proportion between the piano used
for the recording and the virtual piano model, the emitter
locations were placed perceptually using an iterative listening process. The final emitter array became more regular
than the original microphone placement.
Figure 3. View inside piano, showing emitter locations

During listening sessions, the effect of the piano at close proximity was effective. Listeners were able to put their heads inside of the piano's lid and hear the resonances throughout the soundboard. Feedback from listeners was positive. However, the recorded material lacked virtuosity and variety of texture, and further listening with different material may yield different results.

Additionally, a modified version of this algorithm was used with the piano recording. The piano was scaled, and the listener was moved to the inside of the instrument, giving the sensation of walking around the soundboard. The emitters were raised slightly during listening, and the radius of the listener, rL, was scaled proportionately to the piano.

While the listener remained inside of the emitter array, the effect was convincing. As the listener moved towards the edges of the mesh, perceptual gaps appeared. This was compensated for by dynamically setting the angular "size" of each emitter within the soundfield based on the angular and radial distance of the other emitters around it. An angular gating method was implemented that would damp the output of emitters that were occluded by closer emitters, based on the angular resolution of the ambisonic layer. These improvements aided, but did not eliminate, the perceptual gaps. Furthermore, the lack of any sound beyond the emitters was apparent as the listener approached or passed the edge of the emitter array.

It was theorized that an additional set of microphones placed at the edges of the piano would aid in extending the boundary past the listener's natural area of motion. This was not implemented or tested due to time constraints.

4. CONCLUSIONS

Preliminary testing for this methodology has been encouraging. While this method is not light-weight, it does have perceptual benefits for critical listening applications within spatially adaptive immersive recordings. Implementation in virtual reality environments can be done using commercially available tools. Future planning for this methodology is focused on the creation of highly detailed audio recordings where the listener is able to freely translate their position relative to the instruments within the virtual space. Additionally, collaborations with composers are being developed to create new immersive musical experiences that explore the placement and movement of sound sources within immersive environments.

Further testing needs to be done with a wider variety of sound sources and materials, and a series of best practices for varying instruments and conditions needs to be developed. Also, the current implementations lacked any significant reverberation and room modeling. Experimentation needs to be done to identify best practices for creating aesthetically desirable reverberant fields as a part of the immersive experience.

Acknowledgments
I would like to thank harpist Annie King and pianist Kelsea
Batson for offering their time and talent to play for the
recordings. I would also like to thank Dr. Michael Pounds
for his technical advice and feedback throughout the process, and Jeff Seitz for offering feedback regarding microphone selection and placement and his aid in securing the
equipment and spaces needed for the recording sessions.
5. REFERENCES
[1] W.-P. Brinkman, A. Hoekstra, and R. van Egmond, "The effect of 3D audio and other audio techniques on virtual reality experience," Studies in Health Technology and Informatics, vol. 219, pp. 44–48, 2015.
[2] W. Zhang, P. Samarasinghe, H. Chen, and T. Abhayapala, “Surround by sound: A review of spatial audio
recording and reproduction,” Applied Sciences, vol. 7,
no. 5, 2017.
[3] T. Kimura and H. Ando, “3d audio system using multiple vertical panning for large-screen multiview 3d
video display,” ITE Transactions on Media Technology
and Applications, vol. 2, no. 1, pp. 33–45, 2014.
[4] J. Janer, E. Gomez, A. Martorell, M. Miron, and
B. de Wit, “Immersive orchestras: Audio processing
for orchestral music vr content,” in 8th International
Conference on Games and Virtual Worlds for Serious
Applications, 2016, Conference Proceedings.
[5] T. Lossius, P. Baltazar, and T. de la Hogue, “Dbap
- distance-based amplitude panning,” in ICMC, 2009,
pp. 489–492.
[6] J. Meyer, Acoustics and the Performance of Music,
5th ed., ser. Modern Acoustics and Signal Processing.
Springer, 2009.
[7] J. Pätynen and T. Lokki, "Directivities of symphony orchestra instruments," Acta Acustica united with Acustica, vol. 96, pp. 138–167, 2010.